Introduction
Web Scraping - An automated way to retrieve unstructured or semi-structured data from a website and store it in a structured format. Web scrapers can extract all the data on a particular site or only the specific data a user wants. Ideally, you specify exactly the data you need so the scraper extracts only that data, and does so quickly.
Challenges of web scraping
- It is more complex than other ways of getting data, such as APIs
- It can be fragile - web page designs change frequently
- Ethical and legal risks
- Reality - websites are getting better at detecting and blocking scrapers ("Are you a robot?" checks)
Benefits of web scraping
- Can be run iteratively over many web pages
- Some websites have thousands or millions of pages
- Can construct large, robust data sets from messy text that would otherwise only appear in your web browser
Basics of Web Scraping
- 1. Finding the address - URL or URLs
- 2. Sending HTTP requests to the server
- 3. Parse the response
- 4. Store the results (a minimal end-to-end sketch follows this list)
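A minimal sketch of these four steps, assuming the third-party `requests` and `beautifulsoup4` packages and the quotes.toscrape.com practice site listed at the end of this section; the CSS class names (`quote`, `text`, `author`) are specific to that site.

```python
import csv

import requests
from bs4 import BeautifulSoup

# 1. Finding the address
url = "https://quotes.toscrape.com/"

# 2. Sending an HTTP request to the server
response = requests.get(url, timeout=10)
response.raise_for_status()

# 3. Parsing the response
soup = BeautifulSoup(response.text, "html.parser")
rows = []
for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").get_text(strip=True)
    author = quote.find("small", class_="author").get_text(strip=True)
    rows.append({"text": text, "author": author})

# 4. Storing the results
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "author"])
    writer.writeheader()
    writer.writerows(rows)
```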
Important Steps to follow
- 1. Identify the URL patterns
- 2. Inspect the source code and locate data elements
- 3. Think about the logical flow of your crawler (see the sketch after this list)
- 4. Persistence and creativity are often required to collect valuable data
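A sketch of steps 1 and 3 together: on quotes.toscrape.com the URL pattern is simple pagination (`/page/N/`), and one reasonable crawler flow is to follow each page's "Next" link until none remains. The `li.next > a` selector is specific to that practice site.

```python
import requests
from bs4 import BeautifulSoup

base = "https://quotes.toscrape.com"
path = "/page/1/"
while path:
    response = requests.get(base + path, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    quotes = soup.find_all("div", class_="quote")
    print(f"{base + path}: {len(quotes)} quotes")
    # The "Next" pagination link on this site lives in <li class="next"><a href="/page/N/">
    next_link = soup.select_one("li.next > a")
    path = next_link["href"] if next_link else None
```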
Techniques covered in the 3 parts of web scraping
- Techniques for identifying URL patterns
- How to inspect HTML in a browser
- How to request HTML with Python
- Techniques for finding the data
- Techniques for parsing HTML
- BeautifulSoup (a short parsing sketch follows this list)
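A short sketch of BeautifulSoup's main parsing calls (`find`, `find_all`, `select`) on an inline HTML fragment; the fragment and its numbers are invented placeholders for illustration, not scraped data.

```python
from bs4 import BeautifulSoup

html = """
<table id="cities">
  <tr><td class="name">Indianapolis</td><td class="pop">880,000</td></tr>
  <tr><td class="name">Fort Wayne</td><td class="pop">260,000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; find_all() returns every match
first_city = soup.find("td", class_="name").get_text()

# select() accepts CSS selectors, just like the browser's inspector
populations = [td.get_text() for td in soup.select("#cities td.pop")]

print(first_city, populations)  # Indianapolis ['880,000', '260,000']
```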
The website URLs we will scrape data from
- https://www.indiana-demographics.com/cities_by_population
- https://www.getcompanyinfo.com/industry/information-technology-services/
- https://quotes.toscrape.com/