GitHub repository here

Introduction

Web Scraping - an automated way to retrieve unstructured or semi-structured data from a website and store it in a structured format. Web scrapers can extract all the data on a particular site or only the specific data a user wants. Ideally, you should specify exactly which data you need so that the scraper extracts only that data, and does so quickly.

Challenges of web scraping

  • It is more complex than other ways of getting data, such as APIs
  • It can be fragile - frequently changing web designs break scrapers
  • It carries ethical and legal risks
  • In practice, websites are getting better at defending against scrapers - think "Are you a robot?" checks

Benefits of web scraping

  • Can be run iteratively over many web pages
  • Some websites have thousands or millions of pages
  • Can construct large, robust data sets out of otherwise messy text that would only appear in your web browser.

Basics of Web Scraping

  • 1. Find the address - the URL or URLs
  • 2. Send HTTP requests to the server
  • 3. Parse the response
  • 4. Store the results (a minimal end-to-end sketch follows this list)
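
As a concrete illustration of all four steps, here is a minimal sketch using the requests and BeautifulSoup libraries against the quotes site listed at the end of this post. The CSS class names (quote, text, author) are assumptions about that site's markup; verify them in your browser before relying on them.

```python
import csv

import requests
from bs4 import BeautifulSoup

# 1. The address we want to scrape (one of the URLs listed at the end of this post)
url = "https://quotes.toscrape.com/"

# 2. Send an HTTP GET request to the server
response = requests.get(url)
response.raise_for_status()  # fail fast on 4xx/5xx errors

# 3. Parse the returned HTML
soup = BeautifulSoup(response.text, "html.parser")
quotes = soup.find_all("div", class_="quote")  # assumed: quotes live in <div class="quote">

# 4. Store the results in a structured format (CSV here)
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "author"])
    for quote in quotes:
        text = quote.find("span", class_="text").get_text(strip=True)
        author = quote.find("small", class_="author").get_text(strip=True)
        writer.writerow([text, author])
```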

Important Steps to follow

  • 1. Identify the URL patterns (see the sketch after this list)
  • 2. Inspect the source code and locate the data elements
  • 3. Think through the logical flow of your crawler
  • 4. Persistence and creativity are often required to collect valuable data
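
For step 1, many sites encode the page number directly in the URL, so a crawler can simply loop over a template. Below is a minimal sketch assuming the quotes site paginates as /page/N/ (confirm the pattern by clicking "Next" in your browser and watching the address bar).

```python
import requests

# Assumed pattern: quotes.toscrape.com serves page N at /page/N/
base_url = "https://quotes.toscrape.com/page/{}/"

for page in range(1, 6):  # first five pages only; a real crawler needs a proper stop condition
    url = base_url.format(page)
    response = requests.get(url)
    if response.status_code != 200:
        break  # the pattern ran out of pages, stop crawling
    print(f"Fetched {url} ({len(response.text)} bytes)")
```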

Techniques covered in the 3 parts of this web scraping series

  • Techniques for identifying URL patterns
  • How to inspect HTML in a browser
  • How to request HTML with Python
  • Techniques for finding the data
  • Techniques for parsing HTML
  • BeautifulSoup (a small parsing sketch follows this list)
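
As a preview of the parsing techniques, here is a small, self-contained BeautifulSoup sketch. The HTML snippet is invented for illustration; the find(), find_all(), and select() calls are the core tools the later parts build on.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a page requested with Python
html = """
<div class="quote">
  <span class="text">"A witty saying proves nothing."</span>
  <small class="author">Voltaire</small>
  <a class="tag" href="/tag/wit/">wit</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match; find_all() returns every match
quote = soup.find("div", class_="quote")
print(quote.find("span", class_="text").get_text(strip=True))

# select() takes CSS selectors, often the quickest way to reach nested data
for tag in soup.select("div.quote a.tag"):
    print(tag.get_text(), "->", tag["href"])  # link text and its href attribute
```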

The website URLs we will scrape data from

  • https://www.indiana-demographics.com/cities_by_population
  • https://www.getcompanyinfo.com/industry/information-technology-services/
  • https://quotes.toscrape.com/