GitHub repository here

Introduction

Web Scraping - an automated way to retrieve unstructured or semi-structured data from a website and store it in a structured format. Web scrapers can extract all the data on a particular site or only the specific data a user wants. Ideally, you should specify exactly which data you need so that the scraper extracts only that data, and does so quickly.

Challenges of web scraping

  • It is more complex than other ways of getting data, such as APIs
  • It can be fragile - frequently changing web designs break scrapers
  • It carries ethical and legal risks
  • In practice, websites are getting better at defending against scrapers - think "Are you a robot?" checks

Benefits of web scraping

  • Can be run iteratively over many web pages
  • Some websites have thousands or millions of pages
  • Can construct large, robust data sets out of otherwise messy text that would only appear in your web browser.

Basics of Web Scraping

  • 1. Find the address - the URL or URLs
  • 2. Send HTTP requests to the server
  • 3. Parse the response
  • 4. Store the results (a minimal end-to-end sketch follows this list)
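
As a concrete illustration of all four steps, here is a minimal sketch using the requests and BeautifulSoup libraries against the quotes site listed at the end of this post. The CSS class names (quote, text, author) are assumptions about that site's markup; verify them in your browser before relying on them.

```python
import csv

import requests
from bs4 import BeautifulSoup

# 1. The address we want to scrape (one of the URLs listed at the end of this post)
url = "https://quotes.toscrape.com/"

# 2. Send an HTTP GET request to the server
response = requests.get(url)
response.raise_for_status()  # fail fast on 4xx/5xx errors

# 3. Parse the returned HTML
soup = BeautifulSoup(response.text, "html.parser")
quotes = soup.find_all("div", class_="quote")  # assumed: quotes live in <div class="quote">

# 4. Store the results in a structured format (CSV here)
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "author"])
    for quote in quotes:
        text = quote.find("span", class_="text").get_text(strip=True)
        author = quote.find("small", class_="author").get_text(strip=True)
        writer.writerow([text, author])
```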

Important Steps to follow

  • 1. Identify the URL patterns (see the sketch after this list)
  • 2. Inspect the source code and locate the data elements
  • 3. Think through the logical flow of your crawler
  • 4. Persistence and creativity are often required to collect valuable data
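
For step 1, many sites encode the page number directly in the URL, so a crawler can simply loop over a template. Below is a minimal sketch assuming the quotes site paginates as /page/N/ (confirm the pattern by clicking "Next" in your browser and watching the address bar).

```python
import requests

# Assumed pattern: quotes.toscrape.com serves page N at /page/N/
base_url = "https://quotes.toscrape.com/page/{}/"

for page in range(1, 6):  # first five pages only; a real crawler needs a proper stop condition
    url = base_url.format(page)
    response = requests.get(url)
    if response.status_code != 200:
        break  # the pattern ran out of pages, stop crawling
    print(f"Fetched {url} ({len(response.text)} bytes)")
```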

Techniques covered in the 3 parts of this web scraping series

  • Techniques for identifying URL patterns
  • How to inspect HTML in a browser
  • How to request HTML with Python
  • Techniques for finding the data
  • Techniques for parsing HTML
  • BeautifulSoup (a small parsing sketch follows this list)
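
As a preview of the parsing techniques, here is a small, self-contained BeautifulSoup sketch. The HTML snippet is invented for illustration; the find(), find_all(), and select() calls are the core tools the later parts build on.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a page requested with Python
html = """
<div class="quote">
  <span class="text">"A witty saying proves nothing."</span>
  <small class="author">Voltaire</small>
  <a class="tag" href="/tag/wit/">wit</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match; find_all() returns every match
quote = soup.find("div", class_="quote")
print(quote.find("span", class_="text").get_text(strip=True))

# select() takes CSS selectors, often the quickest way to reach nested data
for tag in soup.select("div.quote a.tag"):
    print(tag.get_text(), "->", tag["href"])  # link text and its href attribute
```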

The website URLs we will scrape data from

  • https://www.indiana-demographics.com/cities_by_population
  • https://www.getcompanyinfo.com/industry/information-technology-services/
  • https://quotes.toscrape.com/