Introduction
Recently I dove into scraping data from YouTube using the YouTube API to extract valuable information from various podcast channels in Kenya. What I discovered is that there is a lot of valuable information on the web that can be used to generate insights. Upon making this discovery, I set out to learn what data scraping is and how, as data analysts, we can extract data from the web about a product or just about anything you can think of.
What is web scraping?
Web scraping is the process of extracting valuable information from web pages. This information can be in the form of text, images, or links found on those pages. Web scraping is used for price monitoring, price intelligence, news monitoring, lead generation, and market research.
Tools and libraries needed to scrape data from the web
The following tools and libraries are essential to scrape data from the web:
- Python
Python is a high-level programming language commonly used for web scraping because of its powerful libraries.
- Requests
Requests is a Python library used to make an HTTP request to a specific URL and return the response. To install Requests in a Jupyter notebook:
pip install requests
Example of making a request to a URL and printing the response:
import requests

# Send a GET request and inspect the raw response
url = 'https://en.wikipedia.org/wiki/Motion_Picture_Association_film_rating_system'
response = requests.get(url)
print(response.status_code)  # 200 means the request succeeded
print(response.content)      # the raw HTML of the page
- BeautifulSoup
BeautifulSoup makes it easy to parse HTML and XML documents and extract data from them. To install BeautifulSoup in a Jupyter notebook:
pip install beautifulsoup4
A simple Python program that extracts data using BeautifulSoup:
# Import requests to send a request to the url
# Import beautifulsoup to extract data from html documents
import requests
from bs4 import BeautifulSoup
# Making a GET request
url = 'https://en.wikipedia.org/wiki/Motion_Picture_Association_film_rating_system'
response = requests.get(url)
# Parsing the HTML
soup = BeautifulSoup(response.content, 'html.parser')
# Find all paragraph elements; pass class_='some-class' to filter by a CSS class
paragraphs = soup.find_all('p')
# Loop through each paragraph and print its text
for paragraph in paragraphs:
    print(paragraph.text)
- Scrapy
Scrapy is a free and open-source web crawling framework written in Python. Unlike BeautifulSoup, which only parses HTML, Scrapy handles everything from making requests and parsing to data storage and crawling rules. It allows developers to efficiently crawl web pages and extract the desired information.
To install Scrapy in your Jupyter notebook:
pip install scrapy
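As a rough sketch of how a Scrapy project is structured (the spider name, file name, and output file below are illustrative choices, not required by Scrapy), a minimal spider that collects paragraph text from the same Wikipedia page could look like this:
import scrapy

class ParagraphSpider(scrapy.Spider):
    # 'paragraphs' is an arbitrary spider name used to identify it from the CLI
    name = 'paragraphs'
    start_urls = ['https://en.wikipedia.org/wiki/Motion_Picture_Association_film_rating_system']

    def parse(self, response):
        # Scrapy calls parse() with the downloaded page; yield one item per paragraph
        for text in response.css('p::text').getall():
            yield {'text': text}
Saving this as, say, paragraph_spider.py, you could run it with scrapy runspider paragraph_spider.py -o paragraphs.json, and Scrapy takes care of requesting the page, scheduling, and writing the items to JSON.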
Steps to follow during web scraping
- Identify the website to extract data from and ensure it has the data you need for your analysis.
- Inspect the web page: open the page, right-click, and choose Inspect to examine the HTML structure and locate the data you need.
- Make an HTTP request to the web page using the Requests library.
- Parse the HTML content with BeautifulSoup to find the data you need.
- Once you have located the data, extract it and store it in a suitable format such as CSV or JSON (see the end-to-end sketch after this list).
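Putting these steps together, here is a minimal end-to-end sketch (the output file name paragraphs.csv is just an example) that requests the Wikipedia page used earlier, parses it, and stores the paragraph text as CSV:
import csv
import requests
from bs4 import BeautifulSoup

# Step 3: make the HTTP request
url = 'https://en.wikipedia.org/wiki/Motion_Picture_Association_film_rating_system'
response = requests.get(url)

# Step 4: parse the HTML and locate the data
soup = BeautifulSoup(response.content, 'html.parser')
paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]

# Step 5: store the extracted data as CSV
with open('paragraphs.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['text'])
    for text in paragraphs:
        if text:  # skip empty paragraphs
            writer.writerow([text])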
Web scraping best practices
- Avoid making too many requests in a short time, as this may overload the server and make the website slow for other users.
- Write reusable functions to make your code more readable and easier to maintain.
- Handle errors gracefully, such as failed requests or missing data.
- When making requests, include a User-Agent header to mimic a regular browser and avoid being blocked (see the sketch after this list).
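A small sketch of how these practices might look in code (the URL list, User-Agent string, and one-second delay are illustrative assumptions):
import time
import requests

# Illustrative list of pages to scrape; replace with your own URLs
urls = [
    'https://en.wikipedia.org/wiki/Motion_Picture_Association_film_rating_system',
]

# A User-Agent header identifying the client
headers = {'User-Agent': 'Mozilla/5.0 (compatible; my-scraper/1.0)'}

for url in urls:
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise an exception for 4xx/5xx responses
        print(url, '->', len(response.content), 'bytes')
    except requests.RequestException as error:
        # Handle network failures and HTTP errors without crashing the scraper
        print(f'Failed to fetch {url}: {error}')
    time.sleep(1)  # pause between requests so we do not overload the server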
Conclusion
Web scraping can be used to extract a wealth of information from the web. Be sure to scrape data ethically and responsibly by following the practices outlined above. Happy scraping!
Check out my recent project on LinkedIn, where I used the YouTube API to extract data on podcast channels in Kenya.
Let's connect on LinkedIn!