Introduction
Recently I dove into scraping data from YouTube using the YouTube API to extract valuable information from various podcast channels in Kenya. What I discovered is that there is a lot of valuable information on the web that can be used to generate insights. Upon making this discovery, I set out to learn what data scraping is and how, as data analysts, we can extract data from the web about a product or just about anything you can think of.
What is web scraping?
Web scraping is the process of extracting valuable information from web pages. This information can be in the form of text, images, or links found on those pages. Web scraping is used for price monitoring, price intelligence, news monitoring, lead generation, and market research.
Tools and libraries needed to scrape data from the web
The following tools and libraries are essential to scrape data from the web:
- Python
Python is a high-level programming language commonly used for web scraping because of its powerful libraries.
- Requests
Requests is a Python library used to make an HTTP request to a specific URL and return the response. To install Requests in a Jupyter notebook:
pip install requests
Example of making a request to a URL and printing the response:
import requests

# Send a GET request and inspect the raw response
url = 'https://en.wikipedia.org/wiki/Motion_Picture_Association_film_rating_system'
response = requests.get(url)
print(response.status_code)  # 200 means the request succeeded
print(response.content)      # the raw HTML of the page
- BeautifulSoup
BeautifulSoup makes it easy to parse HTML and XML documents and extract data from them. To install BeautifulSoup in a Jupyter notebook:
pip install beautifulsoup4
A simple Python program that extracts data using BeautifulSoup:
# Import requests to send a request to the url
# Import beautifulsoup to extract data from html documents
import requests
from bs4 import BeautifulSoup
# Making a GET request
url = 'https://en.wikipedia.org/wiki/Motion_Picture_Association_film_rating_system'
response = requests.get(url)
# Parsing the HTML
soup = BeautifulSoup(response.content, 'html.parser')
# Find all paragraph elements; pass class_='some-class' to filter by a CSS class
paragraphs = soup.find_all('p')
# Loop through each paragraph and print its text
for paragraph in paragraphs:
    print(paragraph.text)
- Scrapy
Scrapy is a free and open-source web crawling framework written in Python. Unlike BeautifulSoup, which only parses HTML, Scrapy handles everything from making requests and parsing to data storage and crawling rules. It allows developers to efficiently crawl web pages and extract the desired information.
To install Scrapy in your Jupyter notebook:
pip install scrapy
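As a rough sketch of how a Scrapy project is structured (the spider name, file name, and output file below are illustrative choices, not required by Scrapy), a minimal spider that collects paragraph text from the same Wikipedia page could look like this:
import scrapy

class ParagraphSpider(scrapy.Spider):
    # 'paragraphs' is an arbitrary spider name used to identify it from the CLI
    name = 'paragraphs'
    start_urls = ['https://en.wikipedia.org/wiki/Motion_Picture_Association_film_rating_system']

    def parse(self, response):
        # Scrapy calls parse() with the downloaded page; yield one item per paragraph
        for text in response.css('p::text').getall():
            yield {'text': text}
Saving this as, say, paragraph_spider.py, you could run it with scrapy runspider paragraph_spider.py -o paragraphs.json, and Scrapy takes care of requesting the page, scheduling, and writing the items to JSON.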
Steps to follow during web scraping
- Identify the website to extract data from and ensure it has the data you need for your analysis.
- Inspect the web page: open the page, right-click, and choose Inspect to examine the HTML structure and locate the data you need.
- Make an HTTP request to the web page using the Requests library.
- Parse the HTML content with BeautifulSoup to find the data you need.
- Once you have located the data, extract it and store it in a suitable format such as CSV or JSON (see the end-to-end sketch after this list).
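Putting these steps together, here is a minimal end-to-end sketch (the output file name paragraphs.csv is just an example) that requests the Wikipedia page used earlier, parses it, and stores the paragraph text as CSV:
import csv
import requests
from bs4 import BeautifulSoup

# Step 3: make the HTTP request
url = 'https://en.wikipedia.org/wiki/Motion_Picture_Association_film_rating_system'
response = requests.get(url)

# Step 4: parse the HTML and locate the data
soup = BeautifulSoup(response.content, 'html.parser')
paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]

# Step 5: store the extracted data as CSV
with open('paragraphs.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['text'])
    for text in paragraphs:
        if text:  # skip empty paragraphs
            writer.writerow([text])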
Web scraping best practices
- Avoid making too many requests in a short time, as this may overload the server and make the website slow for other users.
- Write reusable functions to make your code more readable and easier to maintain.
- Handle errors gracefully, such as failed requests or missing data.
- When making requests, include a User-Agent header to mimic a regular browser and avoid being blocked (see the sketch after this list).
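A small sketch of how these practices might look in code (the URL list, User-Agent string, and one-second delay are illustrative assumptions):
import time
import requests

# Illustrative list of pages to scrape; replace with your own URLs
urls = [
    'https://en.wikipedia.org/wiki/Motion_Picture_Association_film_rating_system',
]

# A User-Agent header identifying the client
headers = {'User-Agent': 'Mozilla/5.0 (compatible; my-scraper/1.0)'}

for url in urls:
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise an exception for 4xx/5xx responses
        print(url, '->', len(response.content), 'bytes')
    except requests.RequestException as error:
        # Handle network failures and HTTP errors without crashing the scraper
        print(f'Failed to fetch {url}: {error}')
    time.sleep(1)  # pause between requests so we do not overload the server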
Conclusion
Web scraping can be used to extract a wealth of information from the web. Be sure to scrape data ethically and responsibly by following the practices outlined above. Happy scraping!
Check out my recent project on LinkedIn, where I used the YouTube API to extract data on podcast channels in Kenya.
Let's connect on LinkedIn!