Scrape Wikipedia in Python - Ultimate Tutorial

Crawlbase - Apr 2 - Dev Community

This blog was originally posted to Crawlbase Blog

In this guide, we'll be scraping Wikipedia, the Internet's largest encyclopedia. Whether you're an academic researcher, content creator, data scientist, or simply curious about how to build a Wikipedia scraper, this tutorial is for you.

We'll provide you with step-by-step instructions on how to extract data from Wikipedia using Python, a popular programming language known for its simplicity and versatility. Additionally, we'll introduce you to Crawlbase's Crawling API, a powerful tool that streamlines the scraping process and makes it even more efficient.

By the end of this tutorial, you'll have the skills and knowledge to scrape Wikipedia and gather information from its articles with ease for all sorts of projects. Let’s begin.

Table of Contents

I. What are the Best Ways to Scrape Data from Wikipedia?

II. Project Scope

III. Prerequisites

IV. Installing Dependencies

V. How to Extract HTML from Wikipedia

  • Step 1: Importing the Crawlbase Library
  • Step 2: Defining the crawl Function
  • Step 3: Initializing the CrawlingAPI Object
  • Step 4: Making a GET Request
  • Step 5: Checking the Response Status Code
  • Step 6: Extracting and Printing the HTML Content
  • Step 7: Handling Errors
  • Step 8: Main Function
  • Step 9: Specifying Wikipedia Page URL and API Token
  • Step 10: Calling the crawl Function

VI. How to Scrape Wikipedia

  • Step 1: Importing Libraries
  • Step 2: Initializing the Crawling API
  • Step 3: Scraping the Wikipedia Page Title
  • Step 4: Scraping Nickname property
  • Step 5: Scraping Citizenship property
  • Step 6: Scraping Image
  • Step 7: Scraping Profile Title
  • Step 8: Complete the Code

VII. How to Scrape a Table from Wikipedia

  • Step 1: Importing Libraries
  • Step 2: Scraping Table from Wikipedia pages
  • Step 3: Initializing the Crawling API
  • Step 4: Execute the Code to Save in CSV file

VIII. Conclusion

IX. Frequently Asked Questions

I. What are the Best Ways to Scrape Data from Wikipedia?

Wikipedia, founded in 2001, is a free online encyclopedia maintained by volunteers globally. With close to 4.3 billion unique global visitors in December 2023, it serves as a vital resource for information on diverse topics. Operated by the Wikimedia Foundation, Wikipedia allows open collaboration and democratized knowledge-sharing, making it an important platform for users worldwide.

[Image: Wikipedia visits 2024 (source)]

Wikipedia scraping can be achieved through various methods, each with its own advantages. Here are some effective approaches:

  1. APIs: Wikipedia provides the MediaWiki API, which allows programmatic access to a vast amount of Wikipedia content. While this API allows for retrieving specific datasets, it has its limitations. In this tutorial, we will explore another type of API, the Crawling API, which promises greater efficiency and flexibility in data extraction (a minimal sketch of querying the MediaWiki API follows this list).
  2. Web Scraping: Using libraries like BeautifulSoup (for Python), Scrapy, or Cheerio (for Node.js), you can scrape Wikipedia pages directly. Be careful to comply with Wikipedia's robots.txt file and avoid overloading their servers with excessive requests. Later in this blog, we will combine BeautifulSoup with the Crawling API to achieve the most effective way to scrape Wikipedia pages.
  3. Database Dumps: Wikipedia periodically releases complete database dumps containing the entirety of its content. Although these dumps can be large and require specialized software for parsing, they offer comprehensive access to Wikipedia data.
  4. Third-Party Datasets: There are third-party datasets available that have already extracted and formatted data from Wikipedia. These datasets can be beneficial for specific use cases, but it's essential to verify their accuracy and reliability.
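
For reference, here is a minimal sketch of the first approach: querying the MediaWiki API directly with the requests library. The endpoint and the "query"/"extracts" parameters are standard MediaWiki API usage; the article title is only an example.

import requests

# Fetch the plain-text extract of one article via the MediaWiki API.
params = {
    'action': 'query',
    'prop': 'extracts',
    'explaintext': 1,
    'titles': 'Elon Musk',
    'format': 'json',
}
response = requests.get('https://en.wikipedia.org/w/api.php', params=params)
pages = response.json()['query']['pages']

for page_id, page in pages.items():
    print(page['title'])
    print(page.get('extract', '')[:500])  # first 500 characters of the article text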

When utilizing scraped data, it's important to attribute the source appropriately. Additionally, consider the impact of your scraping activities on Wikipedia's servers and ensure compliance with their usage policies.

II. Project Scope

Our aim in this tutorial is to develop a Python-based Wikipedia scraper to extract data from Wikipedia pages. Our project scope includes utilizing the BeautifulSoup library for HTML parsing and integrating the Crawlbase Crawling API for efficient data extraction. We'll cover scraping various elements such as page titles, nicknames, citizenship details, images, profile titles, and tabular data from Wikipedia pages.

The key components of this project include:

  1. HTML crawling: We'll utilize Python with Crawlbase Crawling API to extract the complete HTML content from Wikipedia pages, ensuring efficient data retrieval while respecting Wikipedia's usage policies.

We will be targeting this URL for this project: https://en.wikipedia.org/wiki/Elon_Musk

[Image: Elon Musk Wikipedia page]

  2. Wikipedia Scraping: Our focus will be on extracting specific data elements from Wikipedia pages with the help of BeautifulSoup (Python), including titles, nicknames, citizenship details, images, and profile titles.

  3. Web Scraping Wikipedia Tables: We'll discuss techniques for scraping tabular data from Wikipedia pages, identifying tables using BeautifulSoup selectors, and saving the extracted data to a CSV file.

This URL will be our example for that part of the project: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area

[Image: Wikipedia table to scrape]

  4. Error Handling: We'll address error handling during the scraping process, providing guidelines and troubleshooting tips for common issues encountered during data extraction.

Now that this is clear, let’s continue with the prerequisites of the project.

III. Prerequisites

Before diving head first into our web scraping project to scrape Wikipedia using Python, there are a few prerequisites you should take into account:

  1. Basic Knowledge of Python: Familiarity with Python programming language is essential for understanding the code examples and concepts presented in this tutorial. If you're new to Python, consider going through introductory Python tutorials or courses to grasp the fundamentals.

  2. Active Crawlbase API Account with API Credentials: To access Wikipedia pages programmatically and efficiently, you'll need an active account on Crawlbase and your API credentials.

Start by signing up for the Crawlbase Crawling API to get your free 1,000 requests and obtain your API credentials from the account documentation. For this particular project, we will utilize the normal request token.

[Image: Crawlbase account documentation]

  3. Python Installed on Your Development Machine: Make sure that Python is installed on your local development machine. You can download and install Python from the official Python website based on your operating system. Additionally, ensure that you have pip (Python package manager) installed, which is usually included with Python installations by default.

Once you have fulfilled these prerequisites, you'll be ready to start on the journey of web scraping Wikipedia pages using Python. In the next sections, we'll guide you through the process step by step, from setting up the environment to extracting and handling data effectively. Let's proceed with the project!

IV. Installing Dependencies

If you haven’t already installed Python on your system, you can download and install the latest version from the official Python website.

Packages Installation:

  1. Open your Command Prompt or Terminal:
  • For Windows users: Press Win + R, type cmd, and press Enter.
  • For macOS/Linux users: Open the Terminal application.
  2. Create a Directory for Your Python Scraper:
  • Navigate to the directory where you want to store your Python scraper project.
  3. Install Necessary Libraries:
  • Use the following commands to install the required Python libraries:
pip install crawlbase
pip install beautifulsoup4
pip install pandas

About the Libraries:

  • Pandas: Pandas is a popular open-source Python library used for data manipulation and analysis. It provides powerful data structures and functions designed to make working with structured or tabular data seamless and efficient. The primary data structure in Pandas is the DataFrame, a two-dimensional labeled data structure similar to a spreadsheet or SQL table (a minimal DataFrame example follows this list).
  • Crawlbase: The Crawlbase Python library simplifies the integration of the Crawling API, which we'll utilize to fetch HTML content from websites, including Wikipedia pages. By utilizing the Crawlbase Python library, we can streamline the process of retrieving data, enabling efficient web scraping operations for our project.
  • BeautifulSoup: BeautifulSoup is a popular Python library for web scraping and parsing HTML and XML documents. It provides a convenient way to extract data from web pages by navigating the HTML structure and searching for specific tags, attributes, or text content.
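
As a quick illustration of the DataFrame described above, the sketch below builds a tiny table and writes it to a CSV file, the same pattern the scraper will use later (the column names and values here are made up for the example):

import pandas as pd

# A DataFrame is a two-dimensional, labeled table of rows and columns.
data = {'name': ['Ada', 'Grace'], 'field': ['Mathematics', 'Computer Science']}
df = pd.DataFrame(data)
print(df)

# Persist the table to a CSV file, omitting the numeric row index.
df.to_csv('example.csv', index=False)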

Setting Up Your Development Environment:

To begin coding your web scraper, open your preferred text editor or integrated development environment (IDE). You can use any IDE of your choice, such as PyCharm, VS Code, or Jupyter Notebook.

Create a New Python File: Open your IDE and create a new Python file. Let's name it wikipedia_scraper.py for this example. This file will serve as our scraper script.

Now that you have Python installed and the necessary libraries set up, you're ready to start coding your web scraper to extract data from Wikipedia pages. Let's proceed with building the scraper script!

V. How to Extract HTML from Wikipedia

To extract the complete HTML source code from a Wikipedia webpage, we'll utilize the Crawlbase package to make a request and retrieve the HTML content. Below are the steps to write the code:

Step 1: Importing the Crawlbase Library

from crawlbase import CrawlingAPI
  • This line imports the CrawlingAPI class from the crawlbase library, which allows us to make requests to the Crawlbase API for web crawling and scraping tasks.

Step 2: Defining the crawl Function

def crawl(page_url, api_token):
  • This function, named crawl, takes two parameters: page_url (the URL of the Wikipedia page to scrape) and api_token (the API token required for accessing the Crawlbase API).

Step 3: Initializing the CrawlingAPI Object

api = CrawlingAPI({'token': api_token})
  • Here, we create an instance of the CrawlingAPI class, passing the API token as a parameter to authenticate our request to the Crawlbase API.

Step 4: Making a GET Request

response = api.get(page_url)
  • This line sends a GET request to the specified page_url using the get method of the api object. It retrieves the HTML content of the Wikipedia page.

Step 5: Checking the Response Status Code

if response['status_code'] == 200:
  • We check if the response status code is 200, indicating that the request was successful and the HTML content was retrieved without any errors.

Step 6: Extracting and Printing the HTML Content

print(f'{response["body"]}')
  • If the request was successful, we print the HTML content of the Wikipedia page. The HTML content is stored in the response dictionary under the key 'body'.

Step 7: Handling Errors

else:
  print(f"Error: {response}")
  • If the request fails (i.e., if the response status code is not 200), we print an error message along with the response details.

Step 8: Main Function

if __name__ == "__main__":
  • This block of code ensures that the following code is executed only if the script is run directly (not imported as a module).

Step 9: Specifying Wikipedia Page URL and API Token

page_url = 'https://en.wikipedia.org/wiki/Elon_Musk'
api_token = 'Crawlbase_Token'
  • Here, we specify the URL of the Wikipedia page we want to scrape (page_url) and the API token required for accessing the Crawlbase API (api_token). Make sure to replace Crawlbase_Token with your actual Crawling API normal request token.

Step 10: Calling the crawl Function

crawl(page_url, api_token)
  • Finally, we call the crawl function, passing the page_url and api_token as arguments to initiate the scraping process.

Here is the complete Python code snippet you can copy and paste to your wikipedia_scraper.py file:

from crawlbase import CrawlingAPI

def crawl(page_url, api_token):
 # Initialize the CrawlingAPI object with your token
 api = CrawlingAPI({'token': api_token})

 # Get the page content
 response = api.get(page_url)

 # Check if the request was successful
 if response['status_code'] == 200:
  # Extract data
  print(f'{response["body"]}')
 else:
  print(f"Error: {response}")

if __name__ == "__main__":
 # Specify the Wikipedia page URL to scrape
 page_url = 'https://en.wikipedia.org/wiki/Elon_Musk'

 # Specify your Crawlbase API token
 api_token = 'Crawlbase_Token'

 # Call the crawl function
 crawl(page_url, api_token)

Now let’s execute the code snippet above. Run it with the following command:

python wikipedia_scraper.py

Upon execution, the script will make a request to the specified Wikipedia page URL ('https://en.wikipedia.org/wiki/Elon_Musk') using the Crawlbase API token provided. If the request is successful, it will print the HTML content of the page. Otherwise, it will display an error message.

The output will be the HTML content of the Wikipedia page, which can then be parsed and processed further to extract specific data of interest.
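
If you would rather inspect the returned HTML in a browser or editor than in the terminal, a small variation of the print step can write it to a file. This is a sketch that assumes the response body comes back as bytes; if your version of the library returns a string, open the file in text mode instead.

# Inside the success branch of crawl(), instead of printing the body:
with open('wikipedia_page.html', 'wb') as f:
    f.write(response['body'])  # assumes the body is bytes
print('HTML saved to wikipedia_page.html')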

[Image: Crawled HTML from Wikipedia]

VI. How to Scrape Wikipedia

To scrape content from Wikipedia using BeautifulSoup, we'll need to import the necessary libraries and define functions to handle the scraping process. Here's how we can proceed:

Step 1: Importing Libraries

We'll need to import the CrawlingAPI class from the crawlbase module for making requests to the Crawlbase API. Additionally, we'll import BeautifulSoup from the bs4 (BeautifulSoup) module for parsing HTML, and pandas for creating and manipulating data frames.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import pandas as pd

Step 2: Initializing the Crawling API

We'll define a function named crawl to initialize the CrawlingAPI object, fetch the webpage content using the API, and handle the scraping process.

We will also define a function named scrape_data to extract specific information from the HTML content of the webpage using BeautifulSoup.

Finally, we'll specify the Wikipedia page URL to scrape, the Crawlbase API token, and call the crawl function to initiate the scraping process.

def crawl(page_url, api_token):
 # Initialize the CrawlingAPI object with your token
 api = CrawlingAPI({'token': api_token})

 # Get the page content
 response = api.get(page_url)

 # Check if the request was successful
 if response['status_code'] == 200:
  # Scrape data
  scraped_data = scrape_data(response)

  csv_filename = 'wikipedia.csv'
  df = pd.DataFrame([scraped_data])
  df.to_csv(csv_filename, index=False)
 else:
  print(f"Error: {response}")

if __name__ == "__main__":
 # Specify the Wikipedia page URL to scrape
 page_url = 'https://en.wikipedia.org/wiki/Elon_Musk'

 # Specify your Crawlbase API token
 api_token = 'Crawlbase_Token'

 # Call the crawl function
 crawl(page_url, api_token)

Step 3: How to Scrape Wikipedia Page Title

In the HTML source code, locate the section or container that represents the title. This typically involves inspecting the structure of the webpage using browser developer tools or viewing the page source.

[Image: Inspecting the Wikipedia page title]

Utilize BeautifulSoup selectors to target the title element within the webpage. This involves specifying the appropriate class that matches the desired element.

title = soup.find('h1', id='firstHeading').get_text(strip=True)

This line of code finds the first occurrence of the <h1> element with the ID "firstHeading" in the HTML content stored in the variable soup. It then extracts the text inside that element and removes any extra spaces or newlines using the get_text(strip=True) method. Finally, it assigns the cleaned text to the variable title.
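
Note that find() returns None when the element is missing, and calling get_text() on None raises an AttributeError. If you want the scraper to degrade gracefully on pages that lack a given element, a guarded version like the following sketch can be used (the 'N/A' fallback is just an illustration):

# Guard against pages where the heading is missing or renamed.
title_element = soup.find('h1', id='firstHeading')
title = title_element.get_text(strip=True) if title_element else 'N/A'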

Step 4: How to Scrape Wikipedia Nickname property

[Image: Inspecting the Wikipedia nickname property]

nick_name = soup.find('div', class_='nickname').get_text(strip=True)

This code finds the first <div> element with the CSS class "nickname" within the parsed HTML document represented by the variable soup. It then extracts the text content of this element, strips any surrounding whitespace, and assigns it to the variable nick_name.

Step 5: How to Scrape Wikipedia Citizenship property

[Image: Inspecting the Wikipedia citizenship property]

citizenship = soup.select_one('.infobox-data .plainlist').text.strip().replace('\n', ', ')

The above code selects the first element with the class "plainlist" that is nested inside an element with the class "infobox-data" in the parsed HTML document represented by the variable soup, and extracts its text content. Then, it removes any leading or trailing whitespace and replaces newline characters with commas. The resulting text represents the citizenship information and is assigned to the variable citizenship.
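
If you would rather keep the individual citizenship entries as a Python list instead of one comma-joined string, you can select the list items directly. This is an alternative sketch, not part of the final script:

# Collect each <li> inside the citizenship plainlist as a separate string.
citizenship_items = [
    li.get_text(strip=True)
    for li in soup.select('.infobox-data .plainlist li')
]
print(citizenship_items)  # one entry per list item in the infobox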

Step 6: How to Scrape Wikipedia Image

[Image: Inspecting the Wikipedia infobox image]

image_src = soup.select_one('.mw-file-description .mw-file-element')['src']

This code snippet selects the first element with the class "mw-file-element" that is nested inside an element with the class "mw-file-description" in the parsed HTML document represented by the variable soup. It then retrieves the value of the "src" attribute, which typically contains the URL of an image, and assigns it to the variable image_src.
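
Wikipedia image src attributes are often protocol-relative (they start with //upload.wikimedia.org/...), so if you plan to download the image you may want to turn the value into an absolute URL first. A minimal sketch:

from urllib.parse import urljoin

# Resolve a protocol-relative or relative src against the page URL.
image_url = urljoin('https://en.wikipedia.org/', image_src)
print(image_url)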

Step 7: How to Scrape Wikipedia Profile Title

[Image: Inspecting the Wikipedia profile title]

profile_title = soup.select_one('.infobox-data.title').text.strip().replace('\n', ', ')

This code searches for an HTML element with the class "infobox-data" and "title" within a parsed HTML document represented by the variable soup. It selects the first matching element and retrieves the text content. Then, it removes any leading or trailing whitespace using strip() and replaces newline characters ('\n') with commas (','). Finally, it assigns the modified text to the variable profile_title.
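
The selectors above target one infobox field at a time. If you want every label/value pair from the infobox without writing a selector for each field, you can iterate over the infobox rows instead. The sketch below relies on the infobox-label and infobox-data classes that Wikipedia infoboxes commonly use; verify the class names on the page you are scraping:

# Walk every row of the infobox and collect label -> value pairs.
infobox = {}
for row in soup.select('table.infobox tr'):
    label = row.find('th', class_='infobox-label')
    value = row.find('td', class_='infobox-data')
    if label and value:
        infobox[label.get_text(strip=True)] = value.get_text(' ', strip=True)
print(infobox)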

Step 8: Complete the Code

Put everything together and your code should look like this. Feel free to copy and save it to your local machine:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

def crawl(page_url, api_token):
 # Initialize the CrawlingAPI object with your token
 api = CrawlingAPI({'token': api_token})

 # Get the page content
 response = api.get(page_url)

 # Check if the request was successful
 if response['status_code'] == 200:
  # Scraped data
  scraped_data = scrape_data(response)

  print(f'{scraped_data}')
 else:
  print(f"Error: {response}")

def scrape_data(response):
 try:
  # Parse the HTML content using Beautiful Soup
  soup = BeautifulSoup(response['body'], 'html.parser')

  # Extract the title of the Wikipedia page
  title = soup.find('h1', id='firstHeading').get_text(strip=True)

  nick_name = soup.find('div', class_='nickname').get_text(strip=True)

  citizenship = soup.select_one('.infobox-data .plainlist').text.strip().replace('\n', ', ')

  # Extract the image source
  image_src = soup.select_one('.mw-file-description .mw-file-element')['src']

  profile_title = soup.select_one('.infobox-data.title').text.strip().replace('\n', ', ')

  return {
    'title': title,
    'nickName': nick_name,
    'citizenship': citizenship,
    'image': image_src,
    'profile_title': profile_title
  }
 except Exception as e:
  print(f"An error occurred: {e}")
  return {}

if __name__ == "__main__":
 # Specify the Wikipedia page URL to scrape
 page_url = 'https://en.wikipedia.org/wiki/Elon_Musk'

 # Specify your Crawlbase API token
 api_token = 'Crawlbase_Token'

 # Call the crawl function
 crawl(page_url, api_token)

Executing the code with the command python wikipedia_scraper.py should produce output similar to the example shown below:

[Image: Scraped data from Wikipedia]

VII. How to Scrape a Table from Wikipedia

To scrape a table from Wikipedia, we'll utilize BeautifulSoup along with the Crawlbase library again for fetching the HTML content. Here's how we can proceed:

Step 1: Importing Libraries

We'll import the necessary libraries for scraping the table from Wikipedia. This includes BeautifulSoup for parsing HTML, pandas for data manipulation, and the CrawlingAPI class from the Crawlbase library for making requests to fetch the HTML content.

from bs4 import BeautifulSoup
import pandas as pd
from crawlbase import CrawlingAPI

Step 2: Scraping Table from Wikipedia pages

To scrape a table from Wikipedia, we need to inspect the HTML structure of the webpage containing the table. This can be done by right-clicking on the table and selecting "Inspect" from the context menu. This action reveals the HTML content of the page, allowing us to identify the tags inside which our data is stored. Typically, tabular data in HTML is enclosed within <table> tags.
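
Because a single Wikipedia article can contain many tables, it also helps to list them and their class attributes first so you know which selector to use. A short inspection sketch, assuming soup is the BeautifulSoup object created from the fetched page (as in the scrape_data function below):

# Print the index and class attribute of every <table> on the page.
for index, table in enumerate(soup.find_all('table')):
    print(index, table.get('class'))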

Let's target this URL, which contains the table we want to scrape: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area. Once we have identified the URL, we can proceed to extract the table data from the HTML content.

[Image: Wikipedia table to scrape with Python]

Step 3: Initializing the Crawling API

Next, we'll initialize the CrawlingAPI to fetch the table data from the Wikipedia page. We'll pass this data to the scrape_data function to create a BeautifulSoup object. Then, we'll use the select_one() method to extract the relevant information, which in this case is the <table> tag. Since a Wikipedia page may contain multiple tables, we need to specify the table by passing either the "class" or the "id" attribute of the <table> tag.

You can copy and paste the complete code below:

from bs4 import BeautifulSoup
import pandas as pd
from crawlbase import CrawlingAPI

def crawl(page_url, api_token):
 try:
  api = CrawlingAPI({'token': api_token})

  # Get the page content
  response = api.get(page_url)

  # Check if the request was successful
  if response['status_code'] == 200:
    # Scrape data
    print(f"200 response")
    return scrape_data(response)
  else:
    print(f"Error: {response}")
    return None
 except Exception as e:
  print(f"An error occurred during crawling: {e}")
  return None

def scrape_data(response):
 try:
  # Parse the HTML content using Beautiful Soup
  soup = BeautifulSoup(response['body'], 'html.parser')

  # Find the table containing the desired data
  table = soup.select_one('.wikitable')

  # Extract data from the first table found
  if table:
    print("Table found.")
    # Read the table into a DataFrame
    df = pd.read_html(str(table))
    print(df)

    # Check if any data was extracted
    if df:
      # Save the DataFrame to a CSV file
      df[0].to_csv('wikipedia.csv', index=False)
      print(f"Table data exported to wikipedia.csv")
      return df
    else:
      print("No data found in the table.")
      return None
  else:
    print("No tables found.")
    return None
 except Exception as e:
  print(f"An error occurred during data extraction: {e}")
  return None

if __name__ == "__main__":
 # Specify the Wikipedia page URL to scrape
 page_url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area'
 crawlbase_token = 'crawlbase_token'

 # Call the crawl function
 crawl(page_url, crawlbase_token)

Step 4: Execute the Code to Save in CSV file

Run the code once again using the command below:

python wikipedia_scraper.py

The code structure will allow us to scrape the table from the specified Wikipedia page, process it using BeautifulSoup, and save the extracted data to a CSV file for further analysis. See the example output below:

[Image: Saved scraped Wikipedia data]
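
To confirm that the export worked, you can load the CSV back into pandas and preview the first rows. A quick check, assuming the file name used in the script above:

import pandas as pd

# Reload the exported file and show the first few rows.
df = pd.read_csv('wikipedia.csv')
print(df.head())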

VIII. Conclusion

In summary, this tutorial provided a comprehensive guide on scraping data from Wikipedia using Python, BeautifulSoup, and the Crawlbase library. It covered various aspects, including scraping HTML content, extracting specific data such as page titles, nicknames, citizenship information, and images, as well as scraping tables from Wikipedia pages.

Furthermore, the tutorial included instructions on scraping tables from Wikipedia pages, highlighting the process of inspecting the HTML structure to identify table elements and utilizing BeautifulSoup to extract tabular data.

The code provided in this tutorial is freely available for use. We encourage you to modify and adapt the code for your projects. Additionally, the techniques demonstrated here can be applied to scrape data from other websites, offering a versatile approach to web scraping tasks.

We hope that this tutorial serves as a valuable resource for anyone looking to learn Wikipedia scraping or other online sources for analysis, research, or other purposes.

We recommend browsing the following pages if you would like to see more projects like this:

How to Scrape Samsung Products

How to Scrape Google Scholar Results

Real-Estate Data Scraping from Zillow

If you have questions regarding Crawlbase or this article, please contact our support team.

IX. Frequently Asked Questions

Q. Is scraping Wikipedia legal?

Scraping Wikipedia is generally legal, but there are certain factors to consider. Wikipedia's text is available under Creative Commons licensing (CC BY-SA), which allows reuse with proper attribution, but it's essential to review and adhere to Wikipedia's terms of use, including any restrictions on automated access.

Consulting with legal experts or seeking clarification from Wikipedia directly is advisable to ensure compliance with applicable laws and regulations. This helps to mitigate the risk of potential legal issues related to web scraping activities.

Q. Is it possible to scrape Wikipedia using a different programming language?

Yes, it is possible to scrape Wikipedia using various programming languages besides Python. Wikipedia provides an API, such as the MediaWiki API, which allows access to its content programmatically. This means you can use different programming languages that support HTTP requests and JSON/XML parsing to interact with the API and retrieve Wikipedia data.

However, the specific details may vary depending on the programming language and the available libraries or frameworks for making HTTP requests and processing JSON/XML data.

Additionally, web scraping techniques, such as parsing HTML with libraries like BeautifulSoup, can also be applied in different programming languages to extract Wikipedia data directly.
