How to Scrape TechCrunch with Python

Crawlbase - Aug 14 - - Dev Community

This blog was originally posted to Crawlbase Blog

TechCrunch is a leading source of technology news, covering everything from emerging startups to major tech giants. With millions of readers worldwide, TechCrunch publishes articles that influence industry trends and shape business strategies. Scraping data from TechCrunch can provide valuable insights into the latest technology trends, startup news, and industry developments.

In this blog, we’ll walk through how to scrape TechCrunch using Python. We’ll cover everything from understanding the website structure to writing a web scraper that collects data from TechCrunch articles. We’ll also explore how to optimize the scraping process using the Crawlbase Crawling API to bypass anti-scraping measures. Let’s start!

Why Scrape TechCrunch Data?

TechCrunch is among the leading sources of technology news and analysis, providing valuable insights into the latest developments in the tech industry. Below are some of the benefits of scraping TechCrunch and what type of information you can get from it.

Benefits of Scraping TechCrunch

Scraping TechCrunch can offer several benefits:

  • Stay Updated: By scraping TechCrunch data, you can get the most recent technological trends, start-up launches, and changes in the industry. This helps organizations and individuals remain ahead of competitors in an ever-changing market.
  • Market Research: Scraped TechCrunch data supports thorough market research. Analyzing articles and news releases makes it easier to identify new trends, customer preferences, and competitors’ strategies.
  • Trends and Voices: Studying TechCrunch articles helps you identify the subjects that are gaining popularity and the people with influential voices in technology. This aids you in identifying potential partners, competitors, or even market leaders.
  • Data-Driven Decision Making: The availability of TechCrunch data allows firms to make business decisions based on current industry trends. If you are planning to launch a new product or enter a different market, the information provided by TechCrunch can be very helpful in decision-making.

Key Data Points to Extract

When scraping TechCrunch, there are several key data points you might want to focus on:

  • Article Titles and Authors: Understanding what topics are being covered and who is writing these articles will give you an idea of industry trends and influential voices.
  • Publication Dates: Tracking when articles are published can help you identify timely trends and how they evolve over time.
  • Content Summaries: Getting summaries or key points from these articles can help quickly reveal what the main ideas are without reading them in full.
  • Tags and Categories: Knowing how articles are categorized gives more insights into which issues TechCrunch addresses most frequently while also showing where these issues fit into bigger industry developments.
  • Company Mentions: Identifying which companies are frequently mentioned can offer insights into market leaders and potential investment opportunities.

By understanding these benefits and key data points, you can effectively leverage TechCrunch data to gain a competitive edge and enhance your knowledge of the tech landscape.
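As a rough illustration of the last data point, company mentions can be tallied from scraped titles and summaries with the standard library. This is a minimal sketch: the count_company_mentions helper, the sample texts, and the watchlist of company names are hypothetical, and simple word-boundary matching will miss aliases and abbreviations.

```python
import re
from collections import Counter

def count_company_mentions(texts, companies):
    """Count how many texts mention each company name (case-insensitive)."""
    counts = Counter()
    for text in texts:
        for company in companies:
            # \b keeps "Meta" from matching inside words like "metadata"
            if re.search(rf"\b{re.escape(company)}\b", text, flags=re.IGNORECASE):
                counts[company] += 1
    return counts

# Hypothetical sample texts standing in for scraped titles or summaries
sample_texts = [
    "CrowdStrike accepts award after global IT outage",
    "Oyo valuation crashes in new funding",
    "What the CrowdStrike outage means for metadata vendors",
]
mentions = count_company_mentions(sample_texts, ["CrowdStrike", "Oyo", "Meta"])
print(mentions.most_common())
```

In practice, the texts would come from the Title and Summary fields collected later in this guide.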

Setting Up Your Python Environment

To scrape TechCrunch data effectively, set up your Python environment by installing Python, using a virtual environment, and selecting the right tools.

Installing Python

Ensure Python is installed on your system. Download the latest version from the Python website and follow the installation instructions. Remember to add Python to your system PATH.

Setting Up a Virtual Environment

A virtual environment lets you manage a project’s Python dependencies without affecting other projects. It creates an isolated environment where you can install and track only the packages this scraping project needs. Here’s how to get started.

Install Virtualenv: If you don’t have virtualenv installed, you can install it via pip:

pip install virtualenv

Create a Virtual Environment: Navigate to your project directory and create a virtual environment:

virtualenv techcrunch_venv

Activate the Virtual Environment:

  • On Windows:
  techcrunch_venv\Scripts\activate
  • On macOS and Linux:
  source techcrunch_venv/bin/activate

Installing Required Libraries

With the virtual environment activated, you can install the libraries necessary for web scraping:

  1. BeautifulSoup: For parsing HTML and XML documents.
  2. Requests: To handle HTTP requests and responses.
  3. Pandas: To store and manipulate the data you scrape.
  4. Crawlbase: To enhance scraping efficiency and handle complex challenges later in the process.

Install these libraries using the following command:

pip install beautifulsoup4 requests pandas crawlbase

Choosing an IDE

Picking the right Integrated Development Environment (IDE) can greatly improve your efficiency and comfort when programming. Below are some popular choices.

  • PyCharm: A powerful IDE specifically for Python development, offering code completion, debugging, and a wide range of plugins.
  • VS Code: A versatile and lightweight editor with strong support for Python through extensions.
  • Jupyter Notebook: Ideal for exploratory data analysis and interactive coding, especially useful if you prefer a notebook interface.

Selecting the appropriate IDE will depend on personal preference and which features you feel would be most helpful in streamlining your workflow. Next, we'll cover scraping article listings to extract insights from TechCrunch content.

Scraping TechCrunch Article Listings

In this section, we are going to discuss how to scrape article listings from TechCrunch. This involves inspecting the HTML structure of the webpage, writing a scraper to extract data, handling pagination, and saving the data into a CSV file.

Inspecting the HTML Structure

Before scraping TechCrunch listings, you need to identify the correct CSS selectors for the elements that hold the data you need.

  1. Open Developer Tools: Visit the TechCrunch homepage, then open Developer Tools by right-clicking and selecting "Inspect" or using Ctrl+Shift+I (Windows) or Cmd+Option+I (Mac).
  2. Locate Article Containers: Find the main container for each article. On TechCrunch, articles are usually inside a <div> with the class wp-block-tc23-post-picker. This helps you loop through each article.
  3. Identify Key Elements: Within each article container, locate the specific elements containing the data:
  • Title: Typically within an <h2> tag with the class wp-block-post-title.
  • Link: An <a> tag inside the title element, with the URL in the href attribute.
  • Author: Usually in a <div> with the class wp-block-tc23-author-card-name.
  • Publication Date: Often in a <time> tag, with the date in the datetime attribute.
  • Summary: Found in a <p> tag with the class wp-block-post-excerpt__excerpt.

Writing the TechCrunch Listing Scraper

Let's write a web scraper to extract data from TechCrunch's article listings page using Python and BeautifulSoup. We'll scrape the title, article link, author, date of publication, and summary from each article listed.

Import Libraries

First, we need to import the necessary libraries:

import requests
from bs4 import BeautifulSoup
import json

Define the Scraper Function

Next, we'll define a function to scrape the data:

def scrape_techcrunch_listings(url):
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        articles = soup.select('div.wp-block-group > div.wp-block-tc23-post-picker-group > div.wp-block-tc23-post-picker')
        data = []

        for article in articles:
            title_element = article.select_one('h2.wp-block-post-title')
            title = title_element.text.strip()
            link = title_element.find('a')['href']
            author = article.select_one('div.wp-block-tc23-author-card-name').text.strip()
            publication_date = article.select_one('time')['datetime']
            summary = article.select_one('p.wp-block-post-excerpt__excerpt').text.strip()

            data.append({
                'Title': title,
                'Link': link,
                'Author': author,
                'Publication Date': publication_date,
                'Summary': summary
            })

        return data
    else:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")
        return None

This function collects article data from TechCrunch's listings, capturing details such as titles, links, authors, publication dates, and summaries.
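Note that select_one returns None when a selector matches nothing, so a listing card without an author or excerpt would raise an error inside the loop above. One way to guard against that is a pair of small helpers. This is a sketch rather than part of the original scraper: text_or_none and attr_or_none are hypothetical names, and they rely only on the get_text and get methods that BeautifulSoup tags provide. A stand-in class is used here so the snippet runs without fetching a page.

```python
def text_or_none(element):
    """Return the stripped text of a parsed element, or None if it is missing."""
    return element.get_text().strip() if element is not None else None

def attr_or_none(element, name):
    """Return an attribute value of a parsed element, or None if it is missing."""
    return element.get(name) if element is not None else None

# Stand-in for a BeautifulSoup Tag so the sketch runs offline
class FakeTag:
    def __init__(self, text, attrs=None):
        self._text = text
        self._attrs = attrs or {}

    def get_text(self):
        return self._text

    def get(self, name, default=None):
        return self._attrs.get(name, default)

print(text_or_none(FakeTag("  Sean O'Kane ")))
print(text_or_none(None))
print(attr_or_none(FakeTag("", {"datetime": "2024-08-11T11:35:08-07:00"}), "datetime"))
```

In the scraper, a line such as author = text_or_none(article.select_one('div.wp-block-tc23-author-card-name')) would then yield None for a missing author instead of stopping the run.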

Test the Scraper

To test the scraper, use the following code:

url = 'https://techcrunch.com'
articles_data = scrape_techcrunch_listings(url)

print(json.dumps(articles_data, indent=2))

Create a new file named techcrunch_listing_scraper.py, copy the code provided into this file, and save it. Run the script using the following command:

python techcrunch_listing_scraper.py

You should see output similar to the example below.

[
  {
    "Title": "How CNH\u2019s \u2018black belt\u2019 M&A head makes deals",
    "Link": "https://techcrunch.com/2024/08/11/how-cnhs-black-belt-ma-head-makes-deals/",
    "Author": "Sean O'Kane",
    "Publication Date": "2024-08-11T11:35:08-07:00",
    "Summary": "Heavy equipment manufacturer CNH Industrial has a long history of mergers and acquisitions, at times supervising legendary brands like Ferrari. But five years ago, as agtech was booming, the global\u2026"
  },
  {
    "Title": "CrowdStrike accepts award for \u2018most epic fail\u2019 after global IT outage",
    "Link": "https://techcrunch.com/2024/08/11/crowdstrike-accepts-award-for-most-epic-fail-after-global-it-outage/",
    "Author": "Anthony Ha",
    "Publication Date": "2024-08-11T10:40:21-07:00",
    "Summary": "CrowdStrike\u2019s president said he\u2019ll take the trophy back to headquarters as a reminder that \u201cour goal is to protect people, and we got this wrong.\u201d"
  },
  {
    "Title": "Open source tools to boost your productivity",
    "Link": "https://techcrunch.com/2024/08/11/a-not-quite-definitive-guide-to-open-source-alternative-software/",
    "Author": "Paul Sawers",
    "Publication Date": "2024-08-11T09:00:00-07:00",
    "Summary": "TechCrunch has pulled together some open-source alternatives to popular productivity apps that might appeal to prosumers, freelancers, or small businesses looking to escape the clutches of Big Tech."
  },
  {
    "Title": "Oyo valuation crashes over 75% in new funding",
    "Link": "https://techcrunch.com/2024/08/11/oyo-valuation-crashes-over-75-in-new-funding/",
    "Author": "Manish Singh",
    "Publication Date": "2024-08-11T06:07:12-07:00",
    "Summary": "The valuation of Oyo, once India\u2019s second-most valuable startup at $10 billion, has dipped to $2.4 billion in a new funding round, multiple sources told TechCrunch. The Gurugram-headquartered startup, which\u2026"
  },
  .... more
]
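The Publication Date values in this output are ISO 8601 strings with a UTC offset, so they can be turned into timezone-aware datetime objects for sorting or filtering. A minimal sketch, assuming Python 3.7 or later and the date format shown in the sample output; parse_publication_date is a hypothetical helper name.

```python
from datetime import datetime, timezone

def parse_publication_date(value):
    """Parse an ISO 8601 timestamp such as '2024-08-11T11:35:08-07:00'."""
    return datetime.fromisoformat(value)

published = parse_publication_date("2024-08-11T11:35:08-07:00")
# Normalize to UTC so dates with different offsets compare consistently
published_utc = published.astimezone(timezone.utc)
print(published_utc.isoformat())  # 2024-08-11T18:35:08+00:00
```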

In the next sections, we'll handle pagination and store the extracted data efficiently.

Handling Pagination

When scraping TechCrunch, you may encounter multiple pages of article listings. To gather data from all pages, you need to handle pagination. This involves making multiple requests and navigating through each page.

Understanding Pagination URLs

TechCrunch’s article listings put the page number in the URL path. For example, the URL for the first page might be https://techcrunch.com/page/1/, while the second page could be https://techcrunch.com/page/2/, and so on.
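The page URL pattern described above can be captured in a tiny helper. This is a sketch that assumes the /page/N/ pattern holds; build_page_url is a hypothetical name.

```python
def build_page_url(base_url, page):
    """Build a listing URL such as https://techcrunch.com/page/2/."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # rstrip avoids a double slash when base_url already ends with '/'
    return f"{base_url.rstrip('/')}/page/{page}/"

for page in range(1, 4):
    print(build_page_url("https://techcrunch.com/", page))
```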

Define the Pagination Function

This function manages pagination by iterating through the requested number of pages and collecting data, stopping early if a page fails to load or returns no articles.

def scrape_techcrunch_with_pagination(base_url, start_page=1, num_pages=1):
    all_data = []

    # TechCrunch page numbers start at 1
    for page in range(start_page, start_page + num_pages):
        url = f"{base_url}/page/{page}/"
        print(f"Scraping page: {page}")

        page_data = scrape_techcrunch_listings(url)
        if page_data:
            all_data.extend(page_data)
        else:
            # Stop on the first page that fails or returns no articles
            print(f"Failed to retrieve data from page: {page}")
            break

    return all_data

In this function:

  • base_url is the URL of the TechCrunch listings page.
  • start_page specifies the starting page number.
  • num_pages determines how many pages to scrape.
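When scraping several pages in a row, it is usually considerate to pause between requests. The following is a sketch of one way to do that, not part of the original scraper: scrape_pages_politely and its delay_seconds parameter are hypothetical, and the page-scraping function is passed in as an argument so the loop can be tried without network access.

```python
import time

def scrape_pages_politely(scrape_page, base_url, start_page=1, num_pages=1, delay_seconds=1.0):
    """Scrape consecutive listing pages, sleeping between requests."""
    all_data = []
    last_page = start_page + num_pages - 1
    for page in range(start_page, start_page + num_pages):
        page_data = scrape_page(f"{base_url}/page/{page}/")
        if not page_data:
            # Stop early on a failed or empty page, as the function above does
            break
        all_data.extend(page_data)
        if page < last_page:
            time.sleep(delay_seconds)  # pause before requesting the next page
    return all_data

# Stand-in for scrape_techcrunch_listings so the sketch runs offline
def fake_scrape_page(url):
    return [{"Link": url}]

results = scrape_pages_politely(fake_scrape_page, "https://techcrunch.com", num_pages=3, delay_seconds=0.01)
print(len(results))
```

Passing scrape_techcrunch_listings in place of fake_scrape_page would apply the same delay to real requests.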

Storing Data in a CSV File

Using the function below, you can save the scraped article data into a CSV file.

import pandas as pd

def save_data_to_csv(data, filename='techcrunch_listing.csv'):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False, encoding='utf-8')
    print(f"Data successfully saved to {filename}")

This function converts the list of dictionaries (containing your scraped data) into a DataFrame using pandas and then saves it as a CSV file.
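If new articles are published while a multi-page run is in progress, the same article can show up on two consecutive pages. Below is a minimal sketch of dropping such duplicates before saving, assuming each record has the 'Link' key used by the listing scraper; deduplicate_by_link is a hypothetical helper and the sample records are made up.

```python
def deduplicate_by_link(records):
    """Keep the first record seen for each article link, preserving order."""
    seen_links = set()
    unique_records = []
    for record in records:
        link = record.get("Link")
        if link in seen_links:
            continue
        seen_links.add(link)
        unique_records.append(record)
    return unique_records

# Made-up records standing in for scraped listing data
records = [
    {"Title": "A", "Link": "https://techcrunch.com/a/"},
    {"Title": "B", "Link": "https://techcrunch.com/b/"},
    {"Title": "A (seen again)", "Link": "https://techcrunch.com/a/"},
]
unique = deduplicate_by_link(records)
print([record["Title"] for record in unique])
```

The deduplicated list can be passed straight to save_data_to_csv.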

Complete Code

Here’s the complete code to scrape TechCrunch article listings, handle pagination, and save the data to a CSV file. This script combines all the functions we've discussed into one Python file.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape TechCrunch article listings
def scrape_techcrunch_listings(url):
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        articles = soup.select('div.wp-block-group > div.wp-block-tc23-post-picker-group > div.wp-block-tc23-post-picker')
        data = []

        for article in articles:
            title_element = article.select_one('h2.wp-block-post-title')
            title = title_element.text.strip()
            link = title_element.find('a')['href']
            author = article.select_one('div.wp-block-tc23-author-card-name').text.strip()
            publication_date = article.select_one('time')['datetime']
            summary = article.select_one('p.wp-block-post-excerpt__excerpt').text.strip()

            data.append({
                'Title': title,
                'Link': link,
                'Author': author,
                'Publication Date': publication_date,
                'Summary': summary
            })

        return data
    else:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")
        return None

# Function to handle pagination
def scrape_techcrunch_with_pagination(base_url, start_page=1, num_pages=1):
    all_data = []

    for page in range(start_page, start_page + num_pages):
        url = f"{base_url}/page/{page}/"
        print(f"Scraping page: {page}")

        page_data = scrape_techcrunch_listings(url)
        if page_data:
            all_data.extend(page_data)
        else:
            print(f"Failed to retrieve data from page: {page}")
            break

    return all_data

# Function to save data to CSV
def save_data_to_csv(data, filename='techcrunch_listing.csv'):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False, encoding='utf-8')
    print(f"Data successfully saved to {filename}")

# Main function to run the scraper
def main():
    base_url = 'https://techcrunch.com'
    num_pages_to_scrape = 5  # Specify the number of pages you want to scrape

    all_article_data = scrape_techcrunch_with_pagination(base_url, num_pages=num_pages_to_scrape)

    if all_article_data:
        save_data_to_csv(all_article_data)
    else:
        print("No data collected.")

if __name__ == "__main__":
    main()

Scraping TechCrunch Article Page

In this section, we will focus on scraping individual TechCrunch article pages to gather more detailed information about each article. This involves inspecting the HTML structure of an article page, writing a scraper function, and saving the collected data.

Inspecting the HTML Structure

To scrape TechCrunch articles, start by finding the CSS selectors of required elements from the page’s HTML structure:

  1. Open Developer Tools: Visit a TechCrunch article and open Developer Tools using Ctrl+Shift+I (Windows) or Cmd+Option+I (Mac).
  2. Identify Key Elements:
  • Title: Usually in an <h1> tag with the class wp-block-post-title.
  • Author: Often in a <div> with the class wp-block-tc23-author-card-name.
  • Publication Date: Found in a <time> tag, with the date in the datetime attribute.
  • Content: Usually in a <div> with class wp-block-post-content.

Writing the TechCrunch Article Page Scraper

With the HTML structure in mind, let’s write a function to scrape the detailed information from a TechCrunch article page.

import requests
from bs4 import BeautifulSoup
import json

def scrape_techcrunch_article(url):
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extracting the title
        title = soup.select_one('h1.wp-block-post-title').text.strip()

        # Extracting the author
        author = soup.select_one('div.wp-block-tc23-author-card-name > a').text.strip()

        # Extracting the publication date
        publication_date = soup.select_one('div.wp-block-post-date > time')['datetime']

        # Extracting the content
        content = soup.select_one('div.wp-block-post-content').text.strip()

        return {
            'Title': title,
            'Author': author,
            'Publication Date': publication_date,
            'Content': content
        }
    else:
        print(f"Failed to retrieve the article. Status code: {response.status_code}")
        return None

Test the Scraper

To test the scraper, use the following code:

url = 'https://techcrunch.com/2024/08/11/oyo-valuation-crashes-over-75-in-new-funding/'
article_data = scrape_techcrunch_article(url)

print(json.dumps(article_data, indent=2))

Create a new file named techcrunch_article_scraper.py, copy the code provided into this file, and save it. Run the script using the following command:

python techcrunch_article_scraper.py

You should see output similar to the example below.

{
  "Title": "Oyo valuation crashes over 75% in new funding",
  "Author": "Manish Singh",
  "Publication Date": "2024-08-11T06:07:12-07:00",
  "Content": "The valuation of Oyo, once India\u2019s second-most valuable startup at $10 billion, has dipped to $2.4 billion in a new funding round, multiple sources told TechCrunch ... more till end."
}

Storing Data in a CSV File

To store the article data, you can use pandas to save the results into a CSV file. The function below mirrors the earlier save_data_to_csv function but writes to a separate file. It expects a list of article dictionaries, one per scraped article.

import pandas as pd

def save_article_data_to_csv(data, filename='techcrunch_articles.csv'):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False, encoding='utf-8')
    print(f"Article data successfully saved to {filename}")

Complete Code

Combining everything, here is the complete code to scrape individual TechCrunch article pages and save the data:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape individual TechCrunch article pages
def scrape_techcrunch_article(url):
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extracting the title
        title = soup.select_one('h1.wp-block-post-title').text.strip()

        # Extracting the author
        author = soup.select_one('div.wp-block-tc23-author-card-name > a').text.strip()

        # Extracting the publication date
        publication_date = soup.select_one('div.wp-block-post-date > time')['datetime']

        # Extracting the content
        content = soup.select_one('div.wp-block-post-content').text.strip()

        return {
            'Title': title,
            'Author': author,
            'Publication Date': publication_date,
            'Content': content
        }
    else:
        print(f"Failed to retrieve the article. Status code: {response.status_code}")
        return None

# Function to save article data to CSV
def save_article_data_to_csv(data, filename='techcrunch_articles.csv'):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False, encoding='utf-8')
    print(f"Article data successfully saved to {filename}")

# Example usage
if __name__ == "__main__":
    # Replace with actual article URLs
    article_urls = [
        'https://techcrunch.com/2024/08/10/example-article/',
        'https://techcrunch.com/2024/08/11/another-article/'
    ]

    all_article_data = []
    for url in article_urls:
        article_data = scrape_techcrunch_article(url)
        if article_data:
            all_article_data.append(article_data)

    save_article_data_to_csv(all_article_data)

You can adapt the article_urls list to include URLs of the articles you want to scrape.
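One option is to read the URLs from the CSV produced by the listing scraper rather than typing them by hand. The sketch below uses the standard csv module and assumes the file has the 'Link' column written by save_data_to_csv; load_article_urls is a hypothetical helper, and the sample file is created here only so the snippet runs on its own.

```python
import csv
import os
import tempfile

def load_article_urls(csv_path, column="Link"):
    """Read non-empty article URLs from one column of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row[column] for row in csv.DictReader(f) if row.get(column)]

# Create a small sample file standing in for techcrunch_listing.csv
sample_path = os.path.join(tempfile.mkdtemp(), "techcrunch_listing.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Title", "Link"])
    writer.writeheader()
    writer.writerow({
        "Title": "Oyo valuation crashes over 75% in new funding",
        "Link": "https://techcrunch.com/2024/08/11/oyo-valuation-crashes-over-75-in-new-funding/",
    })

article_urls = load_article_urls(sample_path)
print(article_urls)
```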

Optimizing Scraping with Crawlbase Crawling API

When you scrape TechCrunch data, you may run into challenges such as IP blocking, rate limiting, and dynamic content. The Crawlbase Crawling API can help you overcome these hurdles and keep the scraping process running smoothly. Here’s how Crawlbase can optimize your scraping efforts:

Bypassing Scraping Challenges

  1. IP Blocking and Rate Limiting: Websites like TechCrunch may block your IP address if too many requests are made in a short period. To reduce the risk of detection and blocking, Crawlbase Crawling API rotates between different IP addresses and manages request rates.
  2. Dynamic Content: Some TechCrunch pages load part of their content with JavaScript, which makes it hard for traditional scrapers to reach it directly. By rendering JavaScript, the Crawlbase Crawling API lets you access content that is loaded this way.
  3. CAPTCHA and Anti-Bot Measures: TechCrunch may use CAPTCHAs and other anti-bot technologies to prevent automated scraping. Crawlbase Crawling API can bypass these measures, allowing you to collect data without interruptions.
  4. Geolocation: TechCrunch may serve different content based on location. Crawlbase Crawling API lets you specify the country for your requests, ensuring you get relevant data based on your target region.

Implementing Crawlbase in Your Scraper

To integrate the Crawlbase Crawling API into your TechCrunch scraper, follow these steps:

  1. Install the Crawlbase Library: Install the Crawlbase Python library using pip:
pip install crawlbase
  2. Set Up the Crawlbase API: Initialize the Crawlbase API with your access token. You can get one by creating an account on Crawlbase.
from crawlbase import CrawlingAPI

# Initialize Crawlbase API with your access token
crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

Note: Crawlbase provides two types of tokens: a Normal Token for static websites and a JavaScript (JS) Token for handling dynamic or browser-based requests. For TechCrunch, you need the Normal Token. The first 1,000 requests are free to get you started, with no credit card required. For more details, see the Crawlbase Crawling API documentation.

  3. Update Scraper Function: Modify your scraping functions to use the Crawlbase API for making requests. Here’s an example of how to update the scrape_techcrunch_listings function:
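def scrape_techcrunch_listings(url):
    options = {
        'country': 'US',  # Set your preferred country or remove it for default settings
        'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'
    }
    response = crawling_api.get(url, options)

    # Crawlbase reports the status of the crawled page in the pc_status header
    if response['headers']['pc_status'] == '200':
        # Parse the returned page body instead of response.content
        soup = BeautifulSoup(response['body'], 'html.parser')
        # ... the rest of the function is the same as before ...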

By using the Crawlbase Crawling API, you can deal with common scraping problems and reduce the chance of getting blocked while scraping TechCrunch.
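To keep the parsing code independent of how pages are fetched, the Crawlbase call can be wrapped in a small helper. This is a sketch under stated assumptions: fetch_html is a hypothetical name, the pc_status check mirrors the example above, and reading the page from response['body'] assumes the response dictionary exposes the HTML under that key. A stand-in client is used so the snippet runs without a token.

```python
def fetch_html(crawling_api, url, options=None):
    """Fetch a page through a Crawlbase-style client; return the body or None."""
    response = crawling_api.get(url, options or {})
    pc_status = response['headers'].get('pc_status')
    if pc_status == '200':
        return response['body']  # assumed key holding the page HTML
    print(f"Failed to retrieve {url}. pc_status: {pc_status}")
    return None

# Stand-in client that mimics the response shape, for offline testing only
class FakeCrawlingAPI:
    def __init__(self, pc_status):
        self.pc_status = pc_status

    def get(self, url, options=None):
        return {'headers': {'pc_status': self.pc_status}, 'body': b'<html></html>'}

print(fetch_html(FakeCrawlingAPI('200'), 'https://techcrunch.com'))
print(fetch_html(FakeCrawlingAPI('520'), 'https://techcrunch.com'))
```

With a real client, the returned body could be handed to BeautifulSoup exactly as response.content was in the earlier scrapers.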

Final Thoughts (Scrape TechCrunch with Crawlbase)

Scraping data from TechCrunch can provide valuable insights into the tech industry's latest trends, innovations, and influential figures. By extracting information from articles and listings, you can stay informed about emerging technologies and key players in the field. This guide has shown you how to set up a Python environment, write a functional scraper, and optimize your efforts with the Crawlbase Crawling API to overcome common scraping challenges.

If you're looking to expand your web scraping capabilities, consider exploring the following guides on scraping other important websites.

📜 How to Scrape Bloomberg
📜 How to Scrape Wikipedia
📜 How to Scrape Google Finance
📜 How to Scrape Google News
📜 How to Scrape Clutch.co

If you have any questions or feedback, our support team is always available to assist you on your web scraping journey. Happy Scraping!

Frequently Asked Questions

Q. What are the legal considerations for scraping TechCrunch data?

Collecting data from sites such as TechCrunch raises legal and ethical issues. Review TechCrunch’s terms of service, which may include specific policies on data scraping. Make sure that your scraping operations comply with these provisions and do not violate data protection regulations such as GDPR or CCPA. It is advisable to consult legal advisers to clarify any legal or ethical questions about your data gathering.

Q. What should I do if my IP address gets blocked while scraping?

If your IP address gets blocked while scraping TechCrunch, you can take several measures to mitigate this issue. Implement IP rotation by using proxy services or scraping tools like the Crawlbase Crawling API, which automatically rotates IPs to avoid detection. You can also adjust the rate of your requests to mimic human browsing behavior, reducing the risk of triggering anti-scraping measures.
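Slowing down after a failed request can be sketched as a retry loop with exponential backoff. This is an illustration rather than a Crawlbase feature: fetch_with_backoff and its parameters are hypothetical, and the fetch function is passed in and expected to return None on failure, like the scrapers in this guide.

```python
import time

def fetch_with_backoff(fetch, url, max_attempts=3, base_delay=1.0):
    """Call fetch(url) until it returns data, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Wait base_delay, then 2x, then 4x, and so on
            time.sleep(base_delay * (2 ** attempt))
    return None

# Stand-in fetcher that fails twice before succeeding
attempts = []

def flaky_fetch(url):
    attempts.append(url)
    return {"Title": "ok"} if len(attempts) >= 3 else None

result = fetch_with_backoff(flaky_fetch, "https://techcrunch.com", base_delay=0.01)
print(result, len(attempts))
```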

Q. How can I improve the performance of my TechCrunch scraper?

Multi-threading or asynchronous requests can help your scraper run much faster. Cut out operations that are not required, and use libraries such as pandas for efficient data handling. Also, the Crawlbase Crawling API can enhance performance by managing IP rotation and handling CAPTCHAs, ensuring uninterrupted access to the data you want to scrape.
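The multi-threading suggestion can be sketched with ThreadPoolExecutor from the standard library. The scrape_articles_concurrently helper and the max_workers value are hypothetical, and the per-article function is passed in so the snippet runs offline. A small worker count is a reasonable starting point, since many parallel requests make rate limiting more likely.

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_articles_concurrently(scrape_article, urls, max_workers=4):
    """Scrape several article URLs in parallel, dropping failed (None) results."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves the order of the input URLs
        results = list(executor.map(scrape_article, urls))
    return [result for result in results if result is not None]

# Stand-in for scrape_techcrunch_article so the sketch runs offline
def fake_scrape_article(url):
    return None if "missing" in url else {"Link": url}

urls = [
    "https://techcrunch.com/2024/08/10/example-article/",
    "https://techcrunch.com/2024/08/11/missing-article/",
    "https://techcrunch.com/2024/08/11/another-article/",
]
scraped = scrape_articles_concurrently(fake_scrape_article, urls)
print(len(scraped))
```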
