Real-Estate Data Scraping from Zillow

Crawlbase - Feb 26 - Dev Community

This blog was originally posted to Crawlbase Blog

When it comes to the real estate industry, having access to accurate and up-to-date data can give you a competitive edge. One platform that has become a go-to source for real estate data is Zillow. With its vast database of property listings, market trends, and neighborhood information, Zillow has become a treasure trove of valuable data for homebuyers, sellers, and real estate professionals.

Zillow records millions of visits daily and hosts a staggering number of property listings. With a user-friendly interface and a diverse range of features, it attracts a substantial audience seeking information on real estate trends and property details.

[Image: Zillow visitor statistics]

Real estate professionals rely heavily on accurate and comprehensive data to make informed decisions. Whether it's researching market trends, evaluating property prices, or identifying investment opportunities, having access to reliable data is crucial. But manually extracting data from Zillow can be a tedious and time-consuming task. That's where data scraping comes into play. Data scraping from Zillow empowers real estate professionals with the ability to collect and analyze large amounts of data quickly and efficiently, saving both time and effort.

Come along as we explore the world of Zillow data scraping using Python. We'll kick off with a commonly used approach, understand its limitations, and then delve into the efficiency of the Crawlbase Crawling API. Join us on this adventure through the intricacies of web scraping on Zillow!

Table of Contents

  1. Understanding Zillow Website
  • Zillow's Search Paths
  • Zillow's Front-end Technologies
  • Zillow SERP Layout
  • Zillow Property Page Layout
  • Key Data Points Available on Zillow
  2. Setting Up Your Python Environment
  • Installing Python
  • Installing Essential Libraries
  • Choosing a Suitable Development IDE
  3. Zillow Scraper With Common Approach
  • Utilizing Python's Requests Library
  • Inspecting the Zillow Page for CSS Selectors
  • Parsing HTML with BeautifulSoup
  • Drawbacks and Challenges of the Common Approach
  4. Using Crawlbase Crawling API for Zillow
  • Crawlbase Account Creation and API Token Retrieval
  • Accessing the Crawling API with the Crawlbase Library
  • Scraping Property Page URLs from the SERP
  • Handling Pagination for Extensive Data Retrieval
  • Extracting Required Data from Property Page URLs
  • Saving Scraped Data in a Database
  • Advantages of Using Crawlbase's Crawling API for Zillow Scraping
  5. Real Estate Insights: Analyzing Zillow Data
  • Potential Use Cases for Real Estate Professionals
  • Data Analysis and Visualization Possibilities
  6. Final Thoughts
  7. Frequently Asked Questions (FAQs)

Understanding Zillow Website

Zillow offers a user-friendly interface and a vast database of property listings. With Zillow, you can easily search for properties based on your desired location, price range, and other specific criteria. The platform provides detailed property information, including the number of bedrooms and bathrooms, square footage, and even virtual tours or 3D walkthroughs in some cases.

Moreover, Zillow goes beyond just property listings. It also provides valuable insights into neighborhoods and market trends. You can explore the crime rates, school ratings, and amenities in a particular area to determine if it aligns with your preferences and lifestyle. Zillow's interactive mapping tools allow you to visualize the proximity of the property to nearby amenities such as schools, parks, and shopping centers.

Zillow's Search Paths

Understanding the structure of Zillow's Search Engine Results Page (SERP) URLs provides insights into how the platform organizes and presents its data. Zillow allows users to search for properties in specific cities, neighborhoods, or even by zip code.

Zillow offers various other search filters, such as price range, property type, number of bedrooms, and more. By utilizing these filters effectively, you can narrow down your search and extract specific data that aligns with your needs. The URLs are categorized into distinct sections based on user queries and preferences. Here are examples of some main categories within the SERP URLs:

  • Sale Listings: https://www.zillow.com/{location}/sale/?searchQueryState={...}
  • Sold Properties: https://www.zillow.com/{location}/sold/?searchQueryState={...}
  • Rental Listings: https://www.zillow.com/{location}/rentals/?searchQueryState={...}

These URLs represent specific sections of Zillow's database, allowing users to explore properties available for sale, recently sold properties, or rental listings in a particular location.
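
To make this concrete, here is a minimal sketch of how such SERP URLs can be assembled programmatically. The location-slug format (lowercase, hyphen-separated) is an assumption inferred from the example URLs above, and the optional searchQueryState parameter is omitted for simplicity:

def build_serp_url(location, category="sale"):
    # category is one of: "sale", "sold", "rentals"
    slug = location.lower().replace(",", "").replace(" ", "-")
    return f"https://www.zillow.com/{slug}/{category}/"

print(build_serp_url("Columbia Heights, Washington, DC"))
# https://www.zillow.com/columbia-heights-washington-dc/sale/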

Zillow's Front-end Technologies

Understanding Zillow's front-end technologies is pivotal for effective data scraping. The platform employs advanced technologies to ensure a seamless user experience:

  • Responsive Web Design: This makes the website work well on different devices like computers, tablets, and phones, giving users a consistent experience.
  • Dynamic User Interface: Zillow uses JavaScript to show real-time updates. This helps in loading content and interactive parts of the site dynamically.
  • Asynchronous JavaScript (AJAX): This technology allows updates on the website without needing to reload the whole page. It makes the site responsive and interactive.
  • Single Page Application (SPA) Architecture: Zillow's site works like a single page, reducing the need to reload the entire page. This makes navigating through the site smoother.
  • RESTful APIs: These tools help the front-end (what users see) talk to the back-end (the behind-the-scenes part). They allow Zillow to get and change data for user interaction.

Understanding these front-end technologies provides valuable insights for crafting effective web scraping strategies on Zillow. It helps decipher the webpage structure, ensuring efficient and accurate data extraction.

Zillow SERP Layout

[Image: Zillow SERP layout]

  • Search Filters: These are at the top for personalized property searches. Users can filter by location, price range, and property type, making it important to consider when scraping data for specific criteria.
  • Property Listings: The listings show details like property type, price, square footage, bedrooms, and bathrooms. These details are essential for focused data extraction, ensuring you capture the information you need.
  • Map Integration: Although it enhances the user experience by providing a visual representation of property locations, it isn't directly involved in scraping. It's something to be aware of but doesn't impact the extraction process.
  • Sort and Filter Options: Users can organize listings based on parameters like "Newest" or "Price." When crafting scraping strategies, it's important to consider these options to ensure the data is gathered in a way that aligns with user preferences.
  • Pagination: Zillow breaks down search results into multiple pages. This is crucial for capturing all relevant listings. Scraping strategies need to account for pagination to ensure comprehensive data retrieval.
  • Featured Listings and Advertisements: These intermittently appear within the Search Engine Results Page (SERP). Being aware of these elements helps distinguish between organic and sponsored content during scraping, allowing for a more accurate understanding of the data.

Understanding Zillow's SERP layout is crucial for effective web scraping, ensuring accurate data extraction and a systematic approach to accessing valuable real estate information.

Zillow Property Page Layout

[Image: Zillow property page layout]

  • Essential Property Information: Key details like property type, address, price, size (sqft), bedrooms, and bathrooms are prominently displayed for quick reference. When scraping, capturing this information ensures a comprehensive understanding of the property.
  • High-Resolution Images: Multiple images showcasing different areas of the property provide a visual aid for users. While not directly involved in scraping, recognizing the presence of images is essential for data interpretation and presentation.
  • Description and Features: The detailed property description and features help users understand unique aspects of the listing. When scraping, capturing and analyzing this text provides valuable insights into the property's characteristics.
  • Neighborhood Insights: Information about the neighborhood, schools, and local amenities is valuable for potential homebuyers assessing surroundings. Scraping strategies should consider capturing this data for a more comprehensive property profile.
  • Property History and Tax Information: Historical overview and tax details offer transparency and additional context for interested parties. When scraping, capturing this information adds depth to the understanding of the property's background.
  • Contact Information: Facilitating direct communication with the listing agent, contact information allows users to inquire or schedule property visits easily. This detail is crucial for user interaction and engagement.

Understanding the layout of Zillow's property pages is essential for effective navigation and information extraction. Each section serves a specific purpose, guiding users through a comprehensive overview of the listed property.

Key Data Points Available on Zillow

When scraping data from Zillow, it's crucial to identify the key data points that align with your objectives. Zillow provides a vast array of information, ranging from property details to market trends. Some of the essential data points you can extract from Zillow include:

[Image: key data points available on Zillow]

  • Property Details: Includes detailed information about the property, such as square footage, the number of bedrooms and bathrooms, and the type of property (e.g., single-family home, condo, apartment).
  • Price History: Tracks the historical pricing information for a property, allowing users to analyze price trends and fluctuations over time.
  • Zestimate: Zillow's proprietary home valuation tool that provides an estimated market value for a property based on various factors. It offers insights into a property's potential worth.
  • Neighborhood Information: Offers data on the neighborhood, including nearby schools, amenities, crime rates, and other relevant details that contribute to a comprehensive understanding of the area.
  • Local Market Trends: Provides insights into the local real estate market, showcasing trends such as median home prices, inventory levels, and the average time properties spend on the market.
  • Comparable Home Sales: Allows users to compare a property's details and pricing with similar homes in the area, aiding in market analysis and decision-making.
  • Rental Information: For rental properties, Zillow includes details such as monthly rent, lease terms, and amenities, assisting both renters and landlords in making informed choices.
  • Property Tax Information: Offers data on property taxes, helping users understand the tax implications associated with a particular property.
  • Home Features and Amenities: Lists specific features and amenities available in a property, providing a detailed overview for potential buyers or tenants.
  • Interactive Maps: Utilizes maps to display property locations, neighborhood boundaries, and nearby points of interest, enhancing spatial understanding.

Understanding and leveraging these key data points on Zillow is essential for anyone involved in real estate research, whether it be for personal use, investment decisions, or market analysis.

Setting Up Your Python Environment

Setting up a conducive Python environment is the foundational step for efficient real estate data scraping from Zillow. Here's a brief guide to getting your Python environment ready:

Installing Python

Begin by installing Python on your machine. Visit the official Python website (https://www.python.org/) to download the latest version compatible with your operating system.

During installation, ensure you check the box that says "Add Python to PATH" to make Python accessible from any command prompt window.

Once Python is installed, open a command prompt or terminal window and verify the installation with the following command:

python --version

Installing Essential Libraries

For web scraping, you'll need to install essential libraries like requests for making HTTP requests and beautifulsoup4 for parsing HTML. To leverage the Crawlbase Crawling API seamlessly, install the Crawlbase Python library as well. Use the following commands:

pip install requests
pip install beautifulsoup4
pip install crawlbase

Choosing a Suitable Development IDE

Selecting the right Integrated Development Environment (IDE) can greatly enhance your coding experience. There are several IDEs to choose from; here are a few popular ones:

  • PyCharm: A powerful and feature-rich IDE specifically designed for Python development. It offers intelligent code assistance, a visual debugger, and built-in support for web development.
  • VSCode (Visual Studio Code): A lightweight yet powerful code editor that supports Python development. It comes with a variety of extensions, making it customizable to your preferences.
  • Jupyter Notebook: Ideal for data analysis and visualization tasks. Jupyter provides an interactive environment and is widely used in data science projects.
  • Spyder: A MATLAB-like IDE that is well-suited for scientific computing and data analysis. It comes bundled with the Anaconda distribution.

Choose an IDE based on your preferences and the specific requirements of your real estate data scraping project. Ensure the selected IDE supports Python and provides the features you need for efficient coding and debugging.

Zillow Scraper With Common Approach

In this section, we'll walk through the common approach to creating a Zillow scraper using Python. This method involves using the requests library to fetch web pages and BeautifulSoup for parsing HTML to extract the desired information.

In our example, we'll focus on scraping properties for sale in Columbia Heights, Washington, DC. Let's break down the process into digestible chunks:

Utilizing Python's Requests Library

The requests library allows us to send HTTP requests to Zillow's servers and retrieve the HTML content of web pages. Here's a code snippet to make a request to the Zillow website:

import requests

url = "https://www.zillow.com/columbia-heights-washington-dc/sale/"
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
    print(html_content)
else:
    print(f"Failed to fetch page. Status code: {response.status_code}")

Open your preferred text editor or IDE, copy the provided code, and save it in a Python file. For example, name it zillow_scraper.py.

Run the Script:

Open your terminal or command prompt and navigate to the directory where you saved zillow_scraper.py. Execute the script using the following command:

python zillow_scraper.py

As you hit Enter, your script will come to life, sending a request to the Zillow website, retrieving the HTML content and displaying it on your terminal.

[Image: HTML output snapshot]

Inspecting the Zillow Page for CSS Selectors

With the HTML content obtained from the page, the next step is to analyze the webpage and pinpoint the location of data points we need.

[Image: inspecting the Zillow SERP in browser developer tools]

  1. Open Developer Tools: Simply right-click on the webpage in your browser and choose 'Inspect' (or 'Inspect Element'). This will reveal the Developer Tools, allowing you to explore the HTML structure.
  2. Traverse HTML Elements: Once in the Developer Tools, explore the HTML elements to locate the specific data you want to scrape. Look for unique identifiers, classes, or tags associated with the desired information.
  3. Pinpoint CSS Selectors: Take note of the CSS selectors that correspond to the elements you're interested in. These selectors serve as essential markers for your Python script, helping it identify and gather the desired data.
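
Before wiring a selector into your script, it's worth sanity-checking that it matches what you expect. Here is a small sketch that tests a selector against a saved copy of the page; zillow_serp.html is a hypothetical file containing the HTML fetched in the previous step:

from bs4 import BeautifulSoup

# Load the HTML saved from the previous step (hypothetical file name)
with open("zillow_serp.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Count how many property cards the selector matches
cards = soup.select('article[data-test="property-card"]')
print(f"Selector matched {len(cards)} property cards")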

Parsing HTML with BeautifulSoup

Once we've fetched the HTML content from Zillow using the requests library and identified our CSS selectors, the next step is to parse this content and extract the information we need. This is where BeautifulSoup comes into play, helping us navigate and search the HTML structure effortlessly.

In our example, we'll grab the link to each property listed on the chosen Zillow search page. Afterwards, we'll use these links to extract key details about each property. Now, let's enhance our existing script to gather this information directly from the HTML.

import requests
from bs4 import BeautifulSoup
import json

def get_property_urls(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/122.0'}

    response = requests.get(url, headers=headers)

    property_page_urls = []

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')

        property_page_urls = [property['href'] for property in soup.select('div[id="grid-search-results"] > ul > li[class^="ListItem-"] article[data-test="property-card"] a[data-test="property-card-link"]')]

    else:
        print(f'Error: {response.status_code}')

    return property_page_urls

def main():
    url = "https://www.zillow.com/columbia-heights-washington-dc/sale/"
    results = get_property_urls(url)

    print(json.dumps(results, indent=2))

if __name__ == "__main__":
    main()

But will the HTML we receive using requests contain the required information? Let's see the output of the above script:

[
  "https://www.zillow.com/homedetails/1429-Girard-St-NW-101-Washington-DC-20009/2053968963_zpid/",
  "https://www.zillow.com/homedetails/1439-Euclid-St-NW-APT-301-Washington-DC-20009/68081615_zpid/",
  "https://www.zillow.com/homedetails/1362-Newton-St-NW-Washington-DC-20010/472850_zpid/",
  "https://www.zillow.com/homedetails/1362-Parkwood-Pl-NW-Washington-DC-20010/472302_zpid/",
  "https://www.zillow.com/homedetails/1458-Columbia-Rd-NW-APT-300-Washington-DC-20009/82293130_zpid/",
  "https://www.zillow.com/homedetails/1438-Meridian-Pl-NW-APT-106-Washington-DC-20010/467942_zpid/",
  "https://www.zillow.com/homedetails/2909-13th-St-NW-Washington-DC-20009/473495_zpid/",
  "https://www.zillow.com/homedetails/1421-Columbia-Rd-NW-APT-B4-Washington-DC-20009/467706_zpid/",
  "https://www.zillow.com/homedetails/2516-12th-St-NW-Washington-DC-20009/473993_zpid/"
]

You will observe that the output only captures a portion of the anticipated results. This limitation arises because Zillow utilizes JavaScript/Ajax to dynamically load search results on its SERP page. When you make an HTTP request to the Zillow URL, the HTML response lacks a significant portion of the search results, resulting in the absence of valuable information. The dynamically loaded content is not present in the initial HTML response, making it challenging to retrieve the complete set of data through a static request.

Drawbacks and Challenges of the Common Approach

While the common approach of using Python's requests library and BeautifulSoup for Zillow scraping is a straightforward method, it comes with certain drawbacks and challenges:

  • Dynamic Content Loading: Zillow, like many modern websites, often uses dynamic content loading techniques with JavaScript. The common approach relies on static HTML parsing, making it challenging to retrieve data that is loaded dynamically after the initial page load.
  • Website Structure Changes: Web scraping is sensitive to changes in the HTML structure of a website. If Zillow updates its website layout, adds new elements, or modifies class names, it can break the scraper. Regular maintenance is required to adapt to any structural changes.
  • Rate Limiting and IP Blocking: Zillow may have rate-limiting mechanisms in place to prevent excessive requests from a single IP address in a short period. Continuous and aggressive scraping using the common approach may lead to temporary or permanent IP blocking, impacting the scraper's reliability.
  • Limited Scalability: As the common approach relies on synchronous requests, scalability becomes an issue when dealing with a large volume of data. Making numerous sequential requests can be time-consuming, hindering the efficiency of the scraping process.
  • No Built-in Handling of JavaScript: Since the common approach does not handle JavaScript execution, any data loaded dynamically through JavaScript will be missed. This limitation is particularly relevant for websites, like Zillow, that heavily rely on JavaScript for content presentation.

To overcome these challenges and ensure a more robust and scalable solution, we'll explore the advantages of using the Crawlbase Crawling API in the subsequent sections of this guide. This API offers solutions to many of the limitations posed by the common approach, providing a more reliable and efficient way to scrape real estate data from Zillow.

Using Crawlbase Crawling API for Zillow

Now, let's explore a more advanced and efficient method for Zillow scraping using the Crawlbase Crawling API. This approach offers several advantages over the common method and addresses its limitations. Its parameters allow us to handle various scraping tasks effortlessly.

Here's a step-by-step guide on harnessing the power of this dedicated API:

Crawlbase Account Creation and API Token Retrieval

Extracting data through the Crawlbase Crawling API starts with establishing your presence on the Crawlbase platform. Let's walk through the steps of creating an account and obtaining your essential API token:

  1. Visit Crawlbase: Launch your web browser and go to the Signup page on the Crawlbase website to commence your registration.
  2. Input Your Credentials: Provide your email address and create a secure password for your Crawlbase account. Accuracy in filling in the required details is crucial.
  3. Verification Steps: Upon submitting your details, check your inbox for a verification email. Complete the steps outlined in the email to verify your account.
  4. Log into Your Account: Once your account is verified, return to the Crawlbase website and log in using the credentials you established.
  5. Obtain Your API Token: Accessing the Crawlbase Crawling API necessitates an API token, which you can locate in your account documentation.

Quick Note: Crawlbase offers two types of tokens – one tailored for static websites and another designed for dynamic or JavaScript-driven websites. Since our focus is on scraping Zillow, we will utilize the JS token. As an added perk, Crawlbase extends an initial allowance of 1,000 free requests for the Crawling API, making it an optimal choice for our web scraping endeavor.

Accessing the Crawling API with the Crawlbase Library

The Crawlbase library in Python facilitates seamless interaction with the API, allowing you to integrate it into your Zillow scraping project effortlessly. The following code snippet demonstrates how to initialize and use the Crawling API through the Crawlbase Python library.

from crawlbase import CrawlingAPI

API_TOKEN = 'YOUR_CRAWLBASE_JS_TOKEN'
crawling_api = CrawlingAPI({'token': API_TOKEN})

url = "https://www.zillow.com/columbia-heights-washington-dc/sale/"

response = crawling_api.get(url)

if response['headers']['pc_status'] == '200':
    html_content = response['body'].decode('utf-8')
    print(html_content)
else:
    print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")

Detailed documentation of the Crawling API is available on the Crawlbase platform. You can read it here. If you want to learn more about the Crawlbase Python library and see additional examples of its usage, you can find the documentation here.

Scraping Property Page URLs from the SERP

To extract all the URLs of property pages from Zillow's SERP, we'll enhance our common script by bringing in the Crawling API. Zillow, like many modern websites, employs dynamic elements that load asynchronously through JavaScript. We'll incorporate the ajax_wait and page_wait parameters to ensure our script captures all relevant property URLs.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import json

def get_property_urls(api, url):
    options = {
        'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/122.0',
        'ajax_wait': 'true',
        'page_wait': 5000
    }

    response = api.get(url, options)

    property_page_urls = []

    if response['headers']['pc_status'] == '200':
        html_content = response['body'].decode('utf-8')
        soup = BeautifulSoup(html_content, 'html.parser')

        property_page_urls = [property['href'] for property in soup.select('div[id="grid-search-results"] > ul > li[class^="ListItem-"] article[data-test="property-card"] a[data-test="property-card-link"]')]

    else:
        print(f'Error: {response["headers"]["pc_status"]}')

    return property_page_urls

def main():
    API_TOKEN = 'YOUR_CRAWLBASE_JS_TOKEN'
    crawling_api = CrawlingAPI({'token': API_TOKEN})
    serp_url = "https://www.zillow.com/columbia-heights-washington-dc/sale/"

    results = get_property_urls(crawling_api, serp_url)

    print(json.dumps(results, indent=2))

if __name__ == "__main__":
    main()

Example Output:

[
  "https://www.zillow.com/homedetails/1429-Girard-St-NW-101-Washington-DC-20009/2053968963_zpid/",
  "https://www.zillow.com/homedetails/1439-Euclid-St-NW-APT-301-Washington-DC-20009/68081615_zpid/",
  "https://www.zillow.com/homedetails/1362-Newton-St-NW-Washington-DC-20010/472850_zpid/",
  "https://www.zillow.com/homedetails/1362-Parkwood-Pl-NW-Washington-DC-20010/472302_zpid/",
  "https://www.zillow.com/homedetails/1458-Columbia-Rd-NW-APT-300-Washington-DC-20009/82293130_zpid/",
  "https://www.zillow.com/homedetails/1438-Meridian-Pl-NW-APT-106-Washington-DC-20010/467942_zpid/",
  "https://www.zillow.com/homedetails/2909-13th-St-NW-Washington-DC-20009/473495_zpid/",
  "https://www.zillow.com/homedetails/1421-Columbia-Rd-NW-APT-B4-Washington-DC-20009/467706_zpid/",
  "https://www.zillow.com/homedetails/2516-12th-St-NW-Washington-DC-20009/473993_zpid/",
  "https://www.zillow.com/homedetails/2617-University-Pl-NW-1-Washington-DC-20009/334524041_zpid/",
  "https://www.zillow.com/homedetails/1344-Kenyon-St-NW-Washington-DC-20010/473267_zpid/",
  "https://www.zillow.com/homedetails/2920-Georgia-Ave-NW-UNIT-304-Washington-DC-20001/126228603_zpid/",
  "https://www.zillow.com/homedetails/2829-13th-St-NW-1-Washington-DC-20009/2055076326_zpid/",
  "https://www.zillow.com/homedetails/1372-Monroe-St-NW-UNIT-A-Washington-DC-20010/71722141_zpid/"
  ..... more
]

Handling Pagination for Extensive Data Retrieval

To ensure comprehensive data retrieval from Zillow, we need to address pagination. Zillow organizes search results across multiple pages, each identified by a page number in the URL via the {pageNo}_p path segment. Let's modify our existing script to handle pagination and collect property URLs from multiple pages.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import time
import json

def fetch_html(api, url, options, max_retries=2):
    retries = 0
    while retries <= max_retries:
        try:
            response = api.get(url, options)

            if response['headers']['pc_status'] == '200':
                return response['body'].decode('utf-8')
            else:
                raise Exception(f'Response with pc_status: {response["headers"]["pc_status"]}')

        except Exception as e:
            print(f'Exception: {str(e)}')
            retries += 1
            if retries <= max_retries:
                print(f'Retrying ({retries}/{max_retries})...')
                time.sleep(1)

    print(f'Maximum retries reached. Unable to fetch data from {url}')
    return None

def get_property_urls(api, base_url, options, max_pages):
    # Fetch the first page to determine the actual number of pages
    first_page_url = f"{base_url}1_p/"
    first_page_html = fetch_html(api, first_page_url, options)

    if first_page_html is not None:
        first_page_soup = BeautifulSoup(first_page_html, 'html.parser')

        # Extract the total number of pages available
        pagination_max_element = first_page_soup.select_one('div.search-pagination > nav > li:nth-last-child(3)')
        total_pages = int(pagination_max_element.text) if pagination_max_element else 1
    else:
        return []

    # Determine the final number of pages to scrape
    actual_max_pages = min(total_pages, max_pages)

    all_property_page_urls = []

    for page_number in range(1, actual_max_pages + 1):
        url = f"{base_url}{page_number}_p/"
        page_html = fetch_html(api, url, options)

        if page_html is not None:
            soup = BeautifulSoup(page_html, 'html.parser')

            property_page_urls = [property['href'] for property in soup.select('div[id="grid-search-results"] > ul > li[class^="ListItem-"] article[data-test="property-card"] a[data-test="property-card-link"]')]

            all_property_page_urls.extend(property_page_urls)

    return all_property_page_urls

def main():
    API_TOKEN = 'YOUR_CRAWLBASE_JS_TOKEN'
    crawling_api = CrawlingAPI({'token': API_TOKEN})
    serp_url = "https://www.zillow.com/columbia-heights-washington-dc/sale/"
    options = {
        'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/122.0',
        'ajax_wait': 'true',
        'page_wait': 5000
    }
    max_pages = 2  # Adjust the number of pages to scrape as needed

    property_page_urls = get_property_urls(crawling_api, serp_url, options, max_pages)

    # further process the property_page_urls

if __name__ == "__main__":
    main()

The first function, fetch_html, is designed to retrieve the HTML content of a given URL using an API, with the option to specify parameters. It incorporates a retry mechanism, attempting the request up to a specified number of times (default is 2) in case of errors or timeouts. The function returns the decoded HTML content if the server responds with a success status (HTTP 200), and if not, it raises an exception with details about the response status.

The second function, get_property_urls, aims to collect property URLs from multiple pages on a specified website. It first fetches the HTML content of the initial page to determine the total number of available pages. Then, it iterates through the pages, fetching and parsing the HTML to extract property URLs. The maximum number of pages to scrape is determined by the minimum of the total available pages and the specified maximum pages parameter. The function returns a list of property URLs collected from the specified number of pages.

Extracting Required Data from Property Page URLs

Now that we have a comprehensive list of property page URLs, the next step is to extract the necessary data from each property page. Let's enhance our script to navigate through these URLs and gather relevant details such as property type, address, price, size, bedroom and bathroom counts, and other essential data points.

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import time
import json

def fetch_html(api, url, options, max_retries=2):
    ...  # body unchanged from the previous snippet

def get_property_urls(api, base_url, options, max_pages):
    ...  # body unchanged from the previous snippet

def scrape_properties_data(api, urls, options):
    properties_data = []

    for url in urls:
        page_html = fetch_html(api, url, options)

        if page_html is not None:
            soup = BeautifulSoup(page_html, 'html.parser')
            type_element = soup.select_one('div[data-testid="macro-data-view"] > div[data-renderstrat="inline"]:nth-child(3) div.dBmBNo:first-child > span')
            builtin_year_element = soup.select_one('div[data-testid="macro-data-view"] > div[data-renderstrat="inline"]:nth-child(3) div.dBmBNo:nth-child(2) > span')

            address_element = soup.select_one('div[data-testid="macro-data-view"] > div[data-renderstrat="inline"]:nth-child(2) div[class^="styles__AddressWrapper-"] > h1')
            price_element = soup.select_one('div[data-testid="macro-data-view"] > div[data-renderstrat="inline"]:nth-child(2) span[data-testid="price"] > span')
            size_element = soup.select_one('div[data-testid="macro-data-view"] > div[data-renderstrat="inline"]:nth-child(2) div[data-testid="bed-bath-sqft-facts"] > div[data-testid="bed-bath-sqft-fact-container"]:last-child > span:first-child')
            bedrooms_element = soup.select_one('div[data-testid="macro-data-view"] > div[data-renderstrat="inline"]:nth-child(2) div[data-testid="bed-bath-sqft-facts"] > div[data-testid="bed-bath-sqft-fact-container"]:first-child > span:first-child')
            bathrooms_element = soup.select_one('div[data-testid="macro-data-view"] > div[data-renderstrat="inline"]:nth-child(2) div[data-testid="bed-bath-sqft-facts"] > button > div[data-testid="bed-bath-sqft-fact-container"] > span:first-child')

            property_data = {
                'property url': url,
                'type': type_element.text.strip() if type_element else None,
                'address': address_element.text.strip() if address_element else None,
                'size': size_element.text.strip() if size_element else None,
                'price': price_element.text.strip() if price_element else None,
                'bedrooms': bedrooms_element.text.strip() if bedrooms_element else None,
                'bathrooms': bathrooms_element.text.strip() if bathrooms_element else None,
                'builtin year': builtin_year_element.text.strip() if builtin_year_element else None,
            }

            properties_data.append(property_data)

    return properties_data

def main():
    API_TOKEN = 'YOUR_CRAWLBASE_JS_TOKEN'
    crawling_api = CrawlingAPI({'token': API_TOKEN})
    serp_url = "https://www.zillow.com/columbia-heights-washington-dc/sale/"
    options = {
        'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/122.0',
        'ajax_wait': 'true',
        'page_wait': 5000
    }
    max_pages = 2  # Adjust the number of pages to scrape as needed

    property_page_urls = get_property_urls(crawling_api, serp_url, options, max_pages)

    properties_data = scrape_properties_data(crawling_api, property_page_urls, options)

    print(json.dumps(properties_data, indent=2))

if __name__ == "__main__":
    main()

This script introduces the scrape_properties_data function, which retrieves the HTML content from each property page URL and extracts the details we need. Adjust the data points to your requirements and perform any further processing as needed.

Example Output:

[
  {
    "property url": "https://www.zillow.com/homedetails/1008-Fairmont-St-NW-Washington-DC-20001/473889_zpid/",
    "type": "Townhouse",
    "address": "1008 Fairmont St NW,\u00a0Washington, DC 20001",
    "size": "1,801",
    "price": "$850,000",
    "bedrooms": "3",
    "bathrooms": "4",
    "builtin year": "Built in 1910"
  },
  {
    "property url": "https://www.zillow.com/homedetails/1429-Girard-St-NW-101-Washington-DC-20009/2053968963_zpid/",
    "type": "Stock Cooperative",
    "address": "1429 Girard St NW #101,\u00a0Washington, DC 20009",
    "size": "965",
    "price": "$114,745",
    "bedrooms": "2",
    "bathrooms": "1",
    "builtin year": "Built in 1966"
  },
  {
    "property url": "https://www.zillow.com/homedetails/1362-Parkwood-Pl-NW-Washington-DC-20010/472302_zpid/",
    "type": "Single Family Residence",
    "address": "1362 Parkwood Pl NW,\u00a0Washington, DC 20010",
    "size": "1,760",
    "price": "$675,000",
    "bedrooms": "3",
    "bathrooms": "2",
    "builtin year": "Built in 1911"
  },
  {
    "property url": "https://www.zillow.com/homedetails/3128-Sherman-Ave-NW-APT-1-Washington-DC-20010/2076798673_zpid/",
    "type": "Stock Cooperative",
    "address": "3128 Sherman Ave NW APT 1,\u00a0Washington, DC 20010",
    "size": "610",
    "price": "$117,000",
    "bedrooms": "1",
    "bathrooms": "1",
    "builtin year": "Built in 1955"
  },
  {
    "property url": "https://www.zillow.com/homedetails/1438-Meridian-Pl-NW-APT-106-Washington-DC-20010/467942_zpid/",
    "type": "Condominium",
    "address": "1438 Meridian Pl NW APT 106,\u00a0Washington, DC 20010",
    "size": "634",
    "price": "$385,000",
    "bedrooms": "2",
    "bathrooms": "2",
    "builtin year": "Built in 1910"
  },
  {
    "property url": "https://www.zillow.com/homedetails/2909-13th-St-NW-Washington-DC-20009/473495_zpid/",
    "type": "Townhouse",
    "address": "2909 13th St NW,\u00a0Washington, DC 20009",
    "size": "3,950",
    "price": "$1,025,000",
    "bedrooms": "7",
    "bathrooms": "3",
    "builtin year": "Built in 1909"
  },
  {
    "property url": "https://www.zillow.com/homedetails/1412-Chapin-St-NW-APT-1-Washington-DC-20009/183133784_zpid/",
    "type": "Condominium",
    "address": "1412 Chapin St NW APT 1,\u00a0Washington, DC 20009",
    "size": "724",
    "price": "$550,000",
    "bedrooms": "2",
    "bathrooms": "2",
    "builtin year": "Built in 2015"
  },
  ..... more
]

Saving Scraped Data in a Database

Once you've successfully extracted the desired data from Zillow property pages, it's a good practice to store this information systematically. One effective way is by utilizing a SQLite database to organize and manage your scraped real estate data. Below is an enhanced version of the script to integrate SQLite functionality and save the scraped data:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import sqlite3
import time

def fetch_html(api, url, options, max_retries=2):
    ...  # body unchanged from the previous snippet

def get_property_urls(api, base_url, options, max_pages):
    ...  # body unchanged from the previous snippet

def scrape_properties_data(api, urls, options):
    ...  # body unchanged from the previous snippet

def initialize_database(database_path='zillow_properties_data.db'):
    # Establish a connection to the SQLite database
    connection = sqlite3.connect(database_path)
    cursor = connection.cursor()

    # Create the 'properties' table if it doesn't exist
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS properties (
            id INTEGER PRIMARY KEY,
            url TEXT,
            type TEXT,
            address TEXT,
            price TEXT,
            size TEXT,
            bedrooms TEXT,
            bathrooms TEXT,
            builtin_year TEXT
        )
    ''')

    # Commit the changes and close the connection
    connection.commit()
    connection.close()

def insert_into_database(property_data, database_path='zillow_properties_data.db'):
    # Establish a connection to the SQLite database
    connection = sqlite3.connect(database_path)
    cursor = connection.cursor()

    # Insert property data into the 'properties' table
    cursor.execute('''
        INSERT INTO properties (url, type, address, price, size, bedrooms, bathrooms, builtin_year)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    ''', (
        property_data.get('property url'),
        property_data.get('type'),
        property_data.get('address'),
        property_data.get('price'),
        property_data.get('size'),
        property_data.get('bedrooms'),
        property_data.get('bathrooms'),
        property_data.get('builtin year')
    ))

    # Commit the changes and close the connection
    connection.commit()
    connection.close()

def main():
    API_TOKEN = 'YOUR_CRAWLBASE_JS_TOKEN'
    crawling_api = CrawlingAPI({'token': API_TOKEN})
    serp_url = "https://www.zillow.com/columbia-heights-washington-dc/sale/"
    options = {
        'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/122.0',
        'ajax_wait': 'true',
        'page_wait': 5000
    }
    max_pages = 2  # Adjust the number of pages to scrape as needed

    # Initialize the database
    initialize_database()

    property_page_urls = get_property_urls(crawling_api, serp_url, options, max_pages)

    properties_data = scrape_properties_data(crawling_api, property_page_urls, options)

    # Insert data into the database
    for property_data in properties_data:
        insert_into_database(property_data)


if __name__ == "__main__":
    main()

This script introduces two functions: initialize_database to set up the SQLite database table, and insert_into_database to insert each property's data into the database. The SQLite database file (zillow_properties_data.db) will be created in the script's directory. Adjust the table structure and insertion logic based on your specific data points.
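
To confirm that the inserts worked, you can query the database directly. Below is a minimal sketch, assuming the zillow_properties_data.db file and properties table created above:

import sqlite3

connection = sqlite3.connect('zillow_properties_data.db')
cursor = connection.cursor()

# Count the stored rows and preview a few records
cursor.execute('SELECT COUNT(*) FROM properties')
print('Rows stored:', cursor.fetchone()[0])

for address, price in cursor.execute('SELECT address, price FROM properties LIMIT 5'):
    print(address, '-', price)

connection.close()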

Snapshot of the properties table:

[Image: properties table snapshot]

Advantages of Using Crawlbase's Crawling API for Zillow Scraping

Scraping real estate data from Zillow becomes more efficient with Crawlbase's Crawling API. Here's why it stands out:

  • Efficient Dynamic Content Handling: Crawlbase's API adeptly manages dynamic content on Zillow, ensuring your scraper captures all relevant data, even with delays or dynamic changes.
  • Minimized IP Blocking Risk: Crawlbase reduces the risk of IP blocking by allowing you to switch IP addresses, enhancing the success rate of your Zillow scraping project.
  • Tailored Scraping Settings: Customize API requests with settings like user_agent, format, and country for adaptable and efficient scraping based on specific needs.
  • Pagination Made Simple: Crawlbase simplifies pagination handling with parameters like ajax_wait and page_wait, ensuring seamless navigation through Zillow's pages for extensive data retrieval.
  • Tor Network Support: For added privacy, Crawlbase supports the Tor network via the tor_network parameter, enabling secure scraping of onion websites.
  • Asynchronous Crawling: The API supports asynchronous crawling with the async parameter, enhancing the efficiency of large-scale Zillow scraping tasks.
  • Autoparsing for Data Extraction: Use the autoparse parameter for simplified data extraction in JSON format, reducing post-processing efforts.
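
For reference, most of these parameters are passed through the same options dictionary used in the earlier snippets. Here is an illustrative sketch combining a few of them; treat the exact values as assumptions and consult the Crawlbase documentation for what each parameter accepts:

from crawlbase import CrawlingAPI

crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_JS_TOKEN'})

# Illustrative combination of the parameters mentioned above
options = {
    'ajax_wait': 'true',   # wait for asynchronous content to load
    'page_wait': 5000,     # extra milliseconds before the page is captured
    'country': 'US',       # route the request through a US-based IP
    'autoparse': 'true'    # return parsed JSON instead of raw HTML
}

response = crawling_api.get('https://www.zillow.com/columbia-heights-washington-dc/sale/', options)
print(response['headers']['pc_status'])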

In summary, Crawlbase's Crawling API streamlines Zillow scraping with efficiency and adaptability, making it a robust choice for real estate data extraction projects.

Real Estate Insights: Analyzing Zillow Data

Once you've successfully scraped real estate data from Zillow, the wealth of information you've gathered opens up numerous possibilities for analysis and application in the real estate industry. Here are key insights into potential use cases and the exciting realm of data analysis and visualization:

Potential Use Cases for Real Estate Professionals

[Image: Zillow data use cases for real estate professionals]

Identifying Market Trends: Zillow data allows real estate professionals to identify market trends, such as price fluctuations, demand patterns, and popular neighborhoods. This insight aids in making informed decisions regarding property investments and sales strategies.

Property Valuation and Comparisons: Analyzing Zillow data enables professionals to assess property values and make accurate comparisons. This information is crucial for determining competitive pricing, understanding market competitiveness, and advising clients on realistic property valuations.

Targeted Marketing Strategies: By delving into Zillow data, real estate professionals can tailor their marketing strategies. They can target specific demographics, create effective advertising campaigns, and reach potential clients who are actively searching for properties matching certain criteria.

Investment Opportunities: Zillow data provides insights into potential investment opportunities. Real estate professionals can identify areas with high growth potential, emerging trends, and lucrative opportunities for property development or investment.

Client Consultations and Recommendations: Armed with comprehensive Zillow data, professionals can provide clients with accurate and up-to-date information during consultations. This enhances the credibility of recommendations and empowers clients to make well-informed decisions.

Data Analysis and Visualization Possibilities

[Image: Zillow data analysis and visualization]

Interactive Dashboards: Real estate professionals can create interactive dashboards using Zillow data. These dashboards offer a visual representation of market trends, property values, and other key metrics, making it easier to grasp complex information.

Geospatial Mapping: Utilizing geospatial mapping, professionals can visually represent property locations, neighborhood boundaries, and market hotspots. This aids in understanding geographical trends and planning strategic real estate moves.

Predictive Analytics: Applying predictive analytics to Zillow data allows professionals to forecast future market trends. This proactive approach enables them to stay ahead of market shifts and make informed decisions for their clients.

Comparative Market Analysis (CMA): Zillow data supports the creation of Comparative Market Analysis reports. These reports include detailed property comparisons, helping professionals guide clients on pricing strategies and property valuations.
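
As a starting point for analyses like these, the scraped SQLite data can be loaded into pandas. A minimal sketch, assuming the properties table created earlier and pandas installed (pip install pandas):

import sqlite3
import pandas as pd

connection = sqlite3.connect('zillow_properties_data.db')
df = pd.read_sql_query('SELECT * FROM properties', connection)
connection.close()

# Convert price strings like "$850,000" into numbers for analysis
df['price_usd'] = (
    df['price']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)

# Median price per property type: a simple market-trend summary
print(df.groupby('type')['price_usd'].median().sort_values())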

Final Thoughts

In the world of real estate data scraping from Zillow, simplicity and effectiveness play a vital role. While the common approach may serve its purpose, the Crawlbase Crawling API emerges as a smarter choice. Say goodbye to challenges and embrace a streamlined, reliable, and scalable solution with the Crawlbase Crawling API for Zillow scraping.

For those eager to explore data scraping from various platforms, feel free to dive into our comprehensive guides:

📜 How to Scrape Amazon
📜 How to Scrape Airbnb Prices
📜 How to Scrape Booking.com
📜 How to Scrape Expedia

Happy scraping! If you encounter any hurdles or need guidance, our dedicated team is here to support you on your journey through the realm of real estate data.

Frequently Asked Questions (FAQs)

Q1: Is scraping data from Zillow legal?

Web scraping is a complex legal area. While Zillow's terms of service generally allow browsing, systematic data extraction may be subject to restrictions. It is advisable to review Zillow's terms and conditions, including the robots.txt file. Always respect the website's policies and consider the ethical implications of web scraping.

Q2: Can I use Zillow data for commercial purposes?

The use of scraped data, especially for commercial purposes, depends on Zillow's policies. It is important to carefully review and adhere to Zillow's terms of service, including any guidelines related to data usage and copyright. Seeking legal advice is recommended if you plan to use the scraped data commercially.

Q3: Are there any limitations to using the Crawlbase Crawling API for Zillow scraping?

While the Crawlbase Crawling API is a robust tool, users should be aware of certain limitations. These may include rate limits imposed by the API, policies related to API usage, and potential adjustments needed due to changes in the structure of the target website. It is advisable to refer to the Crawlbase documentation for comprehensive information on API limitations.

Q4: How can I handle dynamic content on Zillow using the Crawlbase Crawling API?

The Crawlbase Crawling API provides mechanisms to handle dynamic content. Parameters such as ajax_wait and page_wait are essential tools for ensuring the API captures all relevant content, even if the web pages undergo dynamic changes during the scraping process. Adjusting these parameters based on the website's behavior helps in effective content retrieval.
