This article was originally posted on Crawlbase Blog.
This blog is a step-by-step guide to scraping Amazon PPC ad data with Python. Amazon PPC ads, or Sponsored Products, have become a pivotal component of Amazon's vast advertising ecosystem. These are the ads you see when you perform a search on Amazon, often labeled as "Sponsored" or "Ad." Scraping competitors' sponsored ads data gives you a lot more than a competitive edge. Scroll down to learn more about how Amazon ads can benefit your business, or head straight to scraping Amazon ads data by clicking here.
So, kick back, grab a cup of coffee, and let's get into how you can scrape Amazon PPC ad data using Python like a pro! 😉
1. Getting Started
Amazon has a large and expanding marketplace. Every month, about 200 million individuals shop on Amazon, and the marketplace now has over 2.5 million sellers selling their wares. A company can do everything it can to raise awareness of its brand and product, but in the early stages, it often needs to leverage someone else's brand to build its own. Smaller shops would struggle to reach such a customer base on their own, which is why they turn to platforms like Amazon for exposure. Almost 200,000 sellers on the marketplace generate annual sales of $100,000 or higher, and around 25,000 vendors earn more than $1 million.
Let's explore in more detail why you should scrape Amazon ads.
The Power of Amazon PPC Ads
Here's why these ads are so potent:
- Enhanced Visibility: Amazon PPC ads boost product visibility, helping your products appear at the top of relevant search results, even above organic listings. This increases the likelihood of potential customers seeing and clicking on your products.
- Precision Targeting: Amazon advertising offers laser-focused targeting. You can choose specific keywords, products, or categories to display your ads against, ensuring they reach the most relevant audience.
- Pay Only for Performance: With PPC, you pay only when a user clicks on your ad, which means you're not spending on mere impressions; you're investing in potential conversions.
- Data-Driven Insights: Amazon Sponsored Ads provide rich data and analytics on ad performance. You can track clicks, conversions, and other crucial metrics.
- Competitive Advantage: Leveraging Amazon PPC can give you an edge over competitors, especially when you're introducing a new product.
Why Scrape Amazon Sponsored Ads Data?
Scraping Amazon PPC ad data might not be the first idea that comes to mind, but it holds immense potential for e-commerce businesses. Here's why you should consider diving into the world of scraping Amazon PPC ad data:
- Competitive Analysis: By scraping data from Amazon PPC ads, you can gain insights into your competitors' advertising strategies. You can monitor their keywords, ad copy, and bidding strategies to stay ahead in the game.
- Optimizing Your Ad Campaigns: Accessing data from your own Amazon PPC campaigns allows you to analyze their performance in detail. You can identify what's working and what's not, helping you make data-driven decisions to optimize your ad spend.
- Discovering New Keywords: Scraping ad data can uncover valuable keywords that you might have missed in your initial research. These new keywords can be used to enhance your organic listings as well.
- Staying Informed: Amazon's ad system is dynamic. New products, new keywords, and changing trends require constant monitoring. Scraping keeps you informed about these changes and ensures your advertising strategy remains relevant.
- Research and Market Insights: Beyond your own campaigns, scraping Amazon PPC ad data provides a broader perspective on market trends and customer behavior. You can identify rising trends and customer preferences by analyzing ad data at scale.
In the subsequent sections of this guide, you'll delve into the technical aspects of scraping Amazon PPC ad data, unlocking the potential for a competitive advantage in the e-commerce world.
2. Getting Started with Crawlbase Crawling API
Whether you're new to web scraping or experienced in the field, you'll find that the Crawlbase Crawling API simplifies the process of extracting data from websites, including scraping Amazon search pages. Before we go into the specifics of using this API, let's take a moment to understand why it's essential and how it can benefit you.
Introducing Crawlbase Crawling API
Crawlbase Crawling API is one of the best web crawling tools that allows developers and businesses to easily scrape data from websites at scale. It's designed to simplify web scraping by providing a user-friendly interface and powerful features. With Crawlbase, you can automate the process of extracting data from websites, including Amazon search pages, saving you valuable time and effort.
Crawlbase offers a RESTful API that allows you to interact with its crawling infrastructure programmatically. This means you can send requests to the API, specifying the URLs you want to scrape along with the available query parameters, and receive the scraped data in a structured format, typically HTML or JSON. You can read more about the Crawlbase Crawling API here.
Why Choose Crawlbase Crawling API?
You might be wondering why you should opt for Crawlbase Crawling API when other web scraping tools and libraries are available. Here are some compelling reasons:
Scalability: Crawlbase is built for large-scale web scraping. Whether you need to scrape a few hundred pages or millions, Crawlbase can handle it, ensuring your scraping projects can grow with your needs.
Reliability: Web scraping can be demanding, as websites often change their structure. Crawlbase offers robust error handling and monitoring, reducing the chances of your scraping jobs failing unexpectedly.
Proxy Management: Many websites employ anti-scraping measures like IP blocking. Crawlbase provides rotating proxies to help you avoid IP bans and access data more reliably.
Convenience: With Crawlbase's API, you don't need to worry about creating and maintaining your own crawler or scraper. It's a cloud-based solution that handles the technical complexities, allowing you to focus on your data extraction tasks.
Real-time Data: With the Crawling API, you always have your hands on the newest data, since it crawls everything in real time. This is crucial for accurate analysis and decision-making.
Cost-Effective: Building and maintaining an in-house scraping solution can be expensive. The Crawling API is very cost-effective, and you pay only for what you use. You can calculate the pricing for Crawling API usage here.
Crawlbase Python Library
To harness the power of Crawlbase Crawling API, you can use the Crawlbase Python library. This library simplifies the integration of Crawlbase into your Python projects, making it accessible to Python developers of all levels of expertise.
First, initialize the Crawling API class.
from crawlbase import CrawlingAPI

api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_TOKEN' })
Pass the URL that you want to scrape by using the following function.
api.get(url, options = {})
Example:
response = api.get('https://www.facebook.com/britneyspears')
if response['status_code'] == 200:
    print(response['body'])
You can pass any options from the ones available in the API documentation.
Example:
response = api.get('https://www.reddit.com/r/pics/comments/5bx4bx/thanks_obama/', {
    'user_agent': 'Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/30.0',
    'format': 'json'
})
if response['status_code'] == 200:
    print(response['body'])
There are many other functionalities provided by Crawlbase Python library. You can read more about it here.
In the following sections, we will guide you through harnessing the capabilities of the Crawlbase Crawling API to scrape Amazon search pages effectively. We'll use Python, a versatile programming language, to demonstrate the process step by step. Let's explore Amazon's wealth of information and learn how to unlock its potential.
3. Understanding Amazon PPC Ads
Before delving into the technical aspects of scraping Amazon PPC ad data, it's crucial to understand Amazon Sponsored Ads, their different types, and the specific data you'll want to scrape. Let's start by decoding Amazon's advertising system.
Decoding Amazon's Advertising System
Amazon's advertising system lets sellers promote their products in various ways, such as Sponsored Products, Sponsored Brands, Sponsored Display, and more. Let's focus on the most common type: Sponsored Products.
Sponsored Products are a form of Amazon advertising that allows sellers to promote individual product listings within Amazon's search results. These ads are displayed prominently on search result pages and product detail pages.
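If you inspect a search results page in your browser's developer tools, you can see how sponsored listings are marked up. Here is a minimal sketch of telling sponsored results apart from organic ones once you have the page HTML; the "AdHolder" class and the visible "Sponsored" label are Amazon markup details that change over time, so treat these selectors as assumptions to verify yourself:

from bs4 import BeautifulSoup

def find_sponsored_asins(html_content):
    # Parse the search page and collect the ASINs of sponsored listings
    soup = BeautifulSoup(html_content, 'html.parser')
    sponsored = []
    for result in soup.select('div[data-asin][data-component-type="s-search-result"]'):
        # Sponsored listings typically carry the "AdHolder" class
        # and/or a visible "Sponsored" label (assumed markup)
        is_ad = 'AdHolder' in result.get('class', [])
        has_label = result.find(string='Sponsored') is not None
        if is_ad or has_label:
            sponsored.append(result['data-asin'])
    return sponsored

We'll rely on the same AdHolder marker when we build the full scraper in section 5.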
Types of PPC Ads on Amazon
Amazon offers a range of PPC ad types, and understanding them is crucial for an effective advertising strategy. Here's an overview of the main types:
- Sponsored Products: These ads promote individual product listings within search results and on product detail pages.
- Sponsored Brands: Formerly known as Headline Search Ads, Sponsored Brands allow advertisers to feature their brand logo, a custom headline, and a selection of products in a banner ad.
- Sponsored Display: This ad type is designed to reach audiences both on and off Amazon. It includes features like product targeting and audience targeting.
- Display Re-marketing: Advertisers can re-target users who have previously visited their product detail pages.
- Video Ads: Amazon offers in-stream video ads for brands to engage shoppers with video content.
- Stores: Amazon Stores are custom multi-page shopping destinations for brands to showcase their products.
The Data You Want to Scrape
Now that you have an understanding of Amazon's advertising, let's focus on the specific data you want to scrape from Amazon PPC ads. When scraping Amazon PPC ad data, the key information you'll typically aim to extract includes:
- Ad Campaign Information: This data provides insights into the overall performance of your ad campaigns. It includes campaign names, IDs, start and end dates, and budget details.
- Keyword Data: Keywords are the foundation of PPC advertising. You'll want to scrape keyword information, including the keywords used in your campaigns, their match types (broad, phrase, exact), and bid amounts.
- Ad Group Details: Ad groups help you organize your ads based on common themes. Scraping ad group data allows you to understand the structure of your campaigns.
- Ad Performance Metrics: Essential metrics include the number of clicks, impressions, CTR, conversion rate, total spend, and more. These metrics help you evaluate the effectiveness of your ads.
- Product Information: Extracting data about the advertised products, such as ASIN, product titles, prices, and image URLs, is vital for optimizing ad content.
- Competitor Analysis: In addition to your own ad data, you might want to scrape competitor ad information to gain insights into their strategies and keyword targeting.
Understanding these core elements and the specific data you aim to scrape will be instrumental as you progress in scraping Amazon PPC ad data using Python and the Crawlbase Crawling API. In the subsequent sections, you'll learn how to turn this understanding into actionable technical processes.
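One way to keep these fields organized in code is a small record type. Below is a minimal sketch covering the publicly visible fields; the field names are our own convention, not an Amazon schema, and campaign-level details like budgets and bids are only available to the campaign owner through the Ads console:

from dataclasses import dataclass
from typing import Optional

@dataclass
class SponsoredAd:
    # One scraped sponsored listing (our own convention, not an Amazon schema)
    asin: str                             # Amazon Standard Identification Number
    title: str                            # product title shown in the ad
    price: Optional[str] = None           # displayed price, if present
    image_url: Optional[str] = None       # main product image URL
    search_keyword: Optional[str] = None  # the query that surfaced the ad
    position: Optional[int] = None        # rank of the ad on the results page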
4. Prerequisites
Before we embark on our web scraping journey, let's ensure that you have all the necessary tools and resources ready. In this chapter, we'll cover the prerequisites needed for successful web scraping of Amazon search pages using the Crawlbase Crawling API.
Setting Up Your Development Environment
You'll need a suitable development environment to get started with web scraping. Here's what you'll require:
Python:
Python is a versatile programming language widely used in web scraping. Ensure that you have Python installed on your system. You can download the latest version of Python from the official website here.
Code Editor or IDE:
Choose a code editor or integrated development environment (IDE) for writing and running your Python code. Popular options include PyCharm, and Jupyter Notebook. You can also use Google Colab. Select the one that best suits your preferences and workflow.
Installing Required Libraries
Web scraping in Python is made more accessible by libraries that simplify tasks like making HTTP requests, parsing HTML, and handling data. Install the following libraries using pip, Python's package manager:
pip install pandas
pip install crawlbase
pip install beautifulsoup4
- Pandas: Pandas is a powerful data manipulation library that will help you organize and analyze the scraped data efficiently.
- Crawlbase: A lightweight, dependency-free Python class that acts as a wrapper for the Crawlbase API.
- Beautiful Soup: Beautiful Soup is a Python library that makes it easy to parse HTML and extract data from web pages.
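To confirm all three libraries installed correctly, you can run a quick import check (version output will vary by release):

import pandas
import bs4
import crawlbase

print("pandas:", pandas.__version__)
print("beautifulsoup4:", bs4.__version__)
print("crawlbase ready:", hasattr(crawlbase, 'CrawlingAPI'))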
Creating a Crawlbase Account
To access the Crawlbase Crawling API, you'll need a Crawlbase account. If you don't have one, follow these steps to create an account:
- Click here to create a new Crawlbase Account.
- Fill in the required information, including your name, email address, and password.
- Verify your email address by clicking the verification link sent to your inbox.
- Once your email is verified, you can access your Crawlbase dashboard.
Now that your development environment is set up and you have a Crawlbase account ready, let's proceed to the next steps, where we'll get your Crawlbase token and start making requests to the Crawlbase Crawling API.
5. Amazon PPC Ad Scraping - Step by Step
Now that we've established the groundwork, it's time to dive into the technical process of scraping Amazon PPC ad data step by step. This section will guide you through the entire journey, from making HTTP requests to Amazon and navigating search result pages to structuring your scraper for extracting ad data. We'll also explore handling pagination to unearth more ads.
Getting the Correct Crawlbase Token
We must obtain an API token before we can unleash the power of the Crawlbase Crawling API. Crawlbase provides two types of tokens: the Normal Token (TCP) for static websites and the JavaScript Token (JS) for dynamic or JavaScript-driven websites. Given that Amazon relies heavily on JavaScript for dynamic content loading, we will opt for the JavaScript Token.
from crawlbase import CrawlingAPI
# Initialize the Crawling API with your Crawlbase JavaScript token
api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_JS_TOKEN' })
You can get your Crawlbase token here after creating an account.
Setting up Crawlbase Crawling API
Armed with our JavaScript token, we're ready to set up the Crawlbase Crawling API. But before we proceed, let's delve into the structure of the output response. The response you receive can come in two formats: HTML or JSON. The default choice for the Crawling API is HTML format.
HTML response:

Headers:
  url: "The URL which was crawled"
  original_status: 200
  pc_status: 200

Body:
  The HTML of the page
To get the response in JSON format, pass the parameter "format" with the value "json".
JSON Response:
{
  "original_status": "200",
  "pc_status": 200,
  "url": "The URL which was crawled",
  "body": "The HTML of the page"
}
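For instance, here is a minimal sketch of requesting the JSON format. It assumes, as with the default format, that the library hands back the payload in response['body'], so the page HTML ends up nested inside the parsed JSON:

import json
from crawlbase import CrawlingAPI

api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_JS_TOKEN' })

# Ask the Crawling API to wrap its response in JSON
response = api.get('https://www.amazon.com/s?k=headphones', { 'format': 'json' })

if response['status_code'] == 200:
    # The body is a JSON document with crawl metadata and the page HTML
    data = json.loads(response['body'])
    print(data['original_status'], data['pc_status'])
    html = data['body']  # the actual page HTML lives here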
We can read more about the Crawling API response here. For this example, we will go with the default option. We'll utilize the initialized API object to make requests. Specify the URL you intend to scrape using the api.get(url, options={}) function.
from crawlbase import CrawlingAPI

# Initialize the Crawling API with your Crawlbase token
api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_JS_TOKEN' })

# URL of the Amazon search page you want to scrape
amazon_search_url = 'https://www.amazon.com/s?k=headphones'

# Make a request to scrape the Amazon search page
response = api.get(amazon_search_url)

# Check if the request was successful
if response['status_code'] == 200:
    # Extract the HTML content after decoding the byte data
    # (latin1 decodes any byte value, so this won't raise errors)
    html_content = response['body'].decode('latin1')
    # Save the HTML content to a file
    with open('output.html', 'w', encoding='utf-8') as file:
        file.write(html_content)
else:
    print("Failed to retrieve the page. Status code:", response['status_code'])
In the provided code snippet, we're safeguarding the acquired HTML content by storing it in an HTML file. This action is crucial to confirm the successful acquisition of the targeted HTML data. We can then review the file to inspect the specific content contained within the crawled HTML.
output.html Preview:
If you open the file, you'll see that hardly any useful information is present in the crawled HTML. This is because Amazon loads its important content dynamically using JavaScript and Ajax.
Handling Dynamic Content
Much like numerous contemporary websites, Amazon's search pages employ dynamic content loading through JavaScript rendering and Ajax calls. This dynamic behavior can present challenges when attempting to scrape data from these pages. Nonetheless, thanks to the Crawlbase Crawling API, these challenges can be effectively addressed. We can leverage the following query parameters provided by the Crawling API to tackle this issue.
Incorporating Parameters
When using the JavaScript token in conjunction with the Crawlbase API, you have the capability to define specific parameters that ensure the accurate capture of dynamically rendered content. Several pivotal parameters include:
- page_wait: This parameter, although optional, empowers you to specify the duration in milliseconds to await before the browser captures the resultant HTML code. Deploy this parameter in scenarios where a page necessitates additional time for rendering or when AJAX requests must be fully loaded before HTML capture.
- ajax_wait: Another optional parameter tailored for the JavaScript token. It grants you the ability to indicate whether the script should await the completion of AJAX requests prior to receiving the HTML response. This proves invaluable when content relies on the execution of AJAX requests.
To use these parameters in our example, we can update our code like this:
from crawlbase import CrawlingAPI

# Initialize the Crawling API with your Crawlbase token
api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_JS_TOKEN' })

# URL of the Amazon search page you want to scrape
amazon_search_url = 'https://www.amazon.com/s?k=headphones'

# Options for the Crawling API
options = {
    'page_wait': 2000,
    'ajax_wait': 'true'
}

# Make a request to scrape the Amazon search page with options
response = api.get(amazon_search_url, options)

# Check if the request was successful
if response['status_code'] == 200:
    # Extract the HTML content after decoding the byte data
    html_content = response['body'].decode('latin1')
    # Save the HTML content to a file
    with open('output.html', 'w', encoding='utf-8') as file:
        file.write(html_content)
else:
    print("Failed to retrieve the page. Status code:", response['status_code'])
Crawling API provides many other important parameters. You can read about them here.
Extracting Ad Data And Saving into SQLite Database
Now that we have successfully acquired the HTML content of Amazon's dynamic search pages, it's time to extract the valuable Amazon PPC ad data from the retrieved content. For this example, we will extract the title and price of each ad.
After extracting this data, it's prudent to store it systematically. For this purpose, we'll employ SQLite, a lightweight and efficient relational database system that seamlessly integrates with Python. SQLite is an excellent choice for local storage of structured data, and in this context, it's a perfect fit for preserving the scraped Amazon PPC ad data.
import sqlite3
from bs4 import BeautifulSoup
from crawlbase import CrawlingAPI

# Function to initialize the SQLite database
def initialize_db(db_name):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    # Create a table if it doesn't exist
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS ppc_ads (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            price TEXT,
            title TEXT
        )
    ''')
    # Commit the table creation
    conn.commit()
    return conn, cursor

# Function to insert data into the database
def insert_data(conn, cursor, price_text, title_text):
    # Insert the data into the database
    cursor.execute('INSERT INTO ppc_ads (price, title) VALUES (?, ?)', (price_text, title_text))
    conn.commit()

# Initialize the database
db_name = 'ppc_ads.db'
conn, cursor = initialize_db(db_name)

# Initialize the Crawling API with your Crawlbase token
api = CrawlingAPI({ 'token': 'YOUR_CRAWLBASE_JS_TOKEN' })

# URL of the Amazon search page you want to scrape
amazon_search_url = 'https://www.amazon.com/s?k=headphones'

# Options for the Crawling API
options = {
    'page_wait': 2000,
    'ajax_wait': 'true'
}

# Make a request to scrape the Amazon search page with options
response = api.get(amazon_search_url, options)

# Check if the request was successful
if response['status_code'] == 200:
    # Extract the HTML content after decoding the byte data
    html_content = response['body'].decode('latin1')

    # Parse the HTML content using Beautiful Soup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Select PPC ad div elements
    ads = soup.select('.AdHolder div[data-asin], div[data-asin][data-component-type="s-search-result"].AdHolder')

    # Extract information from each ad and insert it into the database
    for ad in ads:
        # Extract the price inside the ad div
        price = ad.select_one('span.a-price span.a-offscreen')
        if price:
            price_text = price.text.strip()
        else:
            price_text = "Price not found"

        # Extract the title inside the ad div
        title = ad.select_one('div.a-section h2 a.a-link-normal span, div.a-section a.a-link-normal span.a-offscreen')
        if title:
            title_text = title.text.strip()
        else:
            title_text = "Title not found"

        # Insert the data into the database
        insert_data(conn, cursor, price_text, title_text)
else:
    print("Failed to retrieve the page. Status code:", response['status_code'])

# Close the database connection
conn.close()
Example Output:
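To inspect what the script stored, you can load the table into a pandas DataFrame (we installed pandas in the prerequisites); the exact rows you see will depend on which ads Amazon served when you crawled:

import sqlite3
import pandas as pd

# Load the scraped ads back out of SQLite for a quick look
conn = sqlite3.connect('ppc_ads.db')
df = pd.read_sql_query('SELECT * FROM ppc_ads', conn)
conn.close()

print(df.head())              # first few scraped ads
print(len(df), "ads stored")  # total rows in the table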
This Python script demonstrates the process of scraping Amazon's search page for PPC ads. It begins by initializing an SQLite database, creating a table to store the scraped data, including the ad ID, price, and title. The insert_data function is defined to insert the extracted data into this database. The script then sets up the Crawlbase API for web crawling, specifying options for page and AJAX waiting times to handle dynamically loaded content effectively.
After successfully retrieving the Amazon search page using the Crawlbase API, the script utilizes BeautifulSoup for parsing the HTML content. It specifically targets PPC ad elements on the page. For each ad element, the script extracts the price and title information. It verifies the existence of these details and cleans them before inserting them into the SQLite database using the insert_data function. The script concludes by properly closing the database connection. In essence, this script showcases the complete process of web scraping, data extraction, and local storage, essential for various data analysis and usage scenarios.
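As for the pagination mentioned at the start of this section: Amazon search results span multiple pages, reachable through the page query parameter in the URL. Here is a hedged sketch of extending the scraper to cover several pages. It reuses the api, options, conn, cursor, and insert_data objects from the script above and must run before the conn.close() call; the page parameter and the ad selectors are the same assumptions as before, and deeper pages tend to carry fewer ads:

from bs4 import BeautifulSoup

for page in range(1, 4):  # pages 1 to 3; adjust the cap to your needs
    paged_url = f'https://www.amazon.com/s?k=headphones&page={page}'
    response = api.get(paged_url, options)
    if response['status_code'] != 200:
        print(f"Page {page} failed with status:", response['status_code'])
        continue
    soup = BeautifulSoup(response['body'].decode('latin1'), 'html.parser')
    # Same ad selectors as in the main script
    ads = soup.select('.AdHolder div[data-asin], div[data-asin][data-component-type="s-search-result"].AdHolder')
    for ad in ads:
        price = ad.select_one('span.a-price span.a-offscreen')
        title = ad.select_one('div.a-section h2 a.a-link-normal span, div.a-section a.a-link-normal span.a-offscreen')
        insert_data(conn, cursor,
                    price.text.strip() if price else "Price not found",
                    title.text.strip() if title else "Title not found")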
6. Final Words
So this was scraping Amazon Sponsored Ads. If you're interested in more guides like these, check out the links below:
📜 How to Scrape Amazon Reviews
📜 How to Scrape Amazon Search Pages
📜 How to Scrape Amazon Product Data
For additional help and support, check out the guides on scraping Amazon ASINs, Amazon reviews in Node, Amazon images, and Amazon data in Ruby.
We have also written guides on other e-commerce sites, like scraping product data from Walmart, eBay, and AliExpress, just in case you're scraping them ;).
Feel free to reach out to us here for questions and queries.
7. Frequently Asked Questions
Q. What is Amazon PPC advertising?
Amazon PPC advertising allows sellers and advertisers to promote their products on the Amazon platform. These ads are displayed within Amazon's search results and product detail pages, helping products gain enhanced visibility. Advertisers pay a fee only when a user clicks on their ad. It's a cost-effective way to reach potential customers who are actively searching for products.
Q. Why is scraping Amazon PPC ad data important?
Scraping Amazon data helps leverage data-driven insights to enhance the performance of PPC campaigns, boost visibility, and maximize ROI. Firstly, it enables businesses to gain insights into their competitors' advertising strategies, such as keywords, ad copy, and bidding techniques. Secondly, it allows advertisers to optimize their own ad campaigns by analyzing performance metrics. Additionally, scraping can uncover valuable keywords for improving organic listings. Moreover, it keeps businesses informed about changes in Amazon's ad system and provides broader market insights, helping them stay ahead in the dynamic e-commerce landscape.
Q. What is the Crawlbase Crawling API?
The Crawlbase Crawling API is a sophisticated web scraping tool that simplifies the process of extracting data from websites at scale. It offers developers and businesses an automated and user-friendly means of gathering information from web pages. One of its noteworthy features is automatic IP rotation, which enhances data extraction by dynamically changing the IP address for each request, reducing the risk of IP blocking or restrictions. Users can send requests to the API, specifying the URLs to scrape, along with query parameters, and in return, they receive the scraped data in structured formats like HTML or JSON. This versatile tool is invaluable for those seeking to collect data from websites efficiently and without interruption.
Q. How can I get started with web scraping using Crawlbase and Python?
To get started with web scraping using Crawlbase and Python, follow these steps:
- Ensure you have Python installed on your system.
- Choose a code editor or integrated development environment (IDE) for writing your Python code.
- Install the necessary libraries, such as BeautifulSoup4 and the Crawlbase library, using pip.
- Create a Crawlbase account to obtain an API token.
- Set up the Crawlbase Python library and initialize the Crawling API with your token.
- Make requests to the Crawlbase Crawling API to scrape data from websites, specifying the URLs and any query parameters.
- Save the scraped data and analyze it as needed for your specific use case.