How to Use cURL For Web Scraping

Scrapfly - Mar 12 - - Dev Community

cURL is one of the oldest tools used for sending HTTP requests. Yet, it's still a great asset for the web scraping toolbox.

In this article, we'll go over a step-by-step guide on sending and configuring HTTP requests with cURL. We'll also explore advanced usages of cURL for web scraping, such as scraping dynamic pages and avoiding getting blocked. Let's get started!

What is cURL and Why Use It?

cURL, short for "client for URL", is an open-source command-line tool for transferring data with URLs. It's built on top of the libcurl C library and supports the common HTTP methods (GET, POST, PUT, etc.) over several protocols, including HTTP and HTTPS.

cURL is not only fast and straightforward, but it also provides comprehensive request configuration, including:

  • Adding custom headers and cookies.
  • Enabling or disabling request redirects.
  • Downloading binary files.

This makes cURL a viable web scraping tool for debugging and developing scraping scripts, or even for extracting small amounts of data.

How To Install cURL?

Before we start web scraping with cURL, we must install it. cURL comes pre-installed on almost all operating systems. However, if it isn't found or needs upgrading, run the commands below to install it.

Linux (Debian/Ubuntu)

$ sudo apt-get install curl

Mac

$ brew install curl

Windows

$ choco install curl

To verify your installation, simply run the following command. You should receive the cURL version details:

$ curl --version
# curl 8.4.0 (Windows) libcurl/8.4.0 Schannel WinIDN
# Release-Date: 2023-10-11
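
When cURL is driven from a script, the same check can be automated. The following is a minimal sketch, assuming Python 3. The helper names `parse_curl_version` and `installed_curl_version` are hypothetical and not part of cURL. It looks for the binary on the PATH and parses the first line of the `curl --version` output shown above.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_curl_version(version_output: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from the output of `curl --version`."""
    match = re.match(r"curl (\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_curl_version() -> Optional[Tuple[int, ...]]:
    """Return the installed cURL version, or None if curl is not on PATH."""
    curl_path = shutil.which("curl")
    if curl_path is None:
        return None
    result = subprocess.run(
        [curl_path, "--version"], capture_output=True, text=True, check=True
    )
    return parse_curl_version(result.stdout)


# Parse the sample version line from this article (no cURL call needed).
print(parse_curl_version("curl 8.4.0 (Windows) libcurl/8.4.0 Schannel WinIDN"))
```

Calling `installed_curl_version()` performs the real check on the current machine and returns `None` when cURL is missing.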

How To Use cURL?

In this section, we'll explore the basics of cURL and how to use it to send different request types. Let's start with the most basic cURL usage: sending GET requests.

Sending GET Requests

cURL follows the syntax below for all request types:

curl [OPTIONS] URL
  • OPTIONS

    Represents the request options: configurations that can be passed to the request to specify headers, cookies, proxies, the request method and so on. To list the commonly used options, use the curl -h command. To view all the available ones, use the curl -h all command.

  • URL

    The actual URL to request.
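
The `curl [OPTIONS] URL` shape maps naturally onto an argument list when cURL is called from a script. Here is a minimal sketch, assuming Python 3.8 or newer. `build_curl_command` is a hypothetical helper, and the `-s` (silent) and `-L` (follow redirects) options are only illustrative choices.

```python
import shlex
from typing import List, Sequence


def build_curl_command(url: str, options: Sequence[str] = ()) -> List[str]:
    """Assemble an argv list following the `curl [OPTIONS] URL` syntax."""
    return ["curl", *options, url]


# -s silences the progress meter and -L follows redirects.
command = build_curl_command("https://httpbin.dev/get", ["-s", "-L"])

# shlex.join renders the list as a copy-pasteable POSIX shell command.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, capture_output=True, text=True)` to execute the request without going through a shell.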

To send a GET request with cURL, all we have to do is specify the URL to request, as it uses the GET method by default:

curl https://httpbin.dev/get

The above command will request the httpbin.dev/get endpoint and return the request details:

{
  "args": {},
  "headers": {
    "Accept": [
      "*/*"
    ],
    "Accept-Encoding": [
      "gzip"
    ],
    "Host": [
      "httpbin.dev"
    ],
    "User-Agent": [
      "curl/8.4.0"
    ]
  },
  "url": "https://httpbin.dev/get"
}
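
Note that httpbin.dev reports every header as a list of values, which matters when this output is consumed by a script. Below is a small sketch, assuming Python 3. It reads a hard-coded copy of the response above rather than making a live request.

```python
import json

# Sample copied from the response above (not fetched live).
sample_response = """
{
  "args": {},
  "headers": {
    "Accept": ["*/*"],
    "Accept-Encoding": ["gzip"],
    "Host": ["httpbin.dev"],
    "User-Agent": ["curl/8.4.0"]
  },
  "url": "https://httpbin.dev/get"
}
"""

payload = json.loads(sample_response)

# Each header maps to a list of values, so take the first entry.
user_agent = payload["headers"]["User-Agent"][0]
print(user_agent)
```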

We can see that the request has been sent successfully with the default cURL header configurations. Let's have a look at modifying them.

Adding Headers

To add headers with cURL, we can use the -H option once for each header. For example, here is how we can send a cURL request with User-Agent and Accept headers:

curl -H "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0" -H "Accept: application/json" https://httpbin.dev/headers

In the above cURL request, we override the cURL User-Agent and Accept headers with custom ones. The response will include the newly configured headers:

{
  "headers": {
    "Accept": [
      "application/json"
    ],
    "Accept-Encoding": [
      "gzip"
    ],
    "Host": [
      "httpbin.dev"
    ],
    "User-Agent": [
      "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0"
    ]
  }
}

Alternatively, we can change the cURL User-Agent header through the -A option:

curl -A "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0" https://httpbin.dev/headers
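
When the headers live in a script, repeating `-H` by hand gets tedious. The sketch below, assuming Python 3, expands a dictionary into the repeated `-H` options used above. `header_options` is a hypothetical helper name.

```python
from typing import Dict, List


def header_options(headers: Dict[str, str]) -> List[str]:
    """Expand a header mapping into repeated `-H "Name: value"` options."""
    options: List[str] = []
    for name, value in headers.items():
        options += ["-H", f"{name}: {value}"]
    return options


headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0",
    "Accept": "application/json",
}

# Same request as the -H example above, expressed as an argv list.
command = ["curl", *header_options(headers), "https://httpbin.dev/headers"]
print(command)
```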

Adding Cookies

Next, let's set cookies with cURL. For this, we can use the cURL -b option:

curl -b "cookie1=value1; cookie2=value2" https://httpbin.dev/cookies

The above command sends two cookies with the cURL request, and the response lists them:

{
  "cookie1": "value1",
  "cookie2": "value2"
}

Alternatively, we can treat the cookies as regular cURL headers and pass them through the cookie header:

curl -H "cookie: cookie1=value1; cookie2=value2" https://httpbin.dev/cookies
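
Both forms expect the cookies joined as `name=value` pairs separated by `; `. Here is a minimal sketch, assuming Python 3, that builds that string from a dictionary. `cookie_string` is a hypothetical helper, and it does no escaping, so it assumes simple cookie names and values like the ones in this article.

```python
from typing import Dict


def cookie_string(cookies: Dict[str, str]) -> str:
    """Join cookies into the `name=value; name=value` form accepted by -b."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


cookies = {"cookie1": "value1", "cookie2": "value2"}

# Same request as the -b example above, expressed as an argv list.
command = ["curl", "-b", cookie_string(cookies), "https://httpbin.dev/cookies"]
print(command[2])
```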

Sending POST Requests

In the previous sections, we have sent GET requests with cURL. In this one, we'll explain sending POST requests. To send POST requests with cURL, we can utilize the -X option, which determines the request HTTP method:

curl -X POST https://httpbin.dev/post

The above cURL command will send a POST request and return the request details:

{
  "args": {},
  "headers": {
    "Accept": [
      "*/*"
    ],
    "Accept-Encoding": [
      "gzip"
    ],
    "Content-Length": [
      "0"
    ],
    "Host": [
      "httpbin.dev"
    ],
    "User-Agent": [
      "curl/8.4.0"
    ]
  },
  "url": "https://httpbin.dev/post",
  "data": "",
  "files": null,
  "form": null,
  "json": null
}

In most cases, POST requests require a body. So, let's take a look at adding a request body with cURL requests.

Adding Request Body

To add a request body with cURL, we can use the -d option and pass the body as a string:

curl -X POST -d '{"key1": "value1", "key2": "value2"}' https://httpbin.dev/post

If we observe the response, we'll find the body passed to the request present:

{
  ....
  "data": "{\"key1\": \"value1\", \"key2\": \"value2\"}",
}

Note that on Windows (cmd), single quotes don't group arguments, so wrap the body in double quotes and escape the inner double quotes with backslashes:

curl -X POST -d "{\"key1\": \"value1\", \"key2\": \"value2\"}" https://httpbin.dev/post
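
Hand-escaping JSON differs between shells and is easy to get wrong. One way around it is to generate the body and pass it as a single argument, as in this sketch, which assumes Python 3.8 or newer. The `Content-Type: application/json` header is an addition that is not part of the commands above. Without it, `-d` sends the body as `application/x-www-form-urlencoded`.

```python
import json
import shlex

# Serialize the body instead of escaping quotes by hand.
body = json.dumps({"key1": "value1", "key2": "value2"})

command = [
    "curl", "-X", "POST",
    # Addition not in the article's command: declare the body as JSON.
    "-H", "Content-Type: application/json",
    "-d", body,
    "https://httpbin.dev/post",
]

# For display only: shlex.join applies POSIX shell quoting (bash/zsh, not cmd).
print(shlex.join(command))
```

Passing `command` to `subprocess.run` skips the shell entirely, so the same list works on Linux, Mac and Windows without any escaping.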

Web Scraping With cURL

The standard web scraping process requires HTML parsing, crawling, processing and saving the extracted data. cURL alone isn't suitable for these extensive scraping tasks. However, it can be a great asset for debugging and development. Accordingly, we'll explore using cURL for common web scraping tips and tricks.

Scraping Dynamic Pages With cURL

Data on dynamic websites is usually loaded through background XHR calls. These API calls can be captured in the browser developer tools and exported as cURL requests for web scraping.

For example, the review data on web-scraping.dev is loaded through background API requests:

Figure: Reviews on web-scraping.dev

First, let's capture the API calls on the above web page using the following steps:

  • Open the browser developer tools by pressing the F12 key.
  • Select the network tab and filter by Fetch/XHR calls.
  • Scroll down the page to load more review data.

After following the above steps, you will find the outgoing API calls recorded on the browser:

Figure: Background API calls on web-scraping.dev

Next, copy the cURL representation of the request. Right-click on the request, hover over the Copy menu and select Copy as cURL (bash) if you are on Mac or Linux, or Copy as cURL (cmd) if you are on Windows.

Figure: Copy the request as cURL

The copied cURL command should look like this:

curl 'https://web-scraping.dev/api/testimonials?page=2' \
  -H 'authority: web-scraping.dev' \
  -H 'accept: */*' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'cookie: cookiesAccepted=true' \
  -H 'hx-current-url: https://web-scraping.dev/testimonials' \
  -H 'hx-request: true' \
  -H 'referer: https://web-scraping.dev/testimonials' \
  -H 'sec-ch-ua: "Chromium";v="122", "Not(A:Brand";v="24", "Microsoft Edge";v="122"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Windows"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: same-origin' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0' \
  -H 'x-secret-token: secret123'

We can see the headers, cookies and parameters used with the cURL request. Executing it will return the HTML data found on the browser:

<div class="testimonial">
    <identicon-svg username="testimonial-11"></identicon-svg>
    <div>
        <span class="rating"></span>
        <p class="text">The features are great but it took me a while to understand how to use them.</p>
    </div>
</div>

<div class="testimonial">
    <identicon-svg username="testimonial-12"></identicon-svg>
    <div>
        <span class="rating"></span>
        <p class="text">Love the simplicity and effectiveness of this app.</p>
    </div>
</div>
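
cURL stops at fetching this HTML, so extracting the review text needs a parser. Below is a minimal sketch using only the Python 3 standard library. It runs on a hard-coded copy of the response above rather than a live request, and `ReviewTextParser` is a hypothetical class name. A dedicated parsing library would usually be a better fit for real scrapers.

```python
from html.parser import HTMLParser
from typing import List


class ReviewTextParser(HTMLParser):
    """Collect the text of every <p class="text"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.texts: List[str] = []
        self._in_text = False

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("class") == "text":
            self._in_text = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_text = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self._in_text:
            self.texts[-1] += data


# Sample copied from the response above (not fetched live).
html = """
<div class="testimonial">
    <identicon-svg username="testimonial-11"></identicon-svg>
    <div>
        <span class="rating"></span>
        <p class="text">The features are great but it took me a while to understand how to use them.</p>
    </div>
</div>
<div class="testimonial">
    <identicon-svg username="testimonial-12"></identicon-svg>
    <div>
        <span class="rating"></span>
        <p class="text">Love the simplicity and effectiveness of this app.</p>
    </div>
</div>
"""

parser = ReviewTextParser()
parser.feed(html)
print(parser.texts)
```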

Now that we can execute a successful cURL request, we can import it to an HTTP client such as Postman. This allows us to convert the cURL command into a programming language script like Python requests to continue the scraping process from there.

Moreover, this approach allows our web scraping requests to be identical to those of normal users, reducing our chances of getting blocked!
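
To illustrate what such a conversion involves, here is a rough sketch, assuming Python 3, that pulls the URL and headers out of a bash-style cURL command. `parse_curl` is a hypothetical helper, not the converter used by Postman or any other tool. It only understands the `curl URL -H 'name: value'` shape shown above, and flags that take a value (other than `-H`) are not supported.

```python
import shlex
from typing import Dict, Tuple


def parse_curl(command: str) -> Tuple[str, Dict[str, str]]:
    """Extract the URL and -H headers from a bash-style cURL command."""
    # Drop backslash-newline continuations, then split like a POSIX shell.
    tokens = shlex.split(command.replace("\\\n", " "))
    url = ""
    headers: Dict[str, str] = {}
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        if tokens[i] in ("-H", "--header") and i + 1 < len(tokens):
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip()] = value.strip()
            i += 2
        elif tokens[i].startswith("-"):
            i += 1  # skip flags this sketch does not understand
        else:
            url = tokens[i]
            i += 1
    return url, headers


# Abbreviated version of the copied command above.
copied = r"""curl 'https://web-scraping.dev/api/testimonials?page=2' \
  -H 'accept: */*' \
  -H 'referer: https://web-scraping.dev/testimonials' \
  -H 'x-secret-token: secret123'
"""

url, headers = parse_curl(copied)
print(url)
print(headers)
```

The extracted `url` and `headers` can then be handed to any Python HTTP client, for example `urllib.request.Request(url, headers=headers)`.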

Avoid cURL Scraping Blocking

cURL can be a viable tool for requesting and transferring data across web pages. However, websites use protection shields, such as Cloudflare, to prevent automated requests like those of cURL from accessing the website.

For example, let's attempt to request G2 with cURL. It's a popular website with Cloudflare protection:

curl https://www.g2.com/

The website greeted us with a Cloudflare challenge to solve:

<!DOCTYPE html><html lang="en-US"><head><title>Just a moment...</title>
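
In a script, it helps to recognize this kind of response before trying to parse it. The following is a heuristic sketch, assuming Python 3, that only matches the page title shown above. `looks_like_challenge` is a hypothetical helper, and the title text is an assumption that may differ for other sites or change over time.

```python
import re


def looks_like_challenge(html: str) -> bool:
    """Heuristic: True if the page title matches the challenge page above."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match is not None and match.group(1).strip() == "Just a moment..."


# Sample copied from the blocked response above (not fetched live).
blocked = '<!DOCTYPE html><html lang="en-US"><head><title>Just a moment...</title>'
print(looks_like_challenge(blocked))
```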

To prevent cURL web scraping blocking, we can use Curl Impersonate, a modified version of cURL that simulates the TLS fingerprint of normal web browsers. It also overrides the default cURL headers, such as the User-Agent, with regular browser header values. This makes Curl Impersonate requests look like those sent from browsers, making it harder for firewalls to detect the usage of an HTTP client.

If we request G2 again with Curl Impersonate, we'll get the actual page HTML:

<h1 class="hero-unit__title" id="main">Where you go for software.</h1>

For more details on Curl Impersonate, including its installation and usage, refer to our dedicated guide.

Adding Proxies to cURL

In the previous section, we explored how to keep websites from detecting cURL by modifying the request configuration. However, websites use another trick to block requests: tracking the IP address.

Using proxies with cURL allows for distributing the traffic load across multiple IP addresses. This makes it harder for websites and firewalls to trace the requests back to a single origin IP address, leading to better chances of avoiding blocking.

To add proxies for cURL, we can use the -x or --proxy option followed by the proxy URL:

curl -x <protocol>://<proxy_host>:<proxy_port> <url>

The above is the general syntax for adding proxies to cURL requests. In practice, it can be used like this for different proxy types:

# HTTP
curl -x http://proxy_domain.com:8080 https://httpbin.dev/ip
# HTTPS
curl -x https://proxy_domain.com:8080 https://httpbin.dev/ip
# SOCKS5
curl -x socks5://proxy_domain.com:8080 https://httpbin.dev/ip
# Proxies with credentials
curl -x https://username:password@proxy.proxy_domain.com:8080 https://httpbin.dev/ip
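
To actually spread requests across several IP addresses, the proxy has to change between requests. Here is a minimal rotation sketch, assuming Python 3. The `proxy_url` and `next_command` helpers and the `example.com` proxy hosts are placeholders. Credentials are percent-encoded because cURL URL-decodes the user and password in a proxy string, which lets characters such as `@` be passed safely.

```python
from itertools import cycle
from typing import List
from urllib.parse import quote


def proxy_url(protocol: str, host: str, port: int,
              username: str = "", password: str = "") -> str:
    """Build a proxy URL in the <protocol>://<proxy_host>:<proxy_port> form."""
    credentials = ""
    if username:
        # Percent-encode so characters like '@' or ':' stay unambiguous.
        credentials = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    return f"{protocol}://{credentials}{host}:{port}"


# Placeholder proxy hosts; replace with real ones.
proxies = cycle([
    proxy_url("http", "proxy1.example.com", 8080),
    proxy_url("socks5", "proxy2.example.com", 1080, "user", "p@ss"),
])


def next_command(url: str) -> List[str]:
    """Return a cURL argv list that uses the next proxy in the rotation."""
    return ["curl", "-x", next(proxies), url]


first = next_command("https://httpbin.dev/ip")
second = next_command("https://httpbin.dev/ip")
print(first)
print(second)
```

Round-robin rotation is the simplest policy. Picking proxies at random or retiring ones that get blocked are common refinements.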

For more details on using proxies for web scraping, refer to our dedicated guide.

Powering Up With ScrapFly

ScrapFly is a web scraping API that allows for scraping at scale.

Figure: ScrapFly service does the heavy lifting for you!

ScrapFly provides an API player that allows for converting cURL commands into ScrapFly-powered web scraping requests:

Figure: Import cURL command into ScrapFly's API player

ScrapFly also provides a cURL to Python tool that allows for converting cURL commands into code for different Python HTTP clients, such as requests, aiohttp, httpx, and curl_cffi.

Here is an example output of importing a cURL request from the browser into the ScrapFly API player to automatically add the request configuration. We'll also enable the asp parameter to bypass scraping blocking, select a proxy country and use the render_js feature to enable JavaScript:

from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url="https://web-scraping.dev/api/testimonials?page=2",
    # enable anti scraping protection
    asp=True,
    # select a proxy country
    country="us",
    # enable JavaScript rendering, similar to headless browsers
    render_js=True,
    # headers assigned to the cURL request from the browser
    headers={ 
        "sec-ch-ua": "\"Chromium\";v=\"122\", \"Not(A:Brand\";v=\"24\", \"Google Chrome\";v=\"122\"",
        "x-secret-token": "secret123",
        "HX-Current-URL": "https://web-scraping.dev/testimonials",
        "sec-ch-ua-mobile": "?0",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Referer": "https://web-scraping.dev/testimonials",
        "HX-Request": "true",
        "sec-ch-ua-platform": "\"Windows\""
    },
))

# get the HTML from the response
html = response.scrape_result['content']

# use the built-in Parsel selector
selector = response.selector

FAQ

To wrap up this guide on web scraping with cURL, let's have a look at some frequently asked questions.

Can I use cURL for web scraping?

Yes, but not in the traditional sense. cURL is an HTTP client that doesn't provide additional utilities for parsing or data processing. Therefore, web scraping with cURL is best suited for debugging and development purposes or for extracting small amounts of data.

Are there alternatives to cURL?

Yes, curlie is a command-line HTTP client that combines the cURL features with the HTTPie interface. Another alternative to using cURL for web scraping is the Postman HTTP client. We have covered using Postman in a previous article.

Summary

In this guide, we explained how to web scrape with cURL. We started by exploring different cURL commands for various actions, including:

  • Sending GET requests.
  • Managing and manipulating HTTP headers and cookies.
  • Sending POST requests.

We have also explained common tips and tricks for web scraping with cURL, such as:

  • Scraping dynamic web pages by replicating background XHR calls.
  • Avoiding cURL scraping blocking using Curl Impersonate.
  • Preventing IP address blocking with cURL by adding proxies.