Web scraping with cURL (fetching RAW HTML data)
Did you know you can scrape a website from your command line? With curl, you have a simple tool at your fingertips, ready to collect data from the web with minimal fuss. Let's explore how powerful curl is for web scraping!
Warning: In web scraping, cURL can only retrieve the raw HTML data; it cannot parse or extract specific data.
What is cURL?
cURL is a command-line tool and library for transferring data with URLs. Think of Postman, but without the GUI (Graphical User Interface): we'll work only in the command line / terminal instead of a clickable interface.
Keep in mind that web scraping usually involves two things:
- Getting the raw HTML data
- Parsing or extracting a specific section
cURL can only do the former, so we need to combine it with other tools to build a fully functioning web scraping workflow.
Web scraping is just one of the many uses of curl; it's not its only purpose. With curl, you can download files, automate data collection, test APIs, monitor server response times, and simulate user interactions with web services, all from the command line.
Basic command for cURL
The basic syntax for a cURL command is:
curl [options...] <url>
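For example, here is a command that combines a few common options (the URL and output file name below are just placeholders):

```bash
# Fetch a page quietly (-s), follow redirects (-L), and save the response to a file (-o)
curl -sL -o example.html https://example.com
```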
Make sure to Install cURL
To try the commands below, make sure to install cURL first. Some operating systems ship curl by default; you can check whether it is already installed by running curl --version.
If you see an error, you can refer to this page, which shows how to install cURL based on your operating system:
- Install cURL on Linux
- Install cURL on MacOS
- Install cURL on Windows
12 Basic cURL commands
To warm up, here are basic cURL commands with different options and brief explanations:
| Command | Explanation |
| --- | --- |
| curl http://example.com | Fetches the content of a web page (no options). |
| curl -o filename.html http://example.com | Downloads the content of a web page to a specified file. |
| curl -I http://example.com | Retrieves only the HTTP headers from a URL. |
| curl -L http://example.com | Follows HTTP redirects, which is useful for capturing the final destination of a URL with multiple redirects. |
| curl -u username:password http://example.com | Performs a request with HTTP authentication. |
| curl -x http://proxyserver:port http://example.com | Uses a proxy server for the request. |
| curl -d "param1=value1&param2=value2" -X POST http://example.com | Sends a POST request with data to the server. |
| curl -H "X-Custom-Header: value" http://example.com | Adds a custom header to the request. |
| curl -s http://example.com | Runs curl in silent mode, suppressing the progress meter and error messages. |
| curl -X PUT -T file.txt http://example.com | Uploads a file to the server using the PUT method. |
| curl -A "User-Agent-String" http://example.com | Sets a custom User-Agent header to simulate a specific client. |
| curl --json '{"tool": "curl"}' https://example.com/ | Sends JSON data with the request. |
Use curl --help to see more options.
These commands demonstrate some of the basic functionality of curl for interacting with web servers and APIs, and for handling different HTTP methods and data types.
How to use cURL for web scraping (retrieving raw HTML data)
Here are command examples of using cURL specifically for web scraping tasks.
Fetch the HTML Content of a Web Page
To get the HTML content of a web page, use:
curl http://example.com
Save the HTML Content to a File
Instead of just displaying the content, you can save it to a file:
curl -o filename.html http://example.com
After saving the HTML file, you can continue working with it in any programming language, since the raw content you want to scrape is already stored locally. You can load this file without sending another HTTP request.
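As a minimal sketch of that workflow (the file name and the grep pattern are just examples), you can fetch the page once and run all further processing against the local copy:

```bash
# Fetch once and save the raw HTML locally
curl -s -o page.html https://example.com

# Reuse the saved file as often as you like, without sending new requests
grep -c "<a " page.html   # e.g. count anchor tags in the saved page
```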
Scrape Specific Data from the HTML
curl itself doesn't parse HTML. You'll need to use it in combination with other command-line tools like grep, awk, or sed to extract specific data. For instance:
curl http://example.com | grep -o '<h1.*</h1>'
Here, the grep command-line tool searches the response and returns matches for a regular expression; in this example, it extracts the content between the <h1> tags.
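As a small extension of the same idea (a sketch, not a robust HTML parser), you can strip the surrounding tags from the match with sed:

```bash
# Grab the <h1> elements and remove the HTML tags, leaving only the text
curl -s https://example.com | grep -o '<h1[^>]*>.*</h1>' | sed 's/<[^>]*>//g'
```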
Send POST Requests
If the data you need is behind a form, you might need to send a POST request:
curl -d "param1=value1&param2=value2" -X POST http://example.com/form
This method can also be used to register or log in to an account when the form uses the POST request method.
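A common pattern here is to log in with a POST request, store the session cookie in a cookie jar, and reuse it on later requests. This is only a sketch; the login URL and the form field names (username, password) are hypothetical and depend on the target site:

```bash
# Log in and save any cookies the server sets into a cookie jar (-c)
curl -c cookies.txt -X POST -d "username=myuser&password=mypass" https://example.com/login

# Reuse the stored cookies (-b) on subsequent requests
curl -b cookies.txt https://example.com/account
```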
Handle Pagination or Multiple Pages
Your data will likely be spread across multiple pages, so you need to loop over the page numbers and substitute them into the URL (assuming the site is server-rendered).
For example, we want to request three different pages with this structure:
https://books.toscrape.com/catalogue/page-1.html
https://books.toscrape.com/catalogue/page-2.html
https://books.toscrape.com/catalogue/page-3.html
Notice that only the page number changes in each URL.
This is what the bash code looks like:
```bash
#!/bin/bash

# Base URL for the book catalogue
base_url="https://books.toscrape.com/catalogue/page-"

# Loop through the first three pages
for i in {1..3}; do
  # Construct the full URL
  url="${base_url}${i}.html"

  # Use curl to fetch the content and save it to a file
  curl -o "page-${i}.html" "$url"

  # Wait for a second to be polite and not overload the server
  sleep 1
done
```
We're using sleep here to respect the website we're scraping; we don't want to send too many requests at once.
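Building on the two previous ideas, here is a sketch that fetches several pages and extracts the book titles from each one. The title="..." pattern assumes the markup that books.toscrape.com happens to use, so adjust it for other sites:

```bash
#!/bin/bash
base_url="https://books.toscrape.com/catalogue/page-"

for i in {1..3}; do
  # Fetch each catalogue page and pull out the title="..." attributes,
  # then strip the attribute syntax so only the titles remain
  curl -s "${base_url}${i}.html" \
    | grep -o 'title="[^"]*"' \
    | sed -e 's/^title="//' -e 's/"$//'

  # Be polite between requests
  sleep 1
done
```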
Avoid getting blocked when web scraping with cURL
Some websites, like Google, will probably block your request when you try to access them with cURL. There are many tricks for avoiding blocks while web scraping; one of them is changing the user agent.
Change user-agent with cURL
You can set the user agent in cURL with the -A option.
curl -A "Mozilla/5.0" https://www.google.com
Take a look at the valid user agent list here.
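As a rough sketch, you can keep a small list of user-agent strings and pick one at random for each request. The strings below are shortened placeholders; in practice, use complete, up-to-date user agents from the list mentioned above:

```bash
#!/bin/bash
# Example user-agent strings (placeholders; substitute real, current ones)
agents=(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
  "Mozilla/5.0 (X11; Linux x86_64)"
)

# Pick a random user agent for this request
ua="${agents[RANDOM % ${#agents[@]}]}"
curl -s -A "$ua" -o page.html https://example.com
```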
Rotating a proxy with cURL
You can route a request through a proxy server, so it appears to come from a different IP address, with the -x option.
curl -x http://proxyserver:port http://example.com
You may need to subscribe to a proxy provider to get several proxies you can rotate through.
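A minimal sketch of proxy rotation, assuming you have a proxies.txt file with one proxy URL per line (the addresses come from your provider), could look like this:

```bash
#!/bin/bash
# proxies.txt contains one proxy per line, e.g. http://user:pass@host:port
while read -r proxy; do
  echo "Using proxy: $proxy"
  # Route the request through the current proxy and print the HTTP status code
  curl -s -x "$proxy" -o page.html -w "HTTP status: %{http_code}\n" https://example.com
  sleep 1
done < proxies.txt
```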
Using custom headers
Some websites may also require extra information to be sent in the headers alongside the request. Things like a cookie or referer should be attached as headers in this situation.
curl -H "Cookie: key1=value1" -H "Referer: https://example.com" https://example.com
Alternatively, you can use -b for the cookie:
curl -b "cookie_name=cookie_value" https://example.com
and -e for the referer header:
curl -e 'http://example.com' 'http://targetwebsite.com'
The long form --referer is also valid. In the sample above, example.com is the referer URL and targetwebsite.com is the target URL.
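Putting these options together, a single request that sets a user agent, a cookie, a referer, and an extra header might look like this (all of the values and the target URL are placeholders):

```bash
curl -A "Mozilla/5.0" \
     -b "session_id=abc123" \
     -e "https://example.com" \
     -H "Accept-Language: en-US" \
     -o products.html \
     https://example.com/products
```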
Why not use cURL for web scraping?
After seeing the power of cURL, we also need to consider its limitations:
- No JavaScript Execution: curl cannot execute JavaScript. If a website relies on JavaScript to load content dynamically, curl will not be able to access that content.
  Solution: Use a headless browser like Puppeteer, Selenium, or Playwright. Learn how to scrape a dynamic website using Puppeteer and NodeJS.
- Not a Parsing Tool: curl is not designed to parse HTML or extract specific data from a response; it simply retrieves the raw data. You need to use it in conjunction with other tools or languages that can parse HTML.
  Solution: You can use AI by OpenAI to parse the HTML data after receiving the raw HTML from cURL.
- Limited Debugging Features: Unlike tools with graphical interfaces, curl has limited debugging capabilities. Understanding errors may require a good grasp of HTTP and the command line.
  Solution: Use a library from a programming language, for example requests for Python or Cheerio for JavaScript.
- No Interaction with Web Pages: curl cannot interact with web pages, fill out forms, or simulate clicks, which limits its scraping capabilities for more dynamic sites.
  Solution: Use a headless browser, as in the solution to the first problem.
cURL vs Postman
cURL is still a powerful tool if we want to debug, test, or quickly retrieve information from a URL. You might wonder why you should use cURL instead of a tool like Postman.
Using curl instead of a graphical user interface (GUI) tool like Postman has several benefits, especially for those comfortable with the command line:
- Simplicity: curl's interface is simple and easy to use.
- Speed and Efficiency: curl is a command-line utility, so it can be much faster to work with because you run a single command instead of navigating through a GUI.
- Scriptability: curl can easily be scripted and integrated into larger shell scripts or automation workflows. This allows repetitive tasks to be automated, saving time and reducing the potential for human error.
- Resource-Friendly: curl typically uses fewer system resources than a GUI application, which can be an important consideration when working on a system with limited resources.
- Versatility: curl supports a wide range of protocols beyond HTTP and HTTPS, including FTP, FTPS, SCP, SFTP, and more, which can be very useful in various scenarios.
- Availability: curl is often pre-installed on many Unix-like operating systems (and ships with Windows starting with Windows 10), making it readily available without additional downloads.
Can cURL be used in a programming language?
Yes, curl can be used within various programming languages, typically through libraries or bindings that expose libcurl's functionality, or by invoking the curl command-line tool directly (for example, via Java's ProcessBuilder). This allows programmers to make HTTP requests, interact with APIs, and perform web scraping within their code. For example:
- cURL in Python: Python provides several libraries for using curl, with pycurl being the most direct wrapper around the libcurl library. It offers Python bindings for libcurl and gives access to almost all curl capabilities. Additionally, Python's requests library, while not a direct binding to curl, provides similar functionality in a more Pythonic way.
  Read: How to use cURL in Python and its alternatives for more.
- cURL in PHP: PHP has a built-in module for curl, typically referred to as cURL or PHP-CURL. This module allows PHP scripts to make requests to servers, download files, and process HTTP responses.
  Read: How to use cURL in PHP
- cURL in Node.js: In Node.js, while there are native HTTP modules, you can also use node-libcurl, which is a binding to libcurl. This allows the use of curl functionality in a Node.js environment.
  Read: How to use cURL in Javascript (Nodejs) for more.
- cURL in Ruby: Ruby also has a curl-like library known as Curb, which provides bindings to libcurl. This allows Ruby scripts to utilize curl's capabilities for making HTTP requests.
- cURL in Java: While Java doesn't have a direct curl library, tools like Apache HttpClient and OkHttp offer similar functionality.
FAQ
Can cURL handle Javascript-rendered websites?
Unfortunately, no. cURL is designed to transfer data with a URL; it doesn't execute JavaScript. To scrape content from a JavaScript-heavy website, you would typically use tools that can render JavaScript, such as headless browsers (e.g., Puppeteer, Selenium, Playwright).
Where can I learn more about cURL?
cURL has a great resource at https://everything.curl.dev/
Can cURL be used for web scraping?
Yes and no. cURL can do 50% of the job, retrieving the raw HTML data; the other 50%, parsing the data, can't be achieved by cURL alone.