How to build a web scraper: A beginner's guide

Have you ever wanted to collect data from websites automatically? Maybe you're interested in tracking product prices, gathering news articles, or compiling a list of businesses.


What is Web Scraping?

Web scraping is the process of automatically extracting information from websites. Instead of manually copying and pasting data, a web scraper does the heavy lifting by fetching the web pages and extracting the desired information.

Choose Your Tools

Before we dive in, it's essential to know that web scraping can be done using a variety of programming languages and tools—not just Python. Whether you're comfortable with JavaScript, Ruby, Go, or any other language, libraries and frameworks are available to help you scrape websites.

Feel free to jump ahead to the section for your preferred programming language.

If you're not into coding or prefer a quicker setup, there are also low-code and no-code tools that make web scraping accessible to everyone. These tools are usually less flexible than writing your own program, though.

The Two Core Steps of Web Scraping

Web scraping generally involves two fundamental steps:

  1. Requesting the HTML Content
  2. Parsing the Relevant Data

Let's dive into what each of these steps entails.

1. Requesting the HTML Content

The first step in web scraping is to fetch the HTML content of the web page you want to scrape. This is similar to how your web browser works when you visit a website—it sends a request to the server, and the server responds with the HTML, CSS, JavaScript, and other resources needed to display the page.

In web scraping, you'll use a programming language or tool to send an HTTP request to the website's server. This can be done using various methods:

  • Programming Libraries: Languages like Python (with the requests library), JavaScript, Ruby, and others have libraries that make it easy to send HTTP requests.
  • Built-in Functions: Some languages have built-in functions for making web requests.

By requesting the HTML content, you're retrieving the raw data that makes up the web page, which you can then process and analyze.
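
For example, a single line of Python (using the requests library, which we'll install later in this guide) is enough to fetch a page's raw HTML:

import requests

html = requests.get('https://example.com').text  # the raw HTML as a string
print(html[:200])  # preview the first 200 characters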

2. Parsing the Relevant Data

Once you've obtained the HTML content, the next step is to parse it to extract the information you're interested in. Web pages are structured documents that use HTML tags to define elements like headings, paragraphs, links, images, and more.

To navigate through the HTML document, we can use CSS selectors or XPath expressions.
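
For instance, here is a minimal Python sketch (using the Beautiful Soup and lxml libraries, both of which need to be installed separately; the div.item markup is just a made-up illustration) that grabs the same link with a CSS selector and with an XPath expression:

from bs4 import BeautifulSoup
from lxml import html

snippet = '<div class="item"><a class="title" href="/post/1">Hello</a></div>'

# CSS selector with Beautiful Soup
soup = BeautifulSoup(snippet, 'html.parser')
print(soup.select_one('div.item a.title')['href'])  # /post/1

# XPath with lxml
tree = html.fromstring(snippet)
print(tree.xpath('//div[@class="item"]/a[@class="title"]/@href')[0])  # /post/1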

Parsing involves:

  • Understanding the Structure: To find your desired data, you'll need to inspect the HTML. This often involves looking at the tags, classes, or IDs encapsulating the data.
You can right-click an element and choose "Inspect" in your browser to see the HTML structure. Alternatively, you can choose "View page source" to open the raw HTML in a new tab.
  • Using Parsing Tools or Libraries:
    • For Programmers: Libraries like Beautiful Soup (Python), Cheerio (JavaScript), Nokogiri/Nokolexbor (Ruby), and others can parse HTML and XML documents, allowing you to navigate and search the document tree.
  • Extracting the Data: Once you've identified the correct elements, you can extract the text, attributes, or other relevant information.

You can even use AI to parse raw HTML into nicely structured data.

By parsing the relevant data, you transform the raw HTML into meaningful information that you can use for your specific needs, like building a dataset, analyzing trends, or feeding into another application.

Practice

It's time to practice. Let's use Hacker News (https://news.ycombinator.com/) as the target website that we want to scrape.


Inspect where this data is located

Let's assume we want to scrape the:
- title
- link
- visible link


It looks like each story sits in a row with the athing class, with the title and link inside the titleline class and the visible link inside the sitestr class.

The site's layout may change in the future, so please adjust your code accordingly.

Below, we'll look at how to scrape this website using different languages.

Web scraping using Python

Python has become the go-to language for web scraping. It has a rich ecosystem of libraries and frameworks that make web scraping straightforward:

  • Requests: Simplifies making HTTP requests to fetch web pages.
  • Beautiful Soup: Makes it easy to parse and navigate HTML and XML documents.
  • Selenium: Allows you to automate web browsers to scrape dynamic, JavaScript-rendered content.
  • Scrapy: A powerful framework for large-scale web scraping with built-in handling for crawling, data extraction, and pipeline management.

These tools handle much of the heavy lifting, so you can focus on extracting the data you need. Let's practice with the first two tools.

We'll be using the Hacker News website for practice.

1. Install Python and Necessary Libraries

First, you'll need to install Python on your computer. You can download it from the official website.

Next, install two essential libraries:

  • Requests: To send HTTP requests and get the web page's HTML.
  • Beautiful Soup: To parse and navigate the HTML content.

Open your command prompt or terminal and run:

pip install requests beautifulsoup4

2. Request the HTML Content

Create a new Python file called main.py and start by importing the requests library:

import requests

URL = 'https://news.ycombinator.com/'
response = requests.get(URL)
html_content = response.text
print(html_content)

This code sends a GET request to the URL and stores the HTML content in a variable.
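
In practice, it's worth checking that the request actually succeeded before parsing anything. A small optional addition, using standard features of the requests library:

response = requests.get(URL, timeout=10)  # give up after 10 seconds
response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
html_content = response.text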

3. Parse the Relevant Data

Now, let's parse the HTML to extract the data.

a. Import Beautiful Soup

Add the following import statement:

from bs4 import BeautifulSoup

b. Create a Soup Object

soup = BeautifulSoup(html_content, 'html.parser')

This creates a Beautiful Soup object that represents the parsed HTML document.
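
If you want to explore the page before writing the full extraction loop, here are a few handy Beautiful Soup calls you can try on the soup object (shown purely as examples):

print(soup.title.text)  # the text inside the page's <title> tag
first_row = soup.find('tr', class_='athing')  # the first story row
all_rows = soup.find_all('tr', class_='athing')  # every story row on the page
print(len(all_rows), 'stories found')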

c. Extract title, link, and visible links

rows = soup.find_all('tr', class_='athing')

for row in rows:
    titleline = row.find('span', class_='titleline')
    link_tag = titleline.find('a', href=True)

    title = link_tag.text
    link = link_tag['href']

    visible_link_tag = row.find('span', class_='sitestr')
    # We add this condition to avoid an error when the visible_link_tag is None
    visible_link = visible_link_tag.text if visible_link_tag else "No visible link"

    print("Title:", title)
    print("Link:", link)
    print("Visible Link:", visible_link)
    print('---')

This loop goes through each story row and prints its title, link, and visible link.

4. Scrape Multiple Pages

The website has multiple pages. If we want to grab all the data, we must first find the pattern. In this case, this is what the patterns look like:

  • Page 1: https://news.ycombinator.com/?p=1
  • Page 2: https://news.ycombinator.com/?p=2
  • Page n: https://news.ycombinator.com/?p=n

Let's modify the script to scrape the first three pages.

import requests
from bs4 import BeautifulSoup

URL = 'https://news.ycombinator.com/'
max_pages = 3  # Limit the scrape to the first 3 pages

for i in range(1, max_pages+1):
    response = requests.get(f'{URL}?p={i}')
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')

    rows = soup.find_all('tr', class_='athing')

    for row in rows:
        titleline = row.find('span', class_='titleline')
        link_tag = titleline.find('a', href=True)

        title = link_tag.text
        link = link_tag['href']

        visible_link_tag = row.find('span', class_='sitestr')
        visible_link = visible_link_tag.text if visible_link_tag else "No visible link"

        print("Title:", title)
        print("Link:", link)
        print("Visible Link:", visible_link)
        print('---')

This script fetches pages 1 through 3, scraping each one in turn.
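
When scraping several pages in a row, it's also polite (and less likely to get you blocked) to pause between requests. Here's a minimal sketch of how you might add a delay, assuming a one-second pause is acceptable for your use case:

import time

for i in range(1, max_pages + 1):
    response = requests.get(f'{URL}?p={i}')
    # ... parse the page with Beautiful Soup as before ...
    time.sleep(1)  # wait one second before requesting the next page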

5. Save Data to a CSV File

You might want to save the data for later use. Let's write it to a CSV file.

import requests
from bs4 import BeautifulSoup
import csv

URL = 'https://news.ycombinator.com/'
max_pages = 3

# Open a CSV file in write mode
with open('data.csv', mode='w', newline='') as file:
    writer = csv.writer(file)
    
    # Write the header row
    writer.writerow(['Title', 'Link', 'Visible Link'])
    
    for i in range(1, max_pages+1):
        response = requests.get(f'{URL}?p={i}')
        html_content = response.text
        soup = BeautifulSoup(html_content, 'html.parser')

        rows = soup.find_all('tr', class_='athing')

        for row in rows:
            titleline = row.find('span', class_='titleline')
            link_tag = titleline.find('a', href=True)

            title = link_tag.text
            link = link_tag['href']

            visible_link_tag = row.find('span', class_='sitestr')
            visible_link = visible_link_tag.text if visible_link_tag else "No visible link"

            # Write the data row to the CSV file
            writer.writerow([title, link, visible_link])

Now, running the script will create a data.csv file with all the relevant data.
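
Once the file exists, you can load it back for later analysis. A quick sketch using Python's built-in csv module:

import csv

with open('data.csv', newline='') as file:
    reader = csv.DictReader(file)  # reads each row into a dict keyed by the header
    for row in reader:
        print(row['Title'], '->', row['Link'])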


Web scraping using Javascript

JavaScript isn't just for front-end development—it's also a powerful tool for web scraping. If you're already familiar with JavaScript or prefer using it over Python, you're in luck! JavaScript has a rich ecosystem of libraries and frameworks that make web scraping straightforward:

  • Axios: Simplifies making HTTP requests to fetch web pages.
  • Cheerio: Enables you to parse and navigate HTML and XML documents efficiently.
  • Puppeteer: Allows you to control a headless Chrome browser to scrape dynamic, JavaScript-rendered content.
  • Node.js: Provides a JavaScript runtime environment that lets you run JavaScript code outside of a browser.

These tools handle much of the heavy lifting, so you can focus on extracting the needed data. Let's practice with the first two tools.

We'll be using the Hacker News website for practice.

1. Install Node.js and Necessary Libraries

First, you'll need Node.js installed on your computer. You can download it from the official website.

Next, install two essential libraries:

  • Axios: To send HTTP requests and get the web page's HTML.
  • Cheerio: To parse and navigate the HTML content.

Open your command prompt or terminal and run:

npm install axios cheerio

2. Request the HTML Content

Create a new JavaScript file called main.js and start by importing the axios library:

const axios = require('axios');

const URL = 'https://news.ycombinator.com/';

async function run() {
    const response = await axios.get(`${URL}`);
    console.log(response.data)
}

run()

This code sends a GET request to the URL and logs the HTML content of the page.

3. Parse the Relevant Data

Now, let's parse the HTML to extract the data.

a. Import Cheerio

Add the following import statement:

const cheerio = require('cheerio');

b. Load and parse the data

const URL = 'https://news.ycombinator.com/';

async function run() {
    const response = await axios.get(`${URL}`);
    const htmlContent = response.data;
    const $ = cheerio.load(htmlContent);

    // Find all rows with the class 'athing'
    $('tr.athing').each((_, element) => {
      const titleElement = $(element).find('span.titleline a');
      const title = titleElement.text();
      const link = titleElement.attr('href');

      const visibleLinkElement = $(element).find('span.sitestr');
      const visibleLink = visibleLinkElement.length ? visibleLinkElement.text() : 'No visible link';

    console.log({ title, link, visibleLink });
    }); 
}

run()

This loop goes through each story row and logs its title, link, and visible link.

4. Scrape Multiple Pages

The website has multiple pages. If we want to grab all the data, we must first find the pattern. In this case, this is what the patterns look like:

  • Page 1: https://news.ycombinator.com/?p=1
  • Page 2: https://news.ycombinator.com/?p=2
  • Page n: https://news.ycombinator.com/?p=n

Let's modify the script to scrape the first three pages.

const URL = 'https://news.ycombinator.com/';
const maxPages = 3;

async function run() {
    for (let i = 1; i <= maxPages; i++) {
        const response = await axios.get(`${URL}?p=${i}`);
        const htmlContent = response.data;
        const $ = cheerio.load(htmlContent);

        // Find all rows with the class 'athing'
        $('tr.athing').each((_, element) => {
        const titleElement = $(element).find('span.titleline a');
        const title = titleElement.text();
        const link = titleElement.attr('href');

        const visibleLinkElement = $(element).find('span.sitestr');
        const visibleLink = visibleLinkElement.length ? visibleLinkElement.text() : 'No visible link';

        console.log({ title, link, visibleLink });
        }); 
    }
}

run()

5. Save Data to a CSV File

You might want to save the data for later use. Let's save this example in a CSV file.

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

const URL = 'https://news.ycombinator.com/';
const maxPages = 3;

const csvFile = 'data.csv';
const header = 'Title,Link,Visible Link\n';
fs.writeFileSync(csvFile, header);

async function run() {
    for (let i = 1; i <= maxPages; i++) {
        const response = await axios.get(`${URL}?p=${i}`);
        const htmlContent = response.data;
        const $ = cheerio.load(htmlContent);
  
        // Find all rows with the class 'athing'
        $('tr.athing').each((_, element) => {
          const titleElement = $(element).find('span.titleline a');
          const title = titleElement.text();
          const link = titleElement.attr('href');
  
          const visibleLinkElement = $(element).find('span.sitestr');
          const visibleLink = visibleLinkElement.length ? visibleLinkElement.text() : 'No visible link';
  
          // Append data to CSV file
          const dataRow = `"${title}","${link}","${visibleLink}"\n`;
          fs.appendFileSync(csvFile, dataRow);
        });
    }
}

run()

Now, running the script will create a data.csv file with all the relevant data.


Web scraping using Ruby

Ruby is a dynamic, open-source programming language focusing on simplicity and productivity. It's known for its elegant syntax, which is natural to read and easy to write. If you're comfortable with Ruby or want to try it out for web scraping, you're in the right place! Ruby has powerful libraries that make web scraping straightforward:

  • HTTParty: Simplifies making HTTP requests to fetch web pages.
  • Nokogiri: A gem that makes it easy to parse and navigate HTML and XML documents.
  • Selenium WebDriver: Allows you to automate web browsers to scrape dynamic, JavaScript-rendered content.

These tools handle much of the heavy lifting, so you can focus on extracting the data you need. Let's practice with the first two tools.

We'll be using the Hacker News website for practice.

1. Install Ruby and Necessary Gems

First, you'll need Ruby installed on your computer. You can download it from the official website or use a version manager like rbenv or RVM.

Next, install two essential gems (Ruby libraries):

  • HTTParty: To send HTTP requests and get the web page's HTML.
  • Nokogiri: To parse and navigate the HTML content.

Open your command prompt or terminal and run:

gem install httparty nokogiri

2. Request the HTML Content

Create a new Ruby file called main.rb and start by requiring the httparty gem:

require 'httparty'

URL = 'https://news.ycombinator.com/'
response = HTTParty.get(URL)
html_content = response.body
puts html_content 

This code sends a GET request to the URL and stores the HTML content in a variable.

3. Parse the Relevant Data

Now, let's parse the HTML to extract the relevant data.

a. Require Nokogiri

Add the following require statement:

require 'nokogiri'

b. Parse the HTML with Nokogiri

Modify your code to parse the HTML content:

require 'httparty'
require 'nokogiri'

URL = 'https://news.ycombinator.com/'
response = HTTParty.get(URL)
html_content = response.body

doc = Nokogiri::HTML(html_content)

This creates a Nokogiri document that represents the parsed HTML.

c. Parse the data

Let's extract the title, link, and the visible link:

URL = 'https://news.ycombinator.com/'
response = HTTParty.get(URL)
html_content = response.body

parsed_page = Nokogiri::HTML(html_content)
rows = parsed_page.css('tr.athing')

rows.each do |row|
    titleline = row.at_css('span.titleline a')
    title = titleline.text
    link = titleline['href']

    visible_link_tag = row.at_css('span.sitestr')
    visible_link = visible_link_tag ? visible_link_tag.text : 'No visible link'

    print "#{title} , #{link}, (#{visible_link})\n"
end

This loop goes through each story row and prints its title, link, and visible link.

4. Scrape Multiple Pages

The website has multiple pages. If we want to grab all the data, we must first find the pattern. In this case, this is what the patterns look like:

  • Page 1: https://news.ycombinator.com/?p=1
  • Page 2: https://news.ycombinator.com/?p=2
  • Page n: https://news.ycombinator.com/?p=n

Let's modify the script to scrape the first three pages.


URL = 'https://news.ycombinator.com/'
max_pages = 3

(1..max_pages).each do |i|
    response = HTTParty.get("#{URL}?p=#{i}")
    html_content = response.body
    parsed_page = Nokogiri::HTML(html_content)
   
    rows = parsed_page.css('tr.athing')
   
    rows.each do |row|
        titleline = row.at_css('span.titleline a')
        title = titleline.text
        link = titleline['href']

        visible_link_tag = row.at_css('span.sitestr')
        visible_link = visible_link_tag ? visible_link_tag.text : 'No visible link'

        print "#{title} , #{link}, (#{visible_link})\n"
    end
end

This script fetches pages 1 through 3, scraping each one in turn.

5. Save Data to a CSV File

You might want to save the data for later use. Let's save this data in a CSV file.

require 'csv'

URL = 'https://news.ycombinator.com/'
max_pages = 3

# Open a CSV file in write mode
CSV.open('data.csv', 'w') do |csv|
  # Write the header row
  csv << ['Title', 'Link', 'Visible Link']

  (1..max_pages).each do |i|
    response = HTTParty.get("#{URL}?p=#{i}")
    html_content = response.body
    parsed_page = Nokogiri::HTML(html_content)

    rows = parsed_page.css('tr.athing')

    rows.each do |row|
      titleline = row.at_css('span.titleline a')
      title = titleline.text
      link = titleline['href']

      visible_link_tag = row.at_css('span.sitestr')
      visible_link = visible_link_tag ? visible_link_tag.text : 'No visible link'

      # Write the data row to the CSV file
      csv << [title, link, visible_link]
    end
  end
end

puts 'Data successfully written to data.csv'

Now, running the script will create a data.csv file with all relevant data inside.

Next Steps

  • Learn About CSS Selectors: Nokogiri supports CSS selectors, which can make selecting elements more intuitive.
  • Scrape JavaScript-Rendered Content: Use tools like Selenium WebDriver for sites that load content dynamically.
  • Build a Real Project: Think of a practical application for your scraper to solidify your learning.
  • Nokogiri Documentation
  • HTTParty Documentation

Challenges in Web Scraping

Beyond fetching and parsing data, web scraping involves several additional challenges:

  1. Identifying Data Sources: Determining where and how data is stored can be complex. Websites may store data across different pages or load it via APIs, making it difficult to pinpoint the exact source.
  2. Handling Captchas: Websites often use CAPTCHAs to block automated access. Bypassing CAPTCHAs without violating terms of service requires sophisticated methods, such as using CAPTCHA-solving services, which adds cost and complexity.
  3. Managing IP Blocks and Rate Limits: Frequent requests from the same IP can lead to blocks or bans. Using proxies to rotate IPs and implementing smart throttling (see the brief sketch after this list) can help mitigate this, but both require careful management and additional infrastructure.
  4. Legal and Ethical Considerations: Scraping must be done in compliance with laws and website terms of service. Ignoring these can lead to legal actions and ethical issues.
  5. Data Structure Changes: Websites often update their layout and structure, which can break scrapers. Regularly maintaining and updating scraping scripts is crucial to handle these changes.
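
To illustrate point 3, here is a rough Python sketch of rotating proxies and throttling requests. The proxy URLs are hypothetical placeholders; you'd substitute proxies you actually have access to:

import random
import time
import requests

# Hypothetical proxy list -- replace with proxies you actually control or rent.
PROXIES = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
]

def polite_get(url):
    proxy = random.choice(PROXIES)  # rotate: pick a different proxy for each call
    response = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
    time.sleep(2)  # simple throttle: wait two seconds between requests
    return response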

You may be interested in reading more tips and tricks on web scraping:

13 ways to scrape any public data from any website
This blog post shows different approaches to extracting any type of public data from any website, with Python code examples.

That's it! Thank you very much for reading this blog post. I hope you enjoyed it!