How to Scrape Individual Websites Found via Google Search Results
You can easily access Google search results using our Google Search API. The results include the organic result websites, along with their titles, snippets, links, and other information.
However, you may need to scrape the full content of these individual websites. While that is beyond what our API returns, you can still easily scrape the content yourself after collecting the external links from our API.
The idea
The idea is pretty straightforward.
- Collect the website links using SerpApi.
- Then, scrape these websites individually.
In this blog post, I'll share how to do it in Python using a simple package that makes it easy to scrape multiple websites at once.
Once you have the content from these websites, you can start analyzing the text or feed the data into an LLM or other AI tool. Having up-to-date knowledge from the web will enrich your AI-based project.
Collect website links using SerpApi
Preparation for accessing the SerpApi API in Python
- Create a new main.py file.
- Install requests with:
pip install requests
Here is what the basic setup looks like:
import requests
SERPAPI_API_KEY = "YOUR_REAL_SERPAPI_API_KEY"
params = {
    "api_key": SERPAPI_API_KEY,  # replace with your real API key
    # search parameters will be added here shortly
}
search = requests.get("https://serpapi.com/search", params=params)
response = search.json()
print(response)
With these few lines of code, we can access all of the search engines available at SerpApi, including the Amazon Search API.
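For example, switching the engine parameter is all it takes to target a different engine. Here is a minimal sketch for the Amazon Search API (note that, unlike Google, the Amazon engine takes its search query in the k parameter; the query below is just a placeholder):

import requests

SERPAPI_API_KEY = "YOUR_SERPAPI_API_KEY"

params = {
    "api_key": SERPAPI_API_KEY,
    "engine": "amazon",   # switch engines by changing this value
    "k": "coffee maker",  # the Amazon engine uses "k" for the search query
}

search = requests.get("https://serpapi.com/search", params=params)
print(search.json())

Back to Google, here is a basic search: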
import requests
SERPAPI_API_KEY = "YOUR_SERPAPI_API_KEY"
params = {
    "api_key": SERPAPI_API_KEY,
    "engine": "google",
    "q": "caffe latte"  # the Google engine takes its search query in the "q" parameter
}
search = requests.get("https://serpapi.com/search", params=params)
response = search.json()
print(response)
To make the response easier to read, let's add indentation.
import json
# ... all the previous code
print(json.dumps(response, indent=2))
Under the organic_results key, you can see all the organic results from the Google SERP.
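Here is a trimmed, illustrative sketch of what a single entry contains (the exact fields and values depend on your query):

{
  "position": 1,
  "title": "...",
  "link": "https://...",
  "displayed_link": "...",
  "snippet": "..."
}

Each entry carries the link field, which is what we need for the next step.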
Collect the links
Since we only care about the links, let's gather them all in a variable.
search = requests.get("https://serpapi.com/search", params=params)
response = search.json()
urls = [result["link"] for result in response.get("organic_results", []) if "link" in result]
We've successfully collected all the links!
If you want to learn more about how to use our API in Python, you can read this blog post:
Scrape external links from organic results
Next, we need to scrape the individual external links. We could do it manually, one by one, using the requests package. However, in this tutorial, I'll use a simple package that I made in Python. This package simplifies scraping the main content from multiple websites.
Step by Step
Install the package:
pip install py-websites-scraper
Quick usage example:
import asyncio
from py_websites_scraper import scrape_urls

urls = ["https://news.ycombinator.com", "https://example.com"]  # change this

data = asyncio.run(scrape_urls(urls, max_concurrency=5))

for item in data:
    if item["success"] is True:
        print(item["url"], item.get("title"), item.get("content"))
    else:
        print("Failed fetching this URL: " + item["url"])
In our case, we need to replace the urls value with all the links we've gathered previously.
Warning: This package simply performs a plain request to the targeted websites. You may need to add custom logic for JavaScript-rendered websites or sites that use other blocking methods.
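If you need that kind of custom handling, one simple approach (a sketch building on the quick usage example above) is to collect the URLs that failed so you can re-scrape them with a different tool, such as a headless browser:

# URLs that could not be fetched with a plain request
failed_urls = [item["url"] for item in data if not item["success"]]
print(failed_urls)  # re-scrape these with a headless browser or another tool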
Let's continue from where we left off:
# ... all the previous code

search = requests.get("https://serpapi.com/search", params=params)
response = search.json()

urls = [result["link"] for result in response.get("organic_results", []) if "link" in result]

data = asyncio.run(
    scrape_urls(
        urls,
        max_concurrency=5
    )
)

for item in data:
    if item["success"] is True:
        print(item["url"], item.get("title"), item.get("content"))
    else:
        print("Failed fetching this URL: " + item["url"])
Of course, you can do anything you want inside the if branch. In this example, we simply print the URL, title, and content.
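For instance, instead of printing, you could collect the successful results into a list of dictionaries for later analysis (a small sketch reusing the same fields):

scraped_pages = []

for item in data:
    if item["success"] is True:
        # keep only the fields we care about
        scraped_pages.append({
            "url": item["url"],
            "title": item.get("title"),
            "content": item.get("content"),
        })

print(f"Collected {len(scraped_pages)} pages")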
Using a Proxy
Using the py-websites-scraper package, you can easily add a proxy to unblock some requests.
urls = []  # the links you want to scrape

results = asyncio.run(
    scrape_urls(
        urls,
        proxy="YOUR_PROXY_INFO",
        headers={"User-Agent": "USER_AGENT_INFO"},
    )
)
Export all data to a file
Here is an example of how you can gather all the information in a single file:
import asyncio
import requests
from py_websites_scraper import scrape_urls

SERPAPI_API_KEY = "YOUR_SERPAPI_KEY"  # replace with your real API key
SECRET_PROXY_INFO = "YOUR_PROXY_INFORMATION"

params = {
    "api_key": SERPAPI_API_KEY,
    "engine": "google",
    "q": "coffee"
}

search = requests.get("https://serpapi.com/search", params=params)
response = search.json()

urls = [result["link"] for result in response.get("organic_results", []) if "link" in result]

data = asyncio.run(
    scrape_urls(
        urls,
        max_concurrency=5,
        proxy=SECRET_PROXY_INFO,
    )
)

total_error = 0
response_text = ''

for item in data:
    if item["success"] is False:
        total_error += 1
        print("URL:", item["url"])
        print("Error:", item["error"])
        print("\n\n")
        print("=========================================")
        continue

    response_text += "URL: " + item["url"] + "\n"
    response_text += "Title: " + str(item["title"]) + "\n"
    response_text += "Content: " + str(item["content"]) + "\n"
    response_text += "\n\n"
    response_text += "=========================================\n"

print(f"Total URLs: {len(data)}")
print(f"Total Errors: {total_error}")

with open("scraped_data.txt", "w", encoding="utf-8") as file:
    file.write(response_text)

print("Scraped data saved to scraped_data.txt")
Whether you keep the results in a variable or write them to an external file like this, once you have the content from these individual websites, you can use the data however your project requires.
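For example, if you prefer structured output over plain text, here is a small variation (a sketch assuming the same data list as above) that writes the successful results to a JSON file instead:

import json

# keep only the successfully scraped pages
pages = [
    {"url": item["url"], "title": item.get("title"), "content": item.get("content")}
    for item in data
    if item["success"]
]

with open("scraped_data.json", "w", encoding="utf-8") as file:
    json.dump(pages, file, ensure_ascii=False, indent=2)

print("Scraped data saved to scraped_data.json")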