How to Scrape Individual Websites Found via Google Search Results
You can easily access Google search results using our Google Search API. The results include the organic result websites, along with their titles, snippets, links, and other information.
However, you may need to scrape the full content of these individual websites. While that is beyond what our API returns, you can still easily scrape the content yourself after collecting the external links from our API.
The idea
The idea is pretty straightforward.
- Collect the website links using SerpApi.
- Then, scrape these websites individually.
In this blog post, I'll share how to do it in Python using a simple package that makes it easy to scrape multiple websites at once.
Once you have the content from these websites, you can start analyzing the text or feed the data into an LLM or other AI tool. Having up-to-date knowledge from the web will enrich your AI-based project.
Collect website links using SerpApi
Preparation for accessing the SerpApi API in Python
- Create a new main.py file.
- Install requests with:
pip install requests
Here is what the basic setup looks like:
import requests
SERPAPI_API_KEY = "YOUR_REAL_SERPAPI_API_KEY"
params = {
    "api_key": SERPAPI_API_KEY,  # replace with your real API key
    # search parameters will be added here shortly
}
search = requests.get("https://serpapi.com/search", params=params)
response = search.json()
print(response)
With these few lines of code, we can access all of the search engines available at SerpApi, including the Amazon Search API.
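For example, switching the engine parameter is all it takes to target a different engine. Here is a minimal sketch for the Amazon Search API (note that, unlike Google, the Amazon engine takes its search query in the k parameter; the query below is just a placeholder):

import requests

SERPAPI_API_KEY = "YOUR_SERPAPI_API_KEY"

params = {
    "api_key": SERPAPI_API_KEY,
    "engine": "amazon",   # switch engines by changing this value
    "k": "coffee maker",  # the Amazon engine uses "k" for the search query
}

search = requests.get("https://serpapi.com/search", params=params)
print(search.json())

Back to Google, here is a basic search: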
import requests
SERPAPI_API_KEY = "YOUR_SERPAPI_API_KEY"
params = {
    "api_key": SERPAPI_API_KEY,
    "engine": "google",
    "q": "caffe latte"  # the Google engine takes its search query in the "q" parameter
}
search = requests.get("https://serpapi.com/search", params=params)
response = search.json()
print(response)
To make the response easier to read, let's add indentation.
import json
# ... all the previous code
print(json.dumps(response, indent=2))
Under the organic_results key, you can see all the organic results from the Google SERP.
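Here is a trimmed, illustrative sketch of what a single entry contains (the exact fields and values depend on your query):

{
  "position": 1,
  "title": "...",
  "link": "https://...",
  "displayed_link": "...",
  "snippet": "..."
}

Each entry carries the link field, which is what we need for the next step.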
Collect the links
Since we only care about the links, let's gather them all in a variable.
search = requests.get("https://serpapi.com/search", params=params)
response = search.json()
urls = [result["link"] for result in response.get("organic_results", []) if "link" in result]
We've successfully collected all the links!
If you want to learn more about how to use our API in Python, you can read this blog post:
Scrape external links from organic results
Next, we need to scrape the individual external links. We could do it manually, one by one, using the requests package. However, in this tutorial, I'll use a simple package that I made in Python. This package simplifies scraping the main content from multiple websites.
Step by Step
Install the package:
pip install py-websites-scraper
Quick usage example:
import asyncio
from py_websites_scraper import scrape_urls

urls = ["https://news.ycombinator.com", "https://example.com"]  # change this

data = asyncio.run(scrape_urls(urls, max_concurrency=5))

for item in data:
    if item["success"] is True:
        print(item["url"], item.get("title"), item.get("content"))
    else:
        print("Failed fetching this URL: " + item["url"])
In our case, we need to replace the urls value with all the links we've gathered previously.
Warning: This package simply performs a plain request to the targeted websites. You may need to add custom logic for JavaScript-rendered websites or sites that use other blocking methods.
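If you need that kind of custom handling, one simple approach (a sketch building on the quick usage example above) is to collect the URLs that failed so you can re-scrape them with a different tool, such as a headless browser:

# URLs that could not be fetched with a plain request
failed_urls = [item["url"] for item in data if not item["success"]]
print(failed_urls)  # re-scrape these with a headless browser or another tool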
Let's continue from where we left off:
# ... all the previous code

search = requests.get("https://serpapi.com/search", params=params)
response = search.json()

urls = [result["link"] for result in response.get("organic_results", []) if "link" in result]

data = asyncio.run(
    scrape_urls(
        urls,
        max_concurrency=5
    )
)

for item in data:
    if item["success"] is True:
        print(item["url"], item.get("title"), item.get("content"))
    else:
        print("Failed fetching this URL: " + item["url"])
Of course, you can do anything you want inside the if branch. In this example, we simply print the URL, title, and content.
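For instance, instead of printing, you could collect the successful results into a list of dictionaries for later analysis (a small sketch reusing the same fields):

scraped_pages = []

for item in data:
    if item["success"] is True:
        # keep only the fields we care about
        scraped_pages.append({
            "url": item["url"],
            "title": item.get("title"),
            "content": item.get("content"),
        })

print(f"Collected {len(scraped_pages)} pages")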
Using a Proxy
Using the py-websites-scraper package, you can easily add a proxy to unblock some requests.
urls = []  # the links you want to scrape

results = asyncio.run(
    scrape_urls(
        urls,
        proxy="YOUR_PROXY_INFO",
        headers={"User-Agent": "USER_AGENT_INFO"},
    )
)
Export all data to a file
Here is an example of how you can gather all the information in a single file:
import asyncio
import requests
from py_websites_scraper import scrape_urls

SERPAPI_API_KEY = "YOUR_SERPAPI_KEY"  # replace with your real API key
SECRET_PROXY_INFO = "YOUR_PROXY_INFORMATION"

params = {
    "api_key": SERPAPI_API_KEY,
    "engine": "google",
    "q": "coffee"
}

search = requests.get("https://serpapi.com/search", params=params)
response = search.json()

urls = [result["link"] for result in response.get("organic_results", []) if "link" in result]

data = asyncio.run(
    scrape_urls(
        urls,
        max_concurrency=5,
        proxy=SECRET_PROXY_INFO,
    )
)

total_error = 0
response_text = ''

for item in data:
    if item["success"] is False:
        total_error += 1
        print("URL:", item["url"])
        print("Error:", item["error"])
        print("\n\n")
        print("=========================================")
        continue

    response_text += "URL: " + item["url"] + "\n"
    response_text += "Title: " + str(item["title"]) + "\n"
    response_text += "Content: " + str(item["content"]) + "\n"
    response_text += "\n\n"
    response_text += "=========================================\n"

print(f"Total URLs: {len(data)}")
print(f"Total Errors: {total_error}")

with open("scraped_data.txt", "w", encoding="utf-8") as file:
    file.write(response_text)

print("Scraped data saved to scraped_data.txt")
Whether you keep the results in a variable or write them to an external file like this, once you have the content from these individual websites, you can use the data however your project requires.
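For example, if you prefer structured output over plain text, here is a small variation (a sketch assuming the same data list as above) that writes the successful results to a JSON file instead:

import json

# keep only the successfully scraped pages
pages = [
    {"url": item["url"], "title": item.get("title"), "content": item.get("content")}
    for item in data
    if item["success"]
]

with open("scraped_data.json", "w", encoding="utf-8") as file:
    json.dump(pages, file, ensure_ascii=False, indent=2)

print("Scraped data saved to scraped_data.json")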