What will be scraped
Prerequisites
Install libraries:
pip install requests parsel google-search-results
Basic knowledge of scraping with CSS selectors
CSS selectors declare which part of the markup a style applies to, thus allowing you to extract data from matching tags and attributes.
If you haven't scraped with CSS selectors before, I have a dedicated blog post about how to use CSS selectors when web scraping. It covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
📌 Note: Only one carousel layout (the top carousel) is covered in this blog post. Google serves at least 3 different carousel layouts.
Using Google Top Carousel API
SerpApi is a paid API with a free plan. It lets you skip figuring out how to bypass blocks from search engines and focus on which data to extract. This section compares our API with the DIY solution below.
from serpapi import GoogleSearch
import json


def serpapi_get_top_carousel():
    params = {
        "api_key": "...",     # https://serpapi.com/manage-api-key
        "engine": "google",   # search engine
        "q": "dune actors",   # search query
        "hl": "en",           # language
        "gl": "us"            # country
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['knowledge_graph']['cast']:
        print(json.dumps(result, indent=2))


serpapi_get_top_carousel()
Part of the output:
{
"name": "Timothée Chalamet",
"extensions": [
"Paul Atreides"
],
"link": "https://www.google.com/search?hl=en&gl=us&q=Timoth%C3%A9e+Chalamet&stick=H4sIAAAAAAAAAONgFuLVT9c3NEzLqko2ii8xUOLSz9U3KDDKM0wr0BLKTrbST8vMyQUTVsmJxSWPGJcycgu8_HFPWGo246Q1J68xTmHkwqJOyJCLzTWvJLOkUkhQip8L1RIjEahAtll2hpFZXqHAwmWzGJWcjUx2XZp2jk1P8FkoA0Ndb4iDkiLnFCHrhswn7-wFXd__299ywsBBgkWBQYPB8JElq8P6KYwHtBgOMDI17VtxiI2Fg1GAwYpJg6mKiYOFZxGrUEhmbn5JxuGVqQrOGYk5ibmpJRPYGAHILgFT8gAAAA&sa=X&ved=2ahUKEwiMxLi-ksXzAhUAl2oFHf88AN0Q-BZ6BAgBEDQ",
"image": "https://serpapi.com/searches/6165a3dcfa86759a4fa42ba4/images/94afec67f82aa614bb572a123ec09cf051cf10bde8e0bc8025daf21915c49798.jpeg"
} ... other results
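The loop above indexes `results['knowledge_graph']['cast']` directly, which raises a `KeyError` if either key is absent. Here is a minimal sketch of a more defensive lookup. It assumes (I have not verified this for every query) that a response without a carousel simply omits those keys. The `get_cast` helper and the sample dictionaries are hypothetical and only mimic the shape of the output shown above:

```python
import json

# Hypothetical, trimmed response that mimics the shape of the output above.
results = {
    "knowledge_graph": {
        "cast": [
            {"name": "Timothée Chalamet", "extensions": ["Paul Atreides"]},
        ]
    }
}


def get_cast(results: dict) -> list:
    """Return the cast list, or an empty list when a key is missing."""
    return results.get("knowledge_graph", {}).get("cast", [])


for member in get_cast(results):
    print(json.dumps(member, indent=2, ensure_ascii=False))

# A response without these keys yields an empty list instead of a KeyError.
print(get_cast({}))
```

With this approach the `for` loop just does nothing when there is no carousel, so the script keeps running for queries that don't return one.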
Full DIY Code
import json
import re

import requests
from parsel import Selector

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36"
}

params = {
    "q": "dune actors",  # search query
    "gl": "us"           # country to search from
}


def parsel_get_top_carousel():
    html = requests.get('https://www.google.com/search', headers=headers, params=params, timeout=30)
    selector = Selector(text=html.text)

    carousel_name = selector.css(".yKMVIe::text").get()
    all_script_tags = selector.css("script::text").getall()

    data = {f"{carousel_name}": []}
    decoded_thumbnails = []

    for _id in selector.css("img.d7ENZc::attr(id)").getall():
        # https://regex101.com/r/YGtoJn/1
        thumbnails = re.findall(r"var\s?s=\'([^']+)\'\;var\s?ii\=\['{_id}'\];".format(_id=_id), str(all_script_tags))
        thumbnail = [
            bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in thumbnails
        ]
        decoded_thumbnails.append("".join(thumbnail))

    for result, image in zip(selector.css('.QjXCXd.X8kvh'), decoded_thumbnails):
        title = result.css(".JjtOHd::text").get()
        link = f"https://www.google.com{result.css('.QjXCXd div a::attr(href)').get()}"
        extensions = result.css(".ellip.AqEFvb::text").getall()

        # getall() returns a list (never None), so check that it is not empty
        if title and link and extensions:
            data[carousel_name].append({
                "title": title,
                "link": link,
                "extensions": extensions,
                "thumbnail": image
            })

    print(json.dumps(data, indent=2, ensure_ascii=False))


parsel_get_top_carousel()
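One detail worth knowing about the `link` line: when the selector finds no `href`, `get()` returns `None`, and the f-string still produces a non-empty string ending in `None`. Below is a small sketch of one way to guard against that. `build_link` is a hypothetical helper of mine, not part of the code above:

```python
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://www.google.com"


def build_link(href: Optional[str]) -> Optional[str]:
    """Join a relative href with the base URL, or return None if it is missing."""
    if not href:
        return None
    return urljoin(BASE_URL, href)


print(build_link("/search?gl=us&q=Zendaya"))  # https://www.google.com/search?gl=us&q=Zendaya
print(build_link(None))                       # None

# For comparison, the f-string approach with a missing href:
print(f"{BASE_URL}{None}")                    # https://www.google.comNone
```

Returning `None` for a missing `href` means a later truthiness check on the link can actually filter out incomplete results.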
Code Explanation
Thumbnail extraction
Parsing thumbnails from the img.d7ENZc CSS selector to grab the src attribute will bring a 1x1 placeholder instead of the actual thumbnail. Thumbnails are located in the <script> tags. In order to grab them, we need to:
- Locate the image element via Dev Tools.
- Copy its id value.
- Open the page source (CTRL+U), press CTRL+F and paste the id value to find it.
Most likely you'll see two occurrences, and the second one will be somewhere in the <script> tags. That's what we're looking for.
Now we need to match the image id with the extracted data:image from the <script> elements to extract the right image:
selector = Selector(text=html.text)

# grabs every script element
all_script_tags = selector.css("script::text").getall()

# list to temporarily store thumbnail data
decoded_thumbnails = []

# iterating over each image ID
# using _id because id is a Python built-in name
for _id in selector.css("img.d7ENZc::attr(id)").getall():
    # https://regex101.com/r/YGtoJn/1
    thumbnails = re.findall(r"var\s?s=\'([^']+)\'\;var\s?ii\=\['{_id}'\];".format(_id=_id), str(all_script_tags))
    thumbnail = [
        bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in thumbnails
    ]
    decoded_thumbnails.append("".join(thumbnail))
Code | Explanation |
---|---|
css("img.d7ENZc::attr(id)") | to grab every image id. |
getall() | returns a list of matches. |
re.findall() | to find all matches via regular expression. |
r"<expression>" | a raw string that holds the regular expression. |
([^']+) | is a regex capture group. |
\['{_id}'\] | is the parsed image id that is passed to the regular expression to match the correct image. |
format(_id=_id) | fills the {_id} placeholder. String interpolation would look a bit awkward. |
bytes().decode() | to decode escape sequences in the matched data into regular characters. |
"".join(thumbnail) | to join the elements of the list into a single string. |
Output from decoded_thumbnails:
# data:image is shortened on purpose,
# so the output would not cover the entire page
[
"",
"other images ..."
]
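To make the matching and decoding steps easier to follow in isolation, here is a self-contained sketch that runs against a made-up script string instead of a live page. The `dimg_1` id, the base64 payload and the `extract_thumbnail` helper are all hypothetical. Note that it searches the raw script text directly, so a single `unicode-escape` decode is enough. My assumption is that the code above decodes twice because `str()` on a list escapes the backslashes once more:

```python
import re

# Hypothetical inline script text: the id and the base64 payload are made up.
script_text = r"var s='data:image/jpeg;base64,AAAA\x3d\x3d';var ii=['dimg_1'];"


def extract_thumbnail(script_text: str, image_id: str) -> str:
    """Return the decoded data:image string paired with image_id, or ''."""
    # re.escape() guards against ids that contain regex metacharacters
    pattern = r"var\s?s='([^']+)';var\s?ii=\['{_id}'\];".format(_id=re.escape(image_id))
    match = re.search(pattern, script_text)
    if match is None:
        return ""
    # turn escape sequences such as \x3d back into the characters they stand for
    return bytes(match.group(1), "ascii").decode("unicode-escape")


print(extract_thumbnail(script_text, "dimg_1"))  # data:image/jpeg;base64,AAAA==
print(repr(extract_thumbnail(script_text, "dimg_2")))  # ''
```

If you join the script tags with `"\n".join(...)` rather than `str(...)`, the same single decode should apply, but treat that as something to test against the real page source.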
Title, link and extensions extraction
The next step is to iterate over the CSS container with the title, link, and extensions together with decoded_thumbnails:
for result, image in zip(selector.css(".QjXCXd.X8kvh"), decoded_thumbnails):
    title = result.css(".JjtOHd::text").get()
    link = f"https://www.google.com{result.css('.QjXCXd div a::attr(href)').get()}"
    extensions = result.css(".ellip.AqEFvb::text").getall()
Code | Explanation |
---|---|
zip() | allows iterating over multiple iterables in a single for loop. |
::text | a parsel pseudo-element to extract textual node data, equivalent to XPath <node>/text() |
::attr(<attribute>) | a parsel pseudo-element to grab attribute data from the node, equivalent to XPath <node>/@<attribute> |
get() | to return the first match. |
getall() | to return a list of all matches. |
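One property of `zip()` is worth keeping in mind here: it stops at the shortest iterable, so if fewer thumbnails than results were decoded, the extra results are silently dropped. A quick sketch with placeholder values (not real scraped data) shows the difference between `zip()` and `itertools.zip_longest()`:

```python
from itertools import zip_longest

# Placeholder values standing in for scraped results and decoded thumbnails.
names = ["first", "second", "third"]
thumbnails = ["data:image/jpeg;base64,AAA", "data:image/jpeg;base64,BBB"]

# zip() stops at the shortest iterable: "third" is dropped.
paired = list(zip(names, thumbnails))
print(len(paired))  # 2

# zip_longest() keeps every item and pads the gaps with fillvalue.
padded = list(zip_longest(names, thumbnails, fillvalue=""))
print(padded[-1])  # ('third', '')
```

Which behavior you want depends on whether a result without a thumbnail is still useful to you.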
The next step is to check if the extracted title, link and extensions have values, append them to the temporary list, and print the data:
# defined before the loop
data = {f"{carousel_name}": []}

# inside the loop: getall() returns a list (never None), so check that it is not empty
if title and link and extensions:
    data[carousel_name].append({
        "title": title,
        "link": link,
        "extensions": extensions,
        "thumbnail": image
    })

# after the loop
print(json.dumps(data, indent=2, ensure_ascii=False))
Output:
{
"Dune": [
{
"title": "Zendaya", ... first results
"link": "https://www.google.com/search?gl=us&q=Zendaya&stick=H4sIAAAAAAAAAONgFuLVT9c3NEzLqko2ii8xUOLSz9U3SElJM7So0BLKTrbST8vMyQUTVsmJxSWLWNmjUvNSEisTAY7G9vs7AAAA&sa=X&ved=2ahUKEwjp99fw1972AhXXXM0KHeWoAX4Q9OUBegQIAxAC",
"extensions": [
"Chani"
],
"thumbnail": ""
}, ... other results
{
"title": "Javier Bardem", ... last results
"link": "https://www.google.com/search?gl=us&q=Javier+Bardem&stick=H4sIAAAAAAAAAONgFuLVT9c3NEzLqko2ii8xUOLUz9U3MDQ3NE7WEspOttJPy8zJBRNWyYnFJYtYeb0SyzJTixScEotSUnMBeUccjEAAAAA&sa=X&ved=2ahUKEwjp99fw1972AhXXXM0KHeWoAX4Q9OUBegQIAxAQ",
"extensions": [
"Stilgar"
],
"thumbnail": ""
}
]
}
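A short note on `ensure_ascii=False` in the `json.dumps()` call: by default `json.dumps()` escapes non-ASCII characters, which makes names with accents harder to read in the printed output. Here is a small sketch using a name from the API output earlier in this post:

```python
import json

record = {"name": "Timothée Chalamet"}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps non-ASCII characters

print(escaped)   # {"name": "Timoth\u00e9e Chalamet"}
print(readable)  # {"name": "Timothée Chalamet"}

# Both strings decode back to the same data.
print(json.loads(escaped) == json.loads(readable))  # True
```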