What will be scraped

[Image: the carousel at the top of Google search results that will be scraped]

Prerequisites

Install libraries:

pip install requests parsel google-search-results 

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, thus allowing you to extract data from matching tags and attributes.

If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping. It covers what they are, their pros and cons, and why they matter from a web-scraping perspective.

📌 Note: Only this layout is covered in this blog post. There are at least 3 different carousel layouts.

SerpApi is a paid API with a free plan. It lets you skip figuring out how to bypass blocks from search engines and focus on which data to extract. This section compares the DIY solution below with our API.

from serpapi import GoogleSearch
import json

def serpapi_get_top_carousel():
    params = {
      "api_key": "...",                # https://serpapi.com/manage-api-key
      "engine": "google",              # search engine
      "q": "dune actors",              # search query
      "hl": "en",                      # language
      "gl": "us"                       # country
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['knowledge_graph']['cast']:
        print(json.dumps(result, indent=2))


serpapi_get_top_carousel()

Part of the output:

{
  "name": "Timothée Chalamet",
  "extensions": [
    "Paul Atreides"
  ],
  "link": "https://www.google.com/search?hl=en&gl=us&q=Timoth%C3%A9e+Chalamet&stick=H4sIAAAAAAAAAONgFuLVT9c3NEzLqko2ii8xUOLSz9U3KDDKM0wr0BLKTrbST8vMyQUTVsmJxSWPGJcycgu8_HFPWGo246Q1J68xTmHkwqJOyJCLzTWvJLOkUkhQip8L1RIjEahAtll2hpFZXqHAwmWzGJWcjUx2XZp2jk1P8FkoA0Ndb4iDkiLnFCHrhswn7-wFXd__299ywsBBgkWBQYPB8JElq8P6KYwHtBgOMDI17VtxiI2Fg1GAwYpJg6mKiYOFZxGrUEhmbn5JxuGVqQrOGYk5ibmpJRPYGAHILgFT8gAAAA&sa=X&ved=2ahUKEwiMxLi-ksXzAhUAl2oFHf88AN0Q-BZ6BAgBEDQ",
  "image": "https://serpapi.com/searches/6165a3dcfa86759a4fa42ba4/images/94afec67f82aa614bb572a123ec09cf051cf10bde8e0bc8025daf21915c49798.jpeg"
} ... other results
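
The lookup results['knowledge_graph']['cast'] raises a KeyError if a response does not contain those keys. The sketch below shows one possible way to guard against that. The extract_cast helper and the sample_results dictionary are hypothetical; the sample only mimics the shape of the output shown above, with a shortened link.

```python
def extract_cast(results):
    """Return simplified cast entries, or an empty list if the keys are absent."""
    cast = results.get("knowledge_graph", {}).get("cast", [])

    simplified = []
    for member in cast:
        extensions = member.get("extensions") or []
        simplified.append({
            "name": member.get("name"),
            "first_extension": extensions[0] if extensions else None,
            "link": member.get("link"),
        })
    return simplified


# Hypothetical sample shaped like the output above (link shortened)
sample_results = {
    "knowledge_graph": {
        "cast": [
            {
                "name": "Timothée Chalamet",
                "extensions": ["Paul Atreides"],
                "link": "https://www.google.com/search?q=Timoth%C3%A9e+Chalamet",
            }
        ]
    }
}

print(extract_cast(sample_results))
print(extract_cast({}))  # no knowledge graph in the response
```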

Full DIY Code

import requests, re, json
from parsel import Selector

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
  "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36"
  }

params = {
      "q": "dune actors",  # search query
      "gl": "us"           # country to search from
  }


def parsel_get_top_carousel():
  html = requests.get('https://www.google.com/search', headers=headers, params=params, timeout=30)
  selector = Selector(text=html.text)

  carousel_name = selector.css(".yKMVIe::text").get()
  all_script_tags = selector.css("script::text").getall()

  data = {f"{carousel_name}": []}

  decoded_thumbnails = []

  for _id in selector.css("img.d7ENZc::attr(id)").getall():
    # https://regex101.com/r/YGtoJn/1
    thumbnails = re.findall(r"var\s?s=\'([^']+)\'\;var\s?ii\=\['{_id}'\];".format(_id=_id), str(all_script_tags))
    thumbnail = [
      bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in thumbnails
      ]
    decoded_thumbnails.append("".join(thumbnail))

  for result, image in zip(selector.css('.QjXCXd.X8kvh'), decoded_thumbnails):

    title = result.css(".JjtOHd::text").get()
    link = f"https://www.google.com{result.css('.QjXCXd div a::attr(href)').get()}"
    extensions = result.css(".ellip.AqEFvb::text").getall()

    if title and link and extensions:
      data[carousel_name].append({
        "title": title,
        "link": link,
        "extensions": extensions,
        "thumbnail": image
        })

  print(json.dumps(data, indent=2, ensure_ascii=False))


parsel_get_top_carousel()

Code Explanation

Thumbnail extraction

Grabbing the src attribute from the img.d7ENZc CSS selector returns a 1x1 placeholder instead of the actual thumbnail. The thumbnails are located in the <script> tags. To grab them, we need to:

  1. Locate the image element via Dev Tools.
  2. Copy its id value.
  3. Open the page source (CTRL+U), press CTRL+F, and paste the id value to find it.

Most likely you'll see two occurrences, and the second one will be somewhere in the <script> tags. That's what we're looking for.

Now we need to match each image id with the data:image string extracted from the <script> elements to get the right image:

selector = Selector(text=html.text)

# grabs every script element
all_script_tags = selector.css("script::text").getall()

# list to temporarily store thumbnail data
decoded_thumbnails = []

# iterating over each image ID
# using _id because id is a Python built-in name
for _id in selector.css("img.d7ENZc::attr(id)").getall():
  # https://regex101.com/r/YGtoJn/1
  thumbnails = re.findall(r"var\s?s=\'([^']+)\'\;var\s?ii\=\['{_id}'\];".format(_id=_id), str(all_script_tags))
  thumbnail = [
    bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in thumbnails
    ]
  decoded_thumbnails.append("".join(thumbnail))

Code Explanation

  - css("img.d7ENZc::attr(id)") grabs every image id.
  - getall() returns a list of all matches.
  - re.findall() finds all matches of a regular expression.
  - r"<expression>" is a raw string that holds the regular expression.
  - ([^']+) is a regex capture group that matches everything up to the next single quote.
  - \['{_id}'\] is where the image id is inserted into the regular expression, so that only the matching image is captured.
  - format(_id=_id) fills the {_id} placeholder. String interpolation (an f-string) would look a bit awkward here.
  - bytes().decode("unicode-escape") turns escape sequences such as \x3d back into the characters they stand for. It is applied twice because str(all_script_tags) produces the repr of a list, which doubles every backslash.
  - "".join(thumbnail) joins the list of decoded matches into a single string.
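
To see the matching and decoding steps in isolation, here is a minimal sketch that runs on made-up script contents instead of a live page. The ids dimg_1 and dimg_3 and the base64 payloads are hypothetical. Unlike the code above, the scripts are joined into one plain string rather than passed through str() on a list, so a single unicode-escape decode is enough.

```python
import re

# Hypothetical inline scripts. "\\x3d" in these literals is the four
# characters \x3d, as they would appear in the page source.
scripts = [
    "(function(){var s='data:image/jpeg;base64,AAAA\\x3d\\x3d';var ii=['dimg_1'];})();",
    "(function(){var s='data:image/jpeg;base64,BBBB\\x3d';var ii=['dimg_3'];})();",
]
image_ids = ["dimg_1", "dimg_3"]

# Join the scripts as plain text so backslashes are not doubled
page_scripts = "\n".join(scripts)

thumbnails = {}
for image_id in image_ids:
    # Capture the data URI that sits right before this particular image id
    pattern = r"var\s?s='([^']+)';var\s?ii=\['" + re.escape(image_id) + r"'\];"
    match = re.search(pattern, page_scripts)
    if match:
        # Turn escape sequences such as \x3d back into "="
        thumbnails[image_id] = match.group(1).encode("ascii").decode("unicode-escape")

print(thumbnails)
```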

Output from decoded_thumbnails:

# data:image is shortened on purpose, 
# so the output would not cover the entire page  
[
  "", 
  "other images ..."
]

The next step is to iterate over the CSS containers that hold the title, link, and extensions, together with decoded_thumbnails:

for result, image in zip(selector.css(".QjXCXd.X8kvh"), decoded_thumbnails):
  title = result.css(".JjtOHd::text").get()
  link = f"https://www.google.com{result.css('.QjXCXd div a::attr(href)').get()}"
  extensions = result.css(".ellip.AqEFvb::text").getall()

Code Explanation

  - zip() lets you iterate over multiple iterables in a single for loop, stopping at the shortest one.
  - ::text is a parsel pseudo-element that extracts text node data, equivalent to XPath <node>/text().
  - ::attr(<attribute>) is a parsel pseudo-element that grabs attribute data from the node, equivalent to XPath <node>/@<attribute>.
  - get() returns the first match, or None if there is none.
  - getall() returns a list of all matches.

The next step is to check that the extracted title, link, and extensions have values, append them to a temporary list, and print the data:

data = {f"{carousel_name}": []}

if title and link and extensions:
  data[carousel_name].append({
    "title": title,
    "link": link,
    "extensions": extensions,
    "thumbnail": image
    })

print(json.dumps(data, indent=2, ensure_ascii=False))

Output:

{
  "Dune": [
    {
      "title": "Zendaya", ... first results
      "link": "https://www.google.com/search?gl=us&q=Zendaya&stick=H4sIAAAAAAAAAONgFuLVT9c3NEzLqko2ii8xUOLSz9U3SElJM7So0BLKTrbST8vMyQUTVsmJxSWLWNmjUvNSEisTAY7G9vs7AAAA&sa=X&ved=2ahUKEwjp99fw1972AhXXXM0KHeWoAX4Q9OUBegQIAxAC",
      "extensions": [
        "Chani"
      ],
      "thumbnail": ""
    }, ... other results
    {
      "title": "Javier Bardem", ... last results
      "link": "https://www.google.com/search?gl=us&q=Javier+Bardem&stick=H4sIAAAAAAAAAONgFuLVT9c3NEzLqko2ii8xUOLUz9U3MDQ3NE7WEspOttJPy8zJBRNWyYnFJYtYeb0SyzJTixScEotSUnMBeUccjEAAAAA&sa=X&ved=2ahUKEwjp99fw1972AhXXXM0KHeWoAX4Q9OUBegQIAxAQ",
      "extensions": [
        "Stilgar"
      ],
      "thumbnail": ""
    }
  ]
}
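
One caveat about the link check: because link is built with an f-string, it is never empty, even when no href was found (it becomes "https://www.google.comNone"). The sketch below shows one possible way to validate the raw href before building the URL. build_entry and BASE_URL are hypothetical names, not part of the code above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.google.com"


def build_entry(title, href, extensions, thumbnail):
    """Return a result dict, or None when the title or href is missing."""
    if not title or not href:
        return None
    return {
        "title": title,
        "link": urljoin(BASE_URL, href),  # resolves the relative href
        "extensions": extensions or [],
        "thumbnail": thumbnail,
    }


print(build_entry("Zendaya", "/search?gl=us&q=Zendaya", ["Chani"], ""))
print(build_entry("Zendaya", None, [], ""))  # skipped: no href was extracted
```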
