Scrape Google Books in Python

Intro

Currently, we don't have an API that supports extracting data from Google Books.

This blog post shows how you can do it yourself with the DIY solution below while we work on releasing a proper API.

The solution is suitable for personal use: it doesn't include the Legal US Shield that we offer on our paid Production and above plans, and it has limitations such as the need to bypass blocks, for example, CAPTCHA.

You can check our public roadmap to track the progress for this API:

What will be scraped

📌Note: For now, we don't have an API that supports extracting Google Books data.

This blog post shows how you can do it yourself while we work on releasing a proper API. We'll let you know on our Twitter once this API is released.

Prerequisites

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, thus allowing you to extract data from matching tags and attributes.

If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.

Separate virtual environment

In short, a virtual environment is an isolated set of installed libraries, optionally with its own Python version, that can coexist with other environments on the same system, which prevents library or Python version conflicts.

If you haven't worked with a virtual environment before, have a look at my dedicated blog post, Python virtual environments tutorial using Virtualenv and Poetry, to get familiar.

📌Note: This is not a strict requirement for this blog post.

Install libraries:

pip install requests parsel

Reduce the chance of being blocked

There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web-scraping, which covers eleven methods to bypass blocks from most websites.


Full Code

from parsel import Selector
import requests, json, re

params = {
    "q": "richard branson",
    "tbm": "bks",
    "gl": "us",
    "hl": "en"
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
selector = Selector(text=html.text)

books_results = []

# https://regex101.com/r/mapBs4/1
book_thumbnails = re.findall(r"s=\\'data:image/jpg;base64,(.*?)\\'", str(selector.css("script").getall()), re.DOTALL)

for book_thumbnail, book_result in zip(book_thumbnails, selector.css(".Yr5TG")):
    title = book_result.css(".DKV0Md::text").get()
    link = book_result.css(".bHexk a::attr(href)").get()
    displayed_link = book_result.css(".tjvcx::text").get()
    snippet = book_result.css(".cmlJmd span::text").get()
    author = book_result.css(".fl span::text").get()
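    # the href already starts with "/search", so prepend only the domain; keep None if the link is missing
    author_href = book_result.css(".N96wpd .fl::attr(href)").get()
    author_link = f"https://www.google.com{author_href}" if author_href is not None else None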
    date_published = book_result.css(".fl+ span::text").get()
    preview_link = book_result.css(".R1n8Q a.yKioRe:nth-child(1)::attr(href)").get()
    more_editions_link = book_result.css(".R1n8Q a.yKioRe:nth-child(2)::attr(href)").get()

    books_results.append({
        "title": title,
        "link": link,
        "displayed_link": displayed_link,
        "snippet": snippet,
        "author": author,
        "author_link": author_link,
        "date_published": date_published,
        "preview_link": preview_link,
        "more_editions_link": f"https://www.google.com{more_editions_link}" if more_editions_link is not None else None,
        "thumbnail": bytes(bytes(book_thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape")
    })


print(json.dumps(books_results, indent=2))

Code explanation

Import libraries:

from parsel import Selector
import requests, json, re
  • parsel is a library to extract and remove data from HTML and XML using XPath and CSS selectors. It's similar to beautifulsoup4 except it supports full XPath and has its own CSS pseudo-elements support, for example ::text or ::attr(<attribute_name>).

Create search query parameters and request headers:

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "richard branson",  # search query
    "tbm": "bks",            # book results
    "gl": "us",              # country to search from
    "hl": "en"               # language
}

# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
  • User-Agent makes the request look like it comes from a real browser rather than from a script. Without it, requests identifies itself as python-requests. It's the most basic way to reduce the chance of being blocked by a website.
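
To see what these two dictionaries turn into, here is a small sketch that prepares the request without sending it. It uses the same params and headers as above, so no network access is needed.

```python
import requests

params = {"q": "richard branson", "tbm": "bks", "gl": "us", "hl": "en"}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

# Build the request but do not send it.
prepared = requests.Request(
    "GET", "https://www.google.com/search", params=params, headers=headers
).prepare()

print(prepared.url)                         # query params encoded into the URL
print(prepared.headers["User-Agent"])       # the browser-like value we passed
print(requests.utils.default_user_agent())  # what requests sends when no User-Agent is set
```

The last line prints a value that starts with python-requests/, which is easy for a website to recognize as a script.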

Pass query params, request headers to the request and create a Selector object:

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
selector = Selector(text=html.text)

Create a temporary list to store the data:

books_results = []

Match thumbnails data using a regular expression:

# https://regex101.com/r/mapBs4/1
book_thumbnails = re.findall(r"s=\\'data:image/jpg;base64,(.*?)\\'", str(selector.css("script").getall()), re.DOTALL)

We parse the thumbnails from <script> tags because the src attribute of the <img> element holds only a 1x1 placeholder, not the actual thumbnail.

  • re.findall() returns a list of all matches.
  • selector.css("script") returns a <class 'SelectorList'> of all <script> tags found, and getall() returns their contents as a list of strings.
  • re.DOTALL makes . match any character, including a newline. Without this flag, . matches every character except a newline.
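
The effect of re.DOTALL is easy to check on a small string. This is a simplified sketch: the script text is made up, and the pattern uses plain quotes instead of the escaped ones from the real page.

```python
import re

# Hypothetical script content: the payload spans two lines.
script_text = "s='data:image/jpg;base64,AAAA\nBBBB'"

pattern = r"s='data:image/jpg;base64,(.*?)'"

without_flag = re.findall(pattern, script_text)         # . stops at the newline, so no match
with_flag = re.findall(pattern, script_text, re.DOTALL)  # . also matches the newline

print(without_flag)
print(with_flag)
```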

Iterate over matched thumbnails and CSS container with all the needed data and extract it:

for book_thumbnail, book_result in zip(book_thumbnails, selector.css(".Yr5TG")):
    title = book_result.css(".DKV0Md::text").get()
    link = book_result.css(".bHexk a::attr(href)").get()
    displayed_link = book_result.css(".tjvcx::text").get()
    snippet = book_result.css(".cmlJmd span::text").get()
    author = book_result.css(".fl span::text").get()
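    # the href already starts with "/search", so prepend only the domain; keep None if the link is missing
    author_href = book_result.css(".N96wpd .fl::attr(href)").get()
    author_link = f"https://www.google.com{author_href}" if author_href is not None else None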
    date_published = book_result.css(".fl+ span::text").get()
    preview_link = book_result.css(".R1n8Q a.yKioRe:nth-child(1)::attr(href)").get()
    more_editions_link = book_result.css(".R1n8Q a.yKioRe:nth-child(2)::attr(href)").get()
  • zip() iterates over several iterables in parallel and yields a tuple with one item from each. It stops at the shortest iterable.
  • css(".Yr5TG") is like calling soup.select(".Yr5TG") with bs4: it returns a list of matches.
  • css(".DKV0Md::text"): ::text is a parsel-specific pseudo-element that selects the text of the element, and get() returns the first match as a string, or None if nothing matched. Without get() you get a <class 'SelectorList'> of <class 'Selector'> objects instead of a string.
  • ::attr(href) is also a parsel pseudo-element; it grabs the value of an attribute.
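
One thing to keep in mind with zip(): if fewer thumbnails are matched than result containers, the extra results are silently dropped. The sketch below uses placeholder strings instead of real page data and shows itertools.zip_longest as one possible alternative, not something the code above uses.

```python
from itertools import zip_longest

thumbnails = ["thumb_1", "thumb_2"]       # hypothetical: only two thumbnails matched
results = ["book_1", "book_2", "book_3"]  # hypothetical: three result containers

paired = list(zip(thumbnails, results))          # stops at the shortest iterable
padded = list(zip_longest(thumbnails, results))  # pads the missing thumbnail with None

print(paired)
print(padded)
```

With zip_longest you would have to handle a None thumbnail before decoding it.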

Append the data to temporary list as a dict:

books_results.append({
    "title": title,
    "link": link,
    "displayed_link": displayed_link,
    "snippet": snippet,
    "author": author,
    "author_link": author_link,
    "date_published": date_published,
    "preview_link": preview_link,
    # prepend "https://www.google.com" only if the URL is present; otherwise keep None instead of "https://www.google.comNone"
    "more_editions_link": f"https://www.google.com{more_editions_link}" if more_editions_link is not None else None,
    "thumbnail": bytes(bytes(book_thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape")
})
  • bytes().decode("unicode-escape") decodes escape sequences such as \x3d. We do it twice because one pass still leaves escape sequences in the string. A likely cause is that str() applied to the list returned by getall() escapes every backslash a second time.
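
Here is a minimal sketch of why two decoding passes can be needed. The payload is made up, and decode_escapes is a hypothetical helper name that wraps the same bytes().decode() call used above.

```python
def decode_escapes(value: str) -> str:
    """One pass of unicode-escape decoding (hypothetical helper)."""
    return bytes(value, "ascii").decode("unicode-escape")

# Hypothetical payload as it might sit in a script: AAAA followed by a literal \x3d
payload = "AAAA\\x3d"

# str() on a list uses repr() for its items, which doubles every backslash.
inner = str([payload])[2:-2]  # strip the surrounding [' and ']

once = decode_escapes(inner)  # undoes the doubling added by str()
twice = decode_escapes(once)  # turns \x3d into =

print(inner)
print(once)
print(twice)
```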

Print the data:

print(json.dumps(books_results, indent=2))

Part of the JSON output:

[
  {
    "title": "The Virgin Way: How to Listen, Learn, Laugh and Lead",
    "link": "https://books.google.com/books?id=Jkp1AgAAQBAJ&printsec=frontcover&dq=richard+branson&hl=en&newbks=1&newbks_redir=1&sa=X&ved=2ahUKEwin3IrX-_n1AhXclmoFHbMHDfIQ6AF6BAgIEAI",
    "displayed_link": "books.google.com",
    "snippet": "This is not a conventional book on leadership. There are no rules \u2013 but rather the secrets of leadership that he has learned along the way from his days at Virgin Records, to his recent work with The Elders.",
    "author": "Sir Richard Branson",
    "author_link": "https://www.google.com/search?gl=us&hl=en&tbm=bks&tbm=bks&q=inauthor:%22Sir+Richard+Branson%22&sa=X&ved=2ahUKEwin3IrX-_n1AhXclmoFHbMHDfIQ9Ah6BAgIEAU",
    "date_published": "2014",
    "preview_link": "https://books.google.com/books?id=Jkp1AgAAQBAJ&printsec=frontcover&dq=richard+branson&hl=en&newbks=1&newbks_redir=1&sa=X&ved=2ahUKEwin3IrX-_n1AhXclmoFHbMHDfIQuwV6BAgIEAc",
    "more_editions_link": "https://www.google.com/books/edition/The_Virgin_Way/Jkp1AgAAQBAJ?hl=en&gl=us&kptab=editions&sa=X&ved=2ahUKEwin3IrX-_n1AhXclmoFHbMHDfIQmBZ6BAgIEAg",
    "thumbnail": ""
  }, ... other results
]
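
The \u2013 in the snippet above is an en dash: json.dumps() escapes non-ASCII characters by default. If you prefer readable output, ensure_ascii=False is one option. The sketch below uses a shortened, made-up record rather than real scraped data.

```python
import json

# Shortened record for illustration; \u2013 is an en dash.
books_results = [{"snippet": "There are no rules \u2013 but rather the secrets"}]

escaped = json.dumps(books_results, indent=2)                       # default: non-ASCII is escaped
readable = json.dumps(books_results, indent=2, ensure_ascii=False)  # keeps the dash as is

print(escaped)
print(readable)
```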
