What will be scraped

[Image: the review data that will be scraped]

Prerequisites

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which makes it possible to extract data from matching tags and attributes.

If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.

Separate virtual environment

In short, it creates an independent set of installed libraries, including different Python versions, that can coexist on the same system, thus preventing library or Python version conflicts.

If you haven't worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post to get familiar.

📌Note: this is not a strict requirement for this blog post.

Install libraries:

pip install playwright parsel

You also need to install chromium for playwright to work and operate the browser:

playwright install chromium

After that, if you're on Linux, you might need to install additional system dependencies (playwright will prompt you in the terminal if something is missing):

sudo apt-get install -y libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libatspi2.0-0 libwayland-client0

Reduce the chance of being blocked

There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web-scraping: it describes eleven methods to bypass blocks from most websites, and some of them will be covered in this blog post.

Full Code

import time, json, re
from parsel import Selector
from playwright.sync_api import sync_playwright


def run(playwright):
    page = playwright.chromium.launch(headless=True).new_page()
    page.goto("https://play.google.com/store/apps/details?id=com.collectorz.javamobile.android.books&hl=en_GB&gl=US")

    user_comments = []

    # if "See all reviews" button present
    if page.query_selector('.Jwxk6d .u4ICaf button'):
        print("the button is present.")

        print("clicking on the button.")
        page.query_selector('.Jwxk6d .u4ICaf button').click(force=True)

        print("waiting a few sec to load comments.")
        time.sleep(4)
        
        last_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')  # 2200

        while True:
            print("scrolling..")
            page.keyboard.press("End")
            time.sleep(3)

            new_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')

            if new_height == last_height:
                break
            else:
                last_height = new_height

    selector = Selector(text=page.content())
    page.close()

    print("done scrolling. Extracting comments...")
    for index, comment in enumerate(selector.css(".RHo1pe"), start=1):

        comment_likes = comment.css(".AJTPZc::text").get()   

        user_comments.append({
            "position": index,
            "user_name": comment.css(".X5PpBb::text").get(),
            "user_avatar": comment.css(".gSGphe img::attr(srcset)").get().replace(" 2x", ""),
            "user_comment": comment.css(".h3YV2d::text").get(),
            "comment_likes": comment_likes.split("people")[0].strip() if comment_likes else None,
            "app_rating": re.search(r"\d+", comment.css(".iXRFPc::attr(aria-label)").get()).group(),
            "comment_date": comment.css(".bp9Aid::text").get(),
            "developer_comment": {
                "dev_title": comment.css(".I6j64d::text").get(),
                "dev_comment": comment.css(".ras4vb div::text").get(),
                "dev_comment_date": comment.css(".I9Jtec::text").get()
            }
        })

    print(json.dumps(user_comments, indent=2, ensure_ascii=False))


with sync_playwright() as playwright:
    run(playwright)

Code Explanation

Import libraries:

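import time, json, re
from parsel import Selector
from playwright.sync_api import sync_playwright
  • time to set sleep() intervals between scrolls.
  • json just for pretty printing.
  • re to extract the rating number with a regular expression.
  • Selector from parsel to parse the HTML with CSS selectors.
  • sync_playwright for the synchronous API. playwright also has an asynchronous API built on the asyncio module.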

Declare a function:

def run(playwright):
    # further code..

Initialize playwright, connect to chromium, launch() a browser, open a new_page(), and goto() the given URL:

page = playwright.chromium.launch(headless=True).new_page()
page.goto("https://play.google.com/store/apps/details?id=com.collectorz.javamobile.android.books&hl=en_GB&gl=US")

user_comments = [] # temporary list for all extracted data

Next, we need to check if the button responsible for showing all reviews is present and click on it if present:

if page.query_selector('.Jwxk6d .u4ICaf button'):
    print("the button is present.")

    print("clicking on the button.")
    page.query_selector('.Jwxk6d .u4ICaf button').click(force=True)

    print("waiting a few sec to load comments.")
    time.sleep(4)
  • query_selector is a function that accepts a CSS selector and returns the first matching element, or None if there is no match.
  • click clicks on the button, and force=True bypasses the actionability checks (auto-waits) and clicks immediately.
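
The snippet above looks up the same selector twice. As a small sketch, a helper can query once and reuse the element. The name click_if_present is hypothetical. It only relies on query_selector and click(force=True), the same calls used above, and expects a Playwright page to be passed in.

```python
def click_if_present(page, css_selector):
    """Click the first element matching css_selector, if there is one.

    `page` is expected to be a Playwright Page (or any object exposing
    query_selector). Returns True when a click happened, False otherwise.
    """
    element = page.query_selector(css_selector)  # None when nothing matches
    if element is None:
        return False
    element.click(force=True)  # same forced click as in the snippet above
    return True
```

With it, the check becomes `if click_if_present(page, '.Jwxk6d .u4ICaf button'): time.sleep(4)`.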

Scroll to the bottom of the comments window:

last_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')  # 2200

while True:
    print("scrolling..")
    page.keyboard.press("End")
    time.sleep(3)

    new_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')

    if new_height == last_height:
        break
    else:
        last_height = new_height
  • page.evaluate() runs JavaScript in the browser context. Here it reads scrollTop of the element matched by the .fysCi selector, i.e. the number of pixels that element's content has been scrolled vertically.
  • page.keyboard.press("End") presses the End key to scroll to the bottom of the currently loaded comments.
  • time.sleep(3) pauses execution for 3 seconds so more comments can load.
  • The same JavaScript is then run again to get new_height after the scroll.
  • If new_height == last_height, nothing more was loaded, so break exits the while loop.
  • Otherwise, last_height is set to new_height and the loop scrolls again.
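
The loop logic can also be sketched independently of the browser, which makes it easy to try out. The function name scroll_until_stable and its parameters are hypothetical, and the max_rounds safety cap is an addition that the original loop does not have.

```python
import time


def scroll_until_stable(get_position, scroll_once, pause=3.0, max_rounds=200):
    """Scroll until the reported position stops changing.

    get_position: callable returning the current scroll offset.
    scroll_once:  callable that triggers one scroll step.
    max_rounds:   safety cap so the loop cannot run forever (an addition).
    Returns the number of scroll steps performed.
    """
    last_position = get_position()
    for rounds in range(1, max_rounds + 1):
        scroll_once()
        time.sleep(pause)  # give new comments time to load
        new_position = get_position()
        if new_position == last_position:
            return rounds  # nothing moved, so the bottom was reached
        last_position = new_position
    return max_rounds


# Possible usage with the page object from this post (not executed here):
# scroll_until_stable(
#     get_position=lambda: page.evaluate(
#         '() => document.querySelector(".fysCi").scrollTop'),
#     scroll_once=lambda: page.keyboard.press("End"),
# )
```

Passing the two browser actions in as callables keeps the stopping condition in one place and lets it be checked with plain Python values.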

After that, pass the scrolled HTML content to parsel and close the page:

selector = Selector(text=page.content())
page.close()

Iterate over all results after the while loop is done:

for index, comment in enumerate(selector.css(".RHo1pe"), start=1):

    comment_likes = comment.css(".AJTPZc::text").get()   

    user_comments.append({
        "position": index,
        "user_name": comment.css(".X5PpBb::text").get(),
        "user_avatar": comment.css(".gSGphe img::attr(srcset)").get().replace(" 2x", ""),
        "user_comment": comment.css(".h3YV2d::text").get(),
        "comment_likes": comment_likes.split("people")[0].strip() if comment_likes else None,
        "app_rating": re.search(r"\d+", comment.css(".iXRFPc::attr(aria-label)").get()).group(),
        "comment_date": comment.css(".bp9Aid::text").get(),
        "developer_comment": {
            "dev_title": comment.css(".I6j64d::text").get(),
            "dev_comment": comment.css(".ras4vb div::text").get(),
            "dev_comment_date": comment.css(".I9Jtec::text").get()
        }
    })
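
One thing to watch for: .get() returns None when a selector matches nothing, so .replace(...) and re.search(...).group() will raise on a review that lacks an avatar or a rating. Below is a sketch of None-tolerant helpers. The function names are hypothetical, and the sample strings in the docstrings only assume what the scraped text might look like.

```python
import re


def parse_likes(text):
    """Keep what precedes the word 'people', e.g. '9 people ...' -> '9'."""
    if not text:
        return None
    return text.split("people")[0].strip()


def parse_rating(aria_label):
    """Return the first run of digits in the aria-label, or None."""
    match = re.search(r"\d+", aria_label or "")
    return match.group() if match else None


def clean_avatar(srcset):
    """Drop the ' 2x' density descriptor from a srcset value, if present."""
    return srcset.replace(" 2x", "") if srcset else None
```

For example, "app_rating" could then be written as parse_rating(comment.css(".iXRFPc::attr(aria-label)").get()).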

Print the data:

print(json.dumps(user_comments, indent=2, ensure_ascii=False))
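
To see what ensure_ascii=False changes, here is a tiny sketch with made-up data. By default json.dumps escapes non-ASCII characters, which makes reviews in other languages hard to read.

```python
import json

sample = [{"user_name": "Zoë", "user_comment": "Très bien"}]  # made-up data

escaped = json.dumps(sample)                       # default: ensure_ascii=True
readable = json.dumps(sample, ensure_ascii=False)  # keeps non-ASCII characters

print(escaped)
print(readable)
```

Both strings decode back to the same data, so the flag only affects readability.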

Run your code using a context manager:

with sync_playwright() as playwright:
    run(playwright)

Output

[
  {
    "position": 1,
    "user_name": "Selby Warren",
    "user_avatar": "https://play-lh.googleusercontent.com/a-/ACNPEu9_6h31fmuFO-BQOYPjA2oVz9sJXxaI6sL3ZuPdrw=s64-rw",
    "user_comment": "Tried logging in on multiple different devices, reset the password, uninstalled then reinstalled, all to no avail. The old app was fine, just update that one instead of creating a new one full of errors. @BN the issue has NOT been resolved. The issue is with the app, not the account, so there is nothing customer service can do.",
    "comment_likes": "9",
    "app_rating": "1",
    "comment_date": "2 September 2022",
    "developer_comment": {
      "dev_title": "Barnes & Noble",
      "dev_comment": "Sorry for the difficulties you had signing in. This issue has been addressed, Please try it again now. If the issue persists, contact us at service@bn.com with the account details.",
      "dev_comment_date": "2 September 2022"
    }
  }, ... other results
  {
    "position": 875,
    "user_name": "Originalbigguy",
    "user_avatar": "https://play-lh.googleusercontent.com/a/ALm5wu3dYTOHvlG8SUqgyTbRnjv9I49JtxgySY-RwTJU=s64-rw-mo",
    "user_comment": "Not free",
    "comment_likes": null,
    "app_rating": "1",
    "comment_date": "9 April 2021",
    "developer_comment": {
      "dev_title": "Collectorz.com",
      "dev_comment": "The app is never advertised as free anywhere. The app information clearly states this is a paid subscription app.\n",
      "dev_comment_date": "10 April 2021"
    }
  }
]

Using Google Play Product Reviews API

Since we support extracting review data from Google Play apps, this section compares the DIY solution with ours.

The biggest difference is that you don't need browser automation to scrape the results, and you don't have to build a parser from scratch and maintain it.

Keep in mind that Google might block the request at some point (or show a CAPTCHA). We handle that on our backend.

Install google-search-results from PyPI:

pip install google-search-results

import json
from serpapi import GoogleSearch
from urllib.parse import parse_qsl, urlsplit

params = {
  "api_key": "...",                                        # your serpapi api key
  "engine": "google_play_product",                         # serpapi parsing engine
  "store": "apps",                                         # app results
  "gl": "us",                                              # country of the search
  "hl": "en",                                              # language of the search
  "product_id": "com.collectorz.javamobile.android.books"  # app id
}

search = GoogleSearch(params)                              # where data extraction happens on the backend

reviews = []

while True:
    results = search.get_dict()                            # JSON -> Python dict

    for review in results["reviews"]:
        reviews.append({
            "title": review.get("title"),
            "avatar": review.get("avatar"),
            "rating": review.get("rating"),
            "likes": review.get("likes"),
            "date": review.get("date"),
            "snippet": review.get("snippet"),
            "response": review.get("response")
        })

    # pagination
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination", {}).get("next")).query)))
    else:
        break
        
print(json.dumps(reviews, indent=2, ensure_ascii=False))
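
The pagination line is dense, so here is a sketch of what it does using only the standard library. The "next" URL and the next_page_token parameter below are hypothetical stand-ins. A real URL comes from results["serpapi_pagination"]["next"].

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical "next" link, used here only to show the parsing step.
next_url = (
    "https://serpapi.com/search.json"
    "?engine=google_play_product&next_page_token=abc123"
)

# Parameters of the current request (shortened, with a made-up app id).
current_params = {"engine": "google_play_product", "product_id": "com.example.app"}

# urlsplit(...).query -> "engine=...&next_page_token=..."
# parse_qsl(...)      -> list of (key, value) pairs
# dict(...)           -> mapping that can be merged into the params
next_params = dict(parse_qsl(urlsplit(next_url).query))
current_params.update(next_params)

print(current_params)
```

Existing keys are kept, and anything present in the "next" link is added or overwritten, so the following get_dict() call fetches the next page.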

Output:

[
  {
    "title": "JazzTripp",
    "avatar": "https://play-lh.googleusercontent.com/a-/ACNPEu8THUUDL3yzcd0bHSDRR4OegOWLmfbFi70On0HbRg",
    "rating": 5.0,
    "likes": 20,
    "date": "May 06, 2022",
    "snippet": "This app takes a bit if getting used to at first, but the catalogue is extensive, and most bar codes and isbn numbers can be used to autofill a good chuck of a collection. I personally use this app for manga, and while its only correct about 70% of the time, its still easy to update and change as you see fit. The 'add to core' option makes me feel like im actually helping out the app, so i add data whenever i can. Keep up the good work guys!",
    "response": null
  }, ... other reviews
  {
    "title": "Originalbigguy",
    "avatar": "https://play-lh.googleusercontent.com/a/ALm5wu3dYTOHvlG8SUqgyTbRnjv9I49JtxgySY-RwTJU=mo",
    "rating": 1.0,
    "likes": 0,
    "date": "April 09, 2021",
    "snippet": "Not free",
    "response": {
      "title": "Collectorz.com",
      "snippet": "The app is never advertised as free anywhere. The app information clearly states this is a paid subscription app.",
      "date": "April 10, 2021"
    }
  }
]
