Scrape Google Play Store App in Python

Intro

You can use the official Google Play Developer API, which has a default limit of 200,000 requests per day and 60 requests per hour for retrieving the list of reviews and individual reviews, which is roughly 1 request per minute.
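For reference, a minimal sketch of what the official route could look like with the google-api-python-client library. The package name and service account file name are placeholders, and keep in mind the official API only returns reviews for apps you manage in your own Play Console account:

# pip install google-api-python-client google-auth
from googleapiclient.discovery import build
from google.oauth2 import service_account

# placeholder path to a service account key that has Play Console access
credentials = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/androidpublisher"],
)

android_publisher = build("androidpublisher", "v3", credentials=credentials)

# list recent reviews for an app you manage (placeholder package name)
response = android_publisher.reviews().list(packageName="com.example.yourapp").execute()

for review in response.get("reviews", []):
    print(review.get("reviewId"))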

You can also use a complete third-party Google Play Store App scraping solution such as the Python google-play-scraper (no external dependencies) or the JavaScript google-play-scraper. Third-party solutions are usually used to get around the quota limit.
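For example, a quick sketch with the Python google-play-scraper library (double-check the library's README for the current set of parameters):

# pip install google-play-scraper
from google_play_scraper import Sort, app, reviews

# full app details as a Python dict
app_details = app("com.nintendo.zara", lang="en", country="us")
print(app_details["title"], app_details["score"])

# first batch of reviews plus a continuation token for fetching the next batch
result, continuation_token = reviews(
    "com.nintendo.zara",
    lang="en",
    country="us",
    sort=Sort.NEWEST,
    count=100,
)
print(len(result))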

You don't really need to read this post unless you want a step-by-step explanation of how to do it without browser automation such as playwright or selenium, since you can look directly at the Python google-play-scraper regex solution to see how it scrapes app results and how it scrapes review results.

This ongoing blog post is meant to give an idea and actual step-by-step examples of how to scrape a Google Play Store App page using beautifulsoup and regular expressions, so you can build something of your own.

What will be scraped

Prerequisites

Separate virtual environment

In short, it's a thing that creates an independent set of installed libraries, including different Python versions, that can coexist with each other on the same system, thus preventing library or Python version conflicts.

If you haven't worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post to get familiar.

📌Note: this is not a strict requirement for this blog post.

Install libraries:

pip install requests lxml beautifulsoup4 google-search-results

Reduce the chance of being blocked

There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web-scraping; there are eleven methods to bypass blocks from most websites. Only the user-agent header, which is the easiest method, is covered in this blog post.

Using Google Play Product API from SerpApi

The following section is a comparison between the DIY solution and an API solution. SerpApi also extracts data without browser automation, including extraction of all reviews.

The biggest difference is that SerpApi bypasses blocks from Google. It removes the need to figure out how to use proxies and solve CAPTCHAs, or which providers are good, and there's no need to maintain the parser when Google Play updates its markup again.

Two examples of extracting certain app info and all reviews using SerpApi pagination:

from serpapi import GoogleSearch
from urllib.parse import (parse_qsl, urlsplit)
import json

params = {
    'api_key': '...',                        # https://serpapi.com/manage-api-key
    "engine": "google_play_product",         # parsing engine
    "store": "apps",                         # app page
    "gl": "us",                              # country of the search
    "product_id": "com.MapstarAG.Mapstar3D", # low review count example to show it exits the while loop
    "all_reviews": "true"                    # shows all reviews
}

search = GoogleSearch(params)                # where data extraction happens


def serpapi_scrape_google_play_app_data():
    results = search.get_dict()

    print(json.dumps(results["product_info"], indent=2, ensure_ascii=False))
    print(json.dumps(results["media"], indent=2, ensure_ascii=False))
    # other data

    
def serpapi_scrape_google_play_app_reviews():
    # to show the page number
    page_num = 0

    # iterate over all pages
    while True:
        results = search.get_dict()              # JSON -> Python dict
    
        if "error" in results:
            print(results["error"])
            break
    
        page_num += 1
        print(f"Current page: {page_num}")
    
        # iterate over organic results and extract the data
        for result in results.get("reviews", []):
            print(result.get("title"), result.get("date"), sep="\n")
    
        # check if the next page key is present in the JSON
        # if present -> split URL in parts and update to the next page
        if "next" in results.get("serpapi_pagination", {}):
            search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next")).query)))
        else:
            break

Output

{
  "basic_info": {
    "developer": {
      "name": "Nintendo Co., Ltd.",
      "url": "https://supermariorun.com/",
      "email": "supermariorun-support@nintendo.co.jp"
    },
    "downloads_info": {
      "long_form_not_formatted": "100,000,000+",
      "long_form_formatted": "100000000",
      "as_displayed_short_form": "100M+",
      "actual_downloads": "211560819"
    },
    "name": "Super Mario Run",
    "type": "SoftwareApplication",
    "url": "https://play.google.com/store/apps/details/Super_Mario_Run?id=com.nintendo.zara&hl=en_GB&gl=US",
    "description": "Control Mario with just a tap!",
    "application_category": "GAME_ACTION",
    "operating_system": "ANDROID",
    "thumbnail": "https://play-lh.googleusercontent.com/3ZKfMRp_QrdN-LzsZTbXdXBH-LS1iykSg9ikNq_8T2ppc92ltNbFxS-tORxw2-6kGA",
    "content_rating": "Everyone",
    "rating": 4.0,
    "reviews": "1643139",
    "price": "0",
    "release_date": "22 Mar 2017",
    "images": [
      "https://play-lh.googleusercontent.com/yT8ZCQHNB_MGT9Oc6mC5_mQS5vZ-5A4fvKQHHOl9NBy8yWGbM5-EFG_uISOXmypBYQ6G",
      "https://play-lh.googleusercontent.com/AvRrlEpV8TCryInAnA__FcXqDu5d3i-XrUp8acW2LNmzkU-rFXkAKgmJPA_4AHbNjyY",
      "https://play-lh.googleusercontent.com/AESbAa4QFa9-lVJY0vmAWyq2GXysv5VYtpPuDizOQn40jS9Z_ji8HXHA5hnOIzaf_w",
      "https://play-lh.googleusercontent.com/KOCWy63UI2p7Fc65_X5gnIHsErEt7gpuKoD-KcvpGfRSHp-4k8YBGyPPopnrNQpdiQ",
      "https://play-lh.googleusercontent.com/iDJagD2rKMJ92hNUi5WS2S_mQ6IrKkz6-G8c_zHNU9Ck8XMrZZP-1S_KkDsA6KDJ9No",
      "https://play-lh.googleusercontent.com/QsdO8Pn6qxvfAi4es7uicI-xB21dPN3s8SBfmnuXPjFftdXCuugxis7CDJbAkQ_pzA",
      "https://play-lh.googleusercontent.com/oEIUG3KTnijbe5TH3HO3NMAF5Ai8LkIAtKOO__TduDq4wOzGQA2PzZlBJg2C4mURDR8",
      "https://play-lh.googleusercontent.com/BErkwcIVa4ldoVL56EvGWTQJ2nPu-Y6EFeAS4dfK7l0CufebWdrRC9CduHqNwysPYf8",
      "https://play-lh.googleusercontent.com/cw86ny78mbNHVRDLlhw1fxVbZxiYFC7yYDRY3Nt2dnRGihRhxo1eOy4IjrSVVzKW9Is",
      "https://play-lh.googleusercontent.com/Kx0gmRSH582Te-BeTo-C87f3hl-2sf7DRaWso3qZ46p9PZ97socE6FuK09vzebVF8AA",
      "https://play-lh.googleusercontent.com/OJhOUUZjTUw4e3EEbPlZnuKdmUIGdLSSwUgb5ygPfiO0h1SeHIl3s_L7R8xBDLVnjPU",
      "https://play-lh.googleusercontent.com/Z0Ggjrocxk7SRTAhFCL6ZEc04eCAdI09Xf08Th7dfn_ViIBrK7E8Bd1p3Lfi-pjiLLWz",
      "https://play-lh.googleusercontent.com/pn58u5DpcUNOgE4NOQc4jFJaFyR3EaiO0YWlekYdQmBV3Q6jrF_ioX78gbtH2eZTTA",
      "https://play-lh.googleusercontent.com/EItdRRArK4yI7LPArgKOhwTrcALMSFS41F49dOuX6c8a7XPw20WNfSiDrE7ZnIbTRME",
      "https://play-lh.googleusercontent.com/xDFJgEfAPeGcfk72Nfe9jE-7oDyMDYtucW4W0mYh3vV8YgMb2T91BQ1do1r_8fU-Sw",
      "https://play-lh.googleusercontent.com/Bn6SFuIjgL8CLHTB6C7t_Dv7MCGwAxh8OIV7z-gKhNpJtxss2Vqwl_50HdHFUyoet7s",
      "https://play-lh.googleusercontent.com/eEKSdZPf7yo-WWcb9tGLQ-O17XVbd02rGREHwWC79JDOgVZFTaWmi0s1vg2H4Mn51hI",
      "https://play-lh.googleusercontent.com/vlOYHPoi3AwQuAEAuWi1pu37cnxObDelQ5xQQP3ojAmptiJbBereG8Ugvlp_vihDS9c",
      "https://play-lh.googleusercontent.com/2PuQ1L2sE0opnEG9AywzAzNBIV0sZo1y1ftrJ518oPwgjtUJ6iUrKskgn8DWRClFQnM",
      "https://play-lh.googleusercontent.com/TvcAspZw7Tc1CQV3DJrzPL_I4sACQhvNhDqB90r9yiYfAnPOUk8gi1fFcT1NdAsKG_l-",
      "https://play-lh.googleusercontent.com/vpt0r-PxWy2ea8xvuPSg0cn3iNXrS1v6pCFzWSPOane0lkDcfIGoSTvhiFz_en4CePI",
      "https://play-lh.googleusercontent.com/3ZKfMRp_QrdN-LzsZTbXdXBH-LS1iykSg9ikNq_8T2ppc92ltNbFxS-tORxw2-6kGA",
      "https://play-lh.googleusercontent.com/iTZtyWYr4T-slu1nifgRqEhtMLmxcNagc2rDAyiWntDQWCVLlGR7rDvx0uK6z-zLujwv",
      "https://play-lh.googleusercontent.com/iTZtyWYr4T-slu1nifgRqEhtMLmxcNagc2rDAyiWntDQWCVLlGR7rDvx0uK6z-zLujwv"
    ],
    "video_trailer": "https://play-games.googleusercontent.com/vp/mp4/1280x720/qjHSn4GwQWY.mp4"
  },
  "user_comments": [
    {
      "user_avatar": "https://play-lh.googleusercontent.com/EGemoI2NTXmTsBVtJqk8jxF9rh8ApRWfsIMQSt2uE4OcpQqbFu7f7NbTK05lx80nuSijCz7sc3a277R67g",
      "user_rating": "3",
      "user_comment": "Now, while I love the Mario Series, I will say that I am not the biggest fan of this game. When playing Remix 10, I found that the screen lagged for seemingly no reason, which threw me off plenty of times. The level design also seems pretty bland and just the same old settings you see over and over again. Overall I feel like this was just another cash grab from Nintendo, not to mention you actually need to PAY to unlock the rest of the game. But other than that, it looks decent graphic-wise."
    }, ... other comments
    {
      "user_avatar": "https://play-lh.googleusercontent.com/EGemoI2NTXmTsBVtJqk8jxF9rh8ApRWfsIMQSt2uE4OcpQqbFu7f7NbTK05lx80nuSijCz7sc3a277R67g",
      "user_rating": "2",
      "user_comment": "Too many tutorials that dont even let you play until 5 minutes of tapping the screen. Then after only a few levels you have to pay for the rest of them. Nintendo makes so much money you\\'d think they could make a game that allowed you to pay to remove ads, not pay to play the game you installed in the first place. But when you aren\\'t being forcefed tutorials for a game you won\\'t play that long anyway, the gameplay is actually pretty fun and challenging. Those are the only pros."
    }
  ]
}

DIY Code

from bs4 import BeautifulSoup
import requests, lxml, re, json

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}

# https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls
params = {
    "id": "com.nintendo.zara",     # app name
    "gl": "US",                    # country of the search
    "hl": "en_GB"                  # language of the search
}


def google_store_app_data():
    html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    # where all app data will be stored
    app_data = {
        "basic_info":{
            "developer":{},
            "downloads_info": {}
        },
        "user_comments": []
    }
    
    # <script> at index [11] contains the basic app information
    # https://regex101.com/r/zOMOfo/1
    basic_app_info = json.loads(re.findall(r"<script nonce=\"\w+\" type=\"application/ld\+json\">({.*?)</script>", str(soup.select("script")[11]), re.DOTALL)[0])
     
    # https://regex101.com/r/6Reb0M/1
    additional_basic_info =  re.search(fr"<script nonce=\"\w+\">AF_initDataCallback\(.*?(\"{basic_app_info.get('name')}\".*?)\);<\/script>", 
            str(soup.select("script")), re.M|re.DOTALL).group(1)
    
    app_data["basic_info"]["name"] = basic_app_info.get("name")
    app_data["basic_info"]["type"] = basic_app_info.get("@type")
    app_data["basic_info"]["url"] = basic_app_info.get("url")
    app_data["basic_info"]["description"] = basic_app_info.get("description").replace("\n", "")  # replace new line character to nothing
    app_data["basic_info"]["application_category"] = basic_app_info.get("applicationCategory")
    app_data["basic_info"]["operating_system"] = basic_app_info.get("operatingSystem")
    app_data["basic_info"]["thumbnail"] = basic_app_info.get("image")
    app_data["basic_info"]["content_rating"] = basic_app_info.get("contentRating")
    app_data["basic_info"]["rating"] = round(float(basic_app_info.get("aggregateRating").get("ratingValue")), 1)  # 4.287856 -> 4.3
    app_data["basic_info"]["reviews"] = basic_app_info.get("aggregateRating").get("ratingCount")
    app_data["basic_info"]["reviews"] = basic_app_info.get("aggregateRating").get("ratingCount")
    app_data["basic_info"]["price"] = basic_app_info["offers"][0]["price"]
    
    app_data["basic_info"]["developer"]["name"] = basic_app_info.get("author").get("name")
    app_data["basic_info"]["developer"]["url"] = basic_app_info.get("author").get("url")
    
    # https://regex101.com/r/C1WnuO/1
    app_data["basic_info"]["developer"]["email"] = re.search(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", additional_basic_info).group(0)
    
    # https://regex101.com/r/Y2mWEX/1 (a few matches, but re.search always returns the first occurrence)
    app_data["basic_info"]["release_date"] = re.search(r"\d{1,2}\s[A-Z-a-z]{3}\s\d{4}", additional_basic_info).group(0)
    
    # https://regex101.com/r/7yxDJM/1
    app_data["basic_info"]["downloads_info"]["long_form_not_formatted"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(1)
    app_data["basic_info"]["downloads_info"]["long_form_formatted"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(2)
    app_data["basic_info"]["downloads_info"]["as_displayed_short_form"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(4)
    app_data["basic_info"]["downloads_info"]["actual_downloads"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(3)
    
    # https://regex101.com/r/jjsdUP/1
    # [2:] skips 2 PEGI logo thumbnails and extracts only app images 
    app_data["basic_info"]["images"] = re.findall(r",\[\d{3,4},\d{3,4}\],.*?(https.*?)\"", additional_basic_info)[2:]
    
    try:
        # https://regex101.com/r/C1WnuO/1
        app_data["basic_info"]["video_trailer"] = "".join(re.findall(r"\"(https:\/\/play-games\.\w+\.com\/vp\/mp4\/\d+x\d+\/\S+\.mp4)\"", additional_basic_info)[0])
    except IndexError:
        app_data["basic_info"]["video_trailer"] = None
    
    
    # User reviews
    # https://regex101.com/r/xDVZq7/1
    user_reviews = re.findall(r'Write a short review.*?<script nonce="\w+">AF_initDataCallback\({key:.*data:\[\[\[\"\w.*?\",(.*?)sideChannel: {}}\);<\/script>',
                                       str(soup.select("script")), re.DOTALL)
    
    # https://regex101.com/r/D6BIBP/1
    # [::3] grabs every 3rd match since each avatar URL appears several times (avoids duplicates)
    avatars = re.findall(r",\"(https:.*?)\"\].*?\d{1}", str(user_reviews))[::3]
    
    # https://regex101.com/r/18EziQ/1
    ratings = re.findall(r"https:.*?\],(\d{1})", str(user_reviews))
    
    # https://regex101.com/r/mSku7n/1
    comments = re.findall(r"https:.*?\],\d{1}.*?\"(.*?)\",\[\d+,\d+\]", str(user_reviews))
    
    for comment, rating, avatar in zip(comments, ratings, avatars):
        app_data["user_comments"].append({
            "user_avatar": avatar,
            "user_rating": rating,
            "user_comment": comment
        })


    print(json.dumps(app_data, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    # https://stackoverflow.com/a/17533149/15164646
    # reruns script if `basic_app_info` or `additional_basic_info` throws an exception due to <script> position change
    while True: 
        try:
            google_store_app_data()
        except:
            pass
        else:
            break

Code explanation

Import libraries:

from bs4 import BeautifulSoup
import requests, lxml, re, json
  • BeautifulSoup, lxml to parse HTML.
  • requests to make a request to a website.
  • re to match parts of the HTML where needed data is located via regular expression.
  • json to convert parsed data from JSON to Python dictionary, and for pretty printing.

Create global request headers, and search query params:

# user-agent headers to act as a "real" user visit
# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}

# https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls
# search query parameters
params = {
    "id": "com.nintendo.zara",     # app name
    "gl": "US",                    # country of the search
    "hl": "en_GB"                  # language of the search
}
  • user-agent is used to pretend that the request comes from a real user in an actual browser, so websites assume that it's not a bot sending the request. Make sure your user-agent is up to date.

Pass params, headers to a request:

html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
  • timeout argument tells requests to stop waiting for a response after 30 seconds. An optional error-handling sketch follows below.
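Optionally (this is not part of the final script), the same request could be wrapped with explicit error handling, reusing the params and headers defined above:

import requests

try:
    html = requests.get("https://play.google.com/store/apps/details",
                        params=params, headers=headers, timeout=30)
    html.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx responses
except requests.exceptions.Timeout:
    print("The request took longer than 30 seconds.")
except requests.exceptions.HTTPError as error:
    print(f"Bad response: {error}")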

Create a BeautifulSoup object from the returned HTML and pass it the HTML parser, which in this case is lxml:

soup = BeautifulSoup(html.text, "lxml")

Create a dict() to store the extracted app data. Here I'm defining the overall structure of the data and how it could be organized:

app_data = {
    "basic_info":{
        "developer":{},
        "downloads_info": {}
    },
    "user_comments": []
}

App basic info

Match basic and additional app information via regular expression:

# <script> at index [11] contains the basic app information
# https://regex101.com/r/zOMOfo/1
basic_app_info = json.loads(re.findall(r"<script nonce=\"\w+\" type=\"application/ld\+json\">({.*?)</script>", 
                                           str(soup.select("script")[11]), re.DOTALL)[0])

# https://regex101.com/r/6Reb0M/1
additional_basic_info =  re.search(fr"<script nonce=\"\w+\">AF_initDataCallback\(.*?(\"{basic_app_info.get('name')}\".*?)\);<\/script>", 
        str(soup.select("script")), re.M|re.DOTALL).group(1)                                          
  • re.findall() will find all matched patterns in the HTML. Follow the commented link (or see the toy example after this list) to better understand what the regular expression matches.
  • \w+ matches one or more word characters (letters, digits or underscores).
  • (...) is a regex capture group, and .*? is a lazy pattern that matches as little as possible, so the group captures everything up to the next part of the pattern.
  • str(soup.select("script")[11]) is the second re.findall() argument which:
  • tells soup to grab all found <script> tags,
  • then grabs only the [11] index from the returned <script> tags,
  • converts it to a string so the re module can process it.
  • re.DOTALL tells re that . should also match newlines.
  • re.M is an alias for re.MULTILINE. It makes ^ and $ match at the start and end of each line rather than only at the start and end of the whole string.
  • re.findall()[0] accesses the first element of the returned list of matches, which is the only match in this case, and turns the result from a list into a str.
  • json.loads() will convert (deserialize) parsed JSON to Python dictionary.
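Here's the toy example mentioned above: the same ld+json pattern applied to a made-up, simplified <script> tag so you can see exactly what the capture group returns:

import re, json

# a made-up, simplified <script> tag purely for demonstration
html_snippet = '<script nonce="abc123" type="application/ld+json">{"name": "Super Mario Run"}</script>'

# ({.*?) lazily captures everything from the opening brace up to </script>
matches = re.findall(r"<script nonce=\"\w+\" type=\"application/ld\+json\">({.*?)</script>", html_snippet, re.DOTALL)

print(matches)                          # ['{"name": "Super Mario Run"}']
print(json.loads(matches[0])["name"])   # Super Mario Run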

Access the parsed JSON (now converted to a Python dictionary) from the basic_app_info variable:

app_data["basic_info"]["name"] = basic_app_info.get("name")
app_data["basic_info"]["type"] = basic_app_info.get("@type")
app_data["basic_info"]["url"] = basic_app_info.get("url")
app_data["basic_info"]["description"] = basic_app_info.get("description").replace("\n", "")  # replace new line character to nothing
app_data["basic_info"]["application_category"] = basic_app_info.get("applicationCategory")
app_data["basic_info"]["operating_system"] = basic_app_info.get("operatingSystem")
app_data["basic_info"]["thumbnail"] = basic_app_info.get("image")
app_data["basic_info"]["content_rating"] = basic_app_info.get("contentRating")
app_data["basic_info"]["rating"] = round(float(basic_app_info.get("aggregateRating").get("ratingValue")), 1)  # 4.287856 -> 4.3
app_data["basic_info"]["reviews"] = basic_app_info.get("aggregateRating").get("ratingCount")
app_data["basic_info"]["reviews"] = basic_app_info.get("aggregateRating").get("ratingCount")
app_data["basic_info"]["price"] = basic_app_info["offers"][0]["price"]

app_data["basic_info"]["developer"]["name"] = basic_app_info.get("author").get("name")
app_data["basic_info"]["developer"]["url"] = basic_app_info.get("author").get("url")

The next step is extracting additional data, some of which doesn't show on the page, like the developer email:

# https://regex101.com/r/C1WnuO/1
app_data["basic_info"]["developer"]["email"] = re.search(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", additional_basic_info).group(0)

# https://regex101.com/r/Y2mWEX/1 (a few matches occur, but re.search always returns the first occurrence)
app_data["basic_info"]["release_date"] = re.search(r"\d{1,2}\s[A-Z-a-z]{3}\s\d{4}", additional_basic_info).group(0)

# https://regex101.com/r/7yxDJM/1
# using different groups to extract different data
app_data["basic_info"]["downloads_info"]["long_form_not_formatted"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(1)
app_data["basic_info"]["downloads_info"]["long_form_formatted"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(2)
app_data["basic_info"]["downloads_info"]["as_displayed_short_form"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(4)
app_data["basic_info"]["downloads_info"]["actual_downloads"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(3)

# ...

try:
    # https://regex101.com/r/C1WnuO/1
    app_data["basic_info"]["video_trailer"] = "".join(re.findall(r"\"(https:\/\/play-games\.\w+\.com\/vp\/mp4\/\d+x\d+\/\S+\.mp4)\"", additional_basic_info)[0])
except IndexError:
    app_data["basic_info"]["video_trailer"] = None

App images

App images are located in the inline JSON, from which we can extract them using a regular expression. The app description lives in the same place but is not currently being extracted:

# https://regex101.com/r/jjsdUP/1
# [2:] skips 2 PEGI logo thumbnails and extracts only app images 
app_data["basic_info"]["images"] = re.findall(r",\[\d{3,4},\d{3,4}\],.*?(https.*?)\"", additional_basic_info)[2:]

App comments

Match user comments data using regular expression:

# User reviews
# https://regex101.com/r/xDVZq7/1
user_reviews = re.findall(r'Write a short review.*?<script nonce="\w+">AF_initDataCallback\({key:.*data:\[\[\[\"\w.*?\",(.*?)sideChannel: {}}\);<\/script>',
                                    str(soup.select("script")), re.DOTALL)

The next step is to extract all avatars, ratings and the comments themselves using re.findall():

# https://regex101.com/r/D6BIBP/1
# [::3] grabs every 3rd match since each avatar URL appears several times (avoids duplicates)
avatars = re.findall(r",\"(https:.*?)\"\].*?\d{1}", str(user_reviews))[::3]

# https://regex101.com/r/18EziQ/1
ratings = re.findall(r"https:.*?\],(\d{1})", str(user_reviews))

# https://regex101.com/r/mSku7n/1
comments = re.findall(r"https:.*?\],\d{1}.*?\"(.*?)\",\[\d+,\d+\]", str(user_reviews))
  • \d{1} matches exactly one digit.

Finally, we need to iterate over multiple iterables (the extracted comments data) and append them to the dictionary:

for comment, rating, avatar in zip(comments, ratings, avatars):
    app_data["user_comments"].append({
        "user_avatar": avatar,
        "user_rating": rating,
        "user_comment": comment
    })
  • zip() takes multiple iterables, aggregates their elements into tuples and returns an iterator of those tuples (see the toy illustration after this list).
    In this case, the number of values is identical across all iterables, e.g. 40 avatars, 40 ratings and 40 comments.
  • append() appends an element to the end of the list.
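The toy illustration mentioned above, with made-up values:

# made-up values purely to illustrate zip()
comments = ["Great game", "Too many ads"]
ratings = ["5", "2"]
avatars = ["https://example.com/avatar-1.png", "https://example.com/avatar-2.png"]

# the 1st comment is paired with the 1st rating and 1st avatar, and so on
for comment, rating, avatar in zip(comments, ratings, avatars):
    print(rating, comment, avatar)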


Print the data:

print(json.dumps(app_data, indent=2, ensure_ascii=False))

The final step is to add the if __name__ == "__main__" boilerplate, which protects the code from running when the script is imported rather than run directly:

if __name__ == "__main__":
    # https://stackoverflow.com/a/17533149/15164646
    # reruns script if `basic_app_info` or `additional_basic_info` throws an exception due to <script> position change
    while True: 
        try:
            google_store_app_data()
        except:
            pass
        else:
            break

A while loop is used to rerun the script if an exception occurs. In this case, the exception is an IndexError raised from the basic_app_info variable, or an AttributeError from additional_basic_info when re.search() finds no match.

This error occurs because on each page load Google Play changes the position of <script> elements: sometimes the needed one is at index [11] (most often), sometimes at a different index. Rerunning the script works around the problem for now.

An obviously better approach would be a better regex, or a way of locating the needed <script> element without relying on its position; this will be added in the next update of this blog post.
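For example, a possible sketch (not what the script above currently does) is to select the ld+json <script> element by its type attribute so its index no longer matters:

from bs4 import BeautifulSoup
import requests, json

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}
params = {"id": "com.nintendo.zara", "gl": "US", "hl": "en_GB"}

html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

# pick the <script type="application/ld+json"> tag by its attribute instead of its position
ld_json = soup.select_one('script[type="application/ld+json"]')

if ld_json is not None:
    basic_app_info = json.loads(ld_json.get_text())
    print(basic_app_info.get("name"))  # Super Mario Run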

Join us on Twitter | YouTube