โ—โ—โ— This blog post was written for an old Google Play page design. The code is currently broken due to a redesign of the Google Play website, we're currently working on a fix.

What will be scraped

image

Preparation

First, we need to create a Node.js project and add npm packages cheerio to parse parts of the HTML markup, and axios to make a request to a website. To do this, in the directory with our project, open the command line and enter npm init -y, and then npm i cheerio axios.

Process

SelectorGadget Chrome extension was used to grab CSS selectors.
The Gif below illustrates the approach of selecting different parts of the results.

how-1

Full code

const cheerio = require("cheerio");
const axios = require("axios");

const AXIOS_OPTIONS = {
    headers: {
        "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36",
    },                                                  // adding the User-Agent header as one way to prevent the request from being blocked
    params: {
        hl: 'en',                                       // Parameter defines the language to use for the Google search
        gl: 'us'                                        // parameter defines the country to use for the Google search
    },
};

function getMainPageInfo() {
    return axios
        .get(`https://play.google.com/store/apps`, AXIOS_OPTIONS)
        .then(function ({ data }) {
            let $ = cheerio.load(data);

            const mainPageInfo = Array.from($('.Ktdaqe')).reduce((result, block) => {
                const categoryTitle = $(block).find('.sv0AUd').text().trim()
                const apps = Array.from($(block).find('.WHE7ib')).map((app) => {
                    return {
                        title: $(app).find('.WsMG1c').text().trim(),
                        developer: $(app).find('.b8cIId .KoLSrc').text().trim(),
                        link: `https://play.google.com${$(app).find('.b8cIId a').attr('href')}`,
                        rating: parseFloat($(app).find('.pf5lIe > div').attr('aria-label').slice(6, 9)),
                    }
                })
                return {
                    ...result, [categoryTitle]: apps
                }

            }, {})

            return mainPageInfo;
        });
}

getMainPageInfo().then(console.log)

Code explanation

Declare constants from required libraries:

const cheerio = require("cheerio");
const axios = require("axios");
Code Explanation
cheerio library for parsing the html page and access the necessary selectors
axios library for requesting the desired html document

Next, we write down the necessary parameters for making a request:

const AXIOS_OPTIONS = {
    headers: {
        "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36",
    },
    params: {
        hl: 'en',
        gl: 'us'
    },
};
Code Explanation
headers HTTP headers let the client and the server pass additional information with an HTTP request or response
User-Agent is used to act as a "real" user visit. Default axios requests user-agent is axios/0.27.2 so websites understand that it's a script that sends a request and might block it. Check what's your user-agent.
hl parameter defines the language to use for the Google search
gl parameter defines the country to use for the Google search

And finally a function to get the necessary information:

function getMainPageInfo() {
    return axios
        .get(`https://play.google.com/store/apps`, AXIOS_OPTIONS)
        .then(function ({ data }) {
            let $ = cheerio.load(data);

            const mainPageInfo = Array.from($('.Ktdaqe')).reduce((result, block) => {
                const categoryTitle = $(block).find('.sv0AUd').text().trim()
                const apps = Array.from($(block).find('.WHE7ib')).map((app) => {
                    return {
                        title: $(app).find('.WsMG1c').text().trim(),
                        developer: $(app).find('.b8cIId .KoLSrc').text().trim(),
                        link: `https://play.google.com${$(app).find('.b8cIId a').attr('href')}`,
                        rating: parseFloat($(app).find('.pf5lIe > div').attr('aria-label').slice(6, 9)),
                    }
                })
                return {
                    ...result, [categoryTitle]: apps
                }

            }, {})

            return mainPageInfo;
        });
}
Code Explanation
function ({ data }) we received the response from axios request that have data key that we destructured (this entry is equal to function (response) and in the next line cheerio.load(response.data))
mainPageInfo an object with categories arrays that contains info about apps from page
apps an array that contains all displayed apps in current category
.attr('href') gets the href attribute value of the html element
$(block).find('.sv0AUd') finds element with class name sv0AUd in all child elements and their children of block html element
.text() gets the raw text of html element
.trim() removes whitespace from both ends of a string
{...result, [categoryTitle]: apps} in this code we use spread syntax to create an object from result that was returned from previous reduce call and add to this object new item with key categoryTitle and value apps

Now we can launch our parser. To do this enter node YOUR_FILE_NAME in your command line. Where YOUR_FILE_NAME is the name of your .js file.

Output:

{
   "Popular apps & games":[
      {
         "title":"Netflix",
         "developer":"Netflix, Inc.",
         "link":"https://play.google.com/store/apps/details?id=com.netflix.mediaclient",
         "rating":4.5
      },
      {
         "title":"TikTok",
         "developer":"TikTok Pte. Ltd.",
         "link":"https://play.google.com/store/apps/details?id=com.zhiliaoapp.musically",
         "rating":4.5
      },
      {
         "title":"Instagram",
         "developer":"Instagram",
         "link":"https://play.google.com/store/apps/details?id=com.instagram.android",
         "rating":4
      }
      ... and other results
   ]
}

Google Play Store API

Alternatively, you can use the Google Play Store API from SerpApi. SerpApi is a free API with 100 search per month. If you need more searches, there are paid plans.

The difference is that all that needs to be done is just to iterate over a ready-made, structured JSON instead of coding everything from scratch maintaining, figuring out how to bypass blocks from Google, and selecting correct selectors which could be time-consuming at times. Check out the playground.

First we need to install google-search-results-nodejs. To do this you need to enter in your console: npm i google-search-results-nodejs

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY);     //your API key from serpapi.com

const params = {
  engine: "google_play",                                // search engine
  gl: "us",                                             // parameter defines the country to use for the Google search
  hl: "en",                                             // parameter defines the language to use for the Google search
  store: "apps"                                         // parameter defines the type of Google Play store
};

const getMainPageInfo = function ({ organic_results }) {
  return organic_results.reduce((result, category) => {
    const { title: categoryTitle, items } = category;
    const apps = items.map((app) => {
      const { title, link, rating, extansion } = app
      return {
        title,
        developer: extansion.name,
        link,
        rating,
      }
    })
    return {
      ...result, [categoryTitle]: apps
    }
  }, {})
};

const getJson = (params) => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  })
}

getJson(params).then(getMainPageInfo).then(console.log)

Declare constants from required libraries:

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(API_KEY);
Code Explanation
SerpApi SerpApi Node.js library
search new instance of GoogleSearch class
API_KEY your API key from SerpApi

Next, we write down the necessary parameters for making a request:

const params = {
  engine: "google_play",
  gl: "us",
  hl: "en",
  store: "apps"
};
Code Explanation
engine search engine
gl parameter defines the country to use for the Google search
hl parameter defines the language to use for the Google search
store parameter defines the type of Google Play store

Next, we write a callback function in which we describe what data we need from the result of our request:

const getMainPageInfo = function ({ organic_results }) {
  return organic_results.reduce((result, category) => {
    const { title: categoryTitle, items } = category;
    const apps = items.map((app) => {
      const { title, link, rating, extansion } = app
      return {
        title,
        developer: extansion.name,
        link,
        rating,
      }
    })
    return {
      ...result, [categoryTitle]: apps
    }
  }, {})
};
Code Explanation
organic_results an array that we destructured from response
title, items other data that we destructured from element of organic_results array
title: categoryTitle we redefine destructured data title to new categoryTitle
title, link, rating, extansion other data that we destructured from element of items array
{...result, [categoryTitle]: apps} in this code we use spread syntax to create an object from result that was returned from previous reduce call and add to this object new item with key categoryTitle and value apps

Next, we wrap the search method from the SerpApi library in a promise to further work with the search results and run it:

const getJson = (params) => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  })
}

getJson(params).then(getNewsData).then(console.log)

Output:

{
   "Popular apps & games":[
      {
         "title":"Netflix",
         "developer":"Netflix, Inc.",
         "link":"https://play.google.com/store/apps/details?id=com.netflix.mediaclient",
         "rating":4.5
      },
      {
         "title":"TikTok",
         "developer":"TikTok Pte. Ltd.",
         "link":"https://play.google.com/store/apps/details?id=com.zhiliaoapp.musically",
         "rating":4.5
      },
      {
         "title":"Instagram",
         "developer":"Instagram",
         "link":"https://play.google.com/store/apps/details?id=com.instagram.android",
         "rating":4
      },
      ... and other results
   ]
}

If you want to see some project made with SerpApi, please write me a message.


Join us on Twitter | YouTube

Add a Feature Request๐Ÿ’ซ or a Bug๐Ÿž