Web Scraping Google Play Apps with Nodejs
What will be scraped
📌Note: This blog post doesn't show how to scroll the page. Google Play uses infinite scroll, which is driven by JavaScript, so scrolling the page requires browser automation (e.g. Puppeteer), which is much slower.
Using Google Play Apps Store API from SerpApi
This section compares the DIY solution with our solution.
The biggest difference is that you don't need to create the parser from scratch and maintain it.
Google may also block the request at some point. We handle that on our backend, so there's no need to figure out how to get around blocks yourself or which CAPTCHA-solving or proxy provider to use.
First, we need to install google-search-results-nodejs:
npm i google-search-results-nodejs
Here's the full code example, if you don't need an explanation:
const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY); // your API key from serpapi.com

const params = {
  engine: "google_play", // search engine
  gl: "us", // parameter defines the country to use for the Google search
  hl: "en", // parameter defines the language to use for the Google search
  store: "apps", // parameter defines the type of Google Play store
};

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};

const getResults = async () => {
  const json = await getJson();
  const appsResults = json.organic_results.reduce((result, category) => {
    const { title: categoryTitle, items } = category;
    const apps = items.map((app) => {
      const { title, link, rating, thumbnail, product_id } = app;
      return {
        title,
        link,
        rating,
        thumbnail,
        appId: product_id,
      };
    });
    return {
      ...result,
      [categoryTitle]: apps,
    };
  }, {});
  return appsResults;
};

getResults().then((result) => console.dir(result, { depth: null }));
Code explanation
First, we need to import SerpApi from the google-search-results-nodejs library and create a new search instance with your API key from SerpApi:
const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY); // your API key from serpapi.com
Next, we write the necessary parameters for making a request:
const params = {
  engine: "google_play", // search engine
  gl: "us", // parameter defines the country to use for the Google search
  hl: "en", // parameter defines the language to use for the Google search
  store: "apps", // parameter defines the type of Google Play store
};
Next, we wrap the search method from the SerpApi library in a promise to further work with the search results:
const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};
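The wrapper above never rejects, so a callback that never fires or an unexpected payload would only show up later as a hang or a TypeError. Below is a minimal sketch of a stricter wrapper. The names getJsonSafe, timeoutMs, okClient and emptyClient are hypothetical and used only for illustration; the one thing assumed about the real client is the json(params, callback) shape shown above. The stub clients stand in for the search instance so the sketch runs without an API key.

```javascript
// Hypothetical helper (not part of google-search-results-nodejs):
// wraps any client exposing json(params, callback) in a promise that
// rejects on timeout or when the payload has no organic_results array.
const getJsonSafe = (client, params, timeoutMs = 10000) =>
  new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`No response within ${timeoutMs} ms`)),
      timeoutMs
    );
    client.json(params, (data) => {
      clearTimeout(timer);
      if (data && Array.isArray(data.organic_results)) resolve(data);
      else reject(new Error("Response has no organic_results array"));
    });
  });

// Stub clients standing in for the real `search` instance (assumption:
// they only imitate the callback shape, not the real response content).
const okClient = { json: (_params, cb) => cb({ organic_results: [] }) };
const emptyClient = { json: (_params, cb) => cb({}) };

getJsonSafe(okClient, {}).then((json) => console.log(json.organic_results.length));
getJsonSafe(emptyClient, {}).catch((err) => console.log(err.message));
```

With the real client, the call would look like getJsonSafe(search, params), and getResults could await it inside a try/catch.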
And finally, we declare the getResults function that gets data from the page and returns it:
const getResults = async () => {
  ...
};
In this function, we first get the json with results, then iterate over the organic_results array in the received json. To do this we use the reduce() method, which lets us build the results object. On each iteration step we return the previous step's result (using spread syntax) and add the new category under the name from the categoryTitle constant:
const json = await getJson();
const appsResults = json.organic_results.reduce((result, category) => {
  ...
  return {
    ...result,
    [categoryTitle]: apps,
  };
}, {});
return appsResults;
Next, we destructure the category element, rename title to the categoryTitle constant, and iterate over the items array to get all apps from this category. To do this we destructure each app element and return these constants:
const { title: categoryTitle, items } = category;
const apps = items.map((app) => {
  const { title, link, rating, thumbnail, product_id } = app;
  return {
    title,
    link,
    rating,
    thumbnail,
    appId: product_id,
  };
});
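The grouping logic can be tried without an API key by running it on a hand-written sample. The sketch below assumes only the fields the tutorial reads (title, items, link, rating, thumbnail, product_id). The sampleJson object, the groupByCategory name and the example.com thumbnail URLs are made up for illustration.

```javascript
// Hand-written sample that imitates the fields read from the response.
// Thumbnail URLs are placeholders, not real Google Play assets.
const sampleJson = {
  organic_results: [
    {
      title: "Popular apps",
      items: [
        {
          title: "WhatsApp Messenger",
          link: "https://play.google.com/store/apps/details?id=com.whatsapp",
          rating: 4.3,
          thumbnail: "https://example.com/whatsapp.png",
          product_id: "com.whatsapp",
        },
      ],
    },
    {
      title: "Recommended for you",
      items: [
        {
          title: "Gmail",
          link: "https://play.google.com/store/apps/details?id=com.google.android.gm",
          rating: 4.2,
          thumbnail: "https://example.com/gmail.png",
          product_id: "com.google.android.gm",
        },
      ],
    },
  ],
};

// Hypothetical name: the same reduce/map logic as getResults, minus the request.
const groupByCategory = (organicResults) =>
  organicResults.reduce((result, category) => {
    const { title: categoryTitle, items } = category;
    const apps = items.map(({ title, link, rating, thumbnail, product_id }) => ({
      title,
      link,
      rating,
      thumbnail,
      appId: product_id, // product_id is exposed under the name appId
    }));
    return { ...result, [categoryTitle]: apps };
  }, {});

const grouped = groupByCategory(sampleJson.organic_results);
console.log(Object.keys(grouped)); // category titles become the object keys
console.log(grouped["Popular apps"][0].appId);
```

Each category title becomes a key of the resulting object, and each app keeps only the five selected fields.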
After that, we run the getResults function and print all the received information in the console with the console.dir method, which accepts an options object that changes the default output options:
getResults().then((result) => console.dir(result, { depth: null }));
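To see what the depth option changes, here is a small sketch using util.inspect, which console.dir relies on for formatting in Node.js. The nested object is a made-up value, chosen to be deeper than the default inspection depth of 2.

```javascript
const util = require("util");

// Hypothetical nested value, deeper than the default inspection depth.
const nested = { category: { apps: [{ developer: { name: "Example Dev" } }] } };

const shallow = util.inspect(nested); // default depth: deeper levels are collapsed
const full = util.inspect(nested, { depth: null }); // no depth limit

console.log(shallow);
console.log(full);
```

The first line collapses the innermost objects into an [Object] placeholder, while the second prints the whole structure, which is why the tutorial passes { depth: null }.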
Output
{
   "Popular apps":[
      {
         "title":"WhatsApp Messenger",
         "link":"https://play.google.com/store/apps/details?id=com.whatsapp",
         "rating":4.3,
         "thumbnail":"https://play-lh.googleusercontent.com/bYtqbOcTYOlgc6gqZ2rwb8lptHuwlNE75zYJu6Bn076-hTmvd96HH-6v7S0YUAAJXoJN=s256-rw",
         "appId":"com.whatsapp"
      },
      ... and other results
   ],
   "Recommended for you":[
      {
         "title":"Gmail",
         "link":"https://play.google.com/store/apps/details?id=com.google.android.gm",
         "rating":4.2,
         "thumbnail":"https://play-lh.googleusercontent.com/KSuaRLiI_FlDP8cM4MzJ23ml3og5Hxb9AapaGTMZ2GgR103mvJ3AAnoOFz1yheeQBBI=s256-rw",
         "appId":"com.google.android.gm"
      },
      ... and other results
   ]
   ... and other categories
}
DIY Code
If you don't need an explanation, have a look at the full code example in the online IDE
const cheerio = require("cheerio");
const axios = require("axios");

const AXIOS_OPTIONS = {
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36",
  }, // adding the User-Agent header as one way to prevent the request from being blocked
  params: {
    hl: "en", // parameter defines the language to use for the Google search
    gl: "us", // parameter defines the country to use for the Google search
  },
};

function getMainPageInfo() {
  return axios.get(`https://play.google.com/store/apps`, AXIOS_OPTIONS).then(function ({ data }) {
    let $ = cheerio.load(data);
    const mainPageInfo = Array.from($(".oVnAB").closest("section")).reduce((result, block) => {
      const categoryTitle = $(block).find(".oVnAB").text().trim();
      if (categoryTitle !== "Top charts") {
        const apps = Array.from($(block).find(".ULeU3b")).map((app) => {
          const link = `https://play.google.com${$(app).find(".Si6A0c").attr("href")}`;
          const appId = link.slice(link.indexOf("?id=") + 4);
          return {
            title: $(app).find(".Epkrse").text().trim(),
            link,
            rating: parseFloat($(app).find(".vlGucd > div:first-child").attr("aria-label").slice(6, 9)),
            thumbnail: $(app).find(".TjRVLb img").attr("src"),
            appId,
          };
        });
        return {
          ...result,
          [categoryTitle]: apps,
        };
      }
      return result; // "Top charts" is skipped, but the categories collected so far are kept
    }, {});
    return mainPageInfo;
  });
}

getMainPageInfo().then((result) => console.dir(result, { depth: null }));
Preparation
First, we need to create a Node.js* project and add the npm packages cheerio, to parse parts of the HTML markup, and axios, to make a request to a website.
To do this, in the directory with our project, open the command line and enter npm init -y, and then npm i cheerio axios.
*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.
Process
First of all, we need to extract data from HTML elements. Getting the right CSS selectors is fairly easy with the SelectorGadget Chrome extension, which lets us grab CSS selectors by clicking on the desired element in the browser. However, it does not always work perfectly, especially when the website relies heavily on JavaScript.
We have a dedicated Web Scraping with CSS Selectors blog post at SerpApi if you want to know a little bit more about them.
The GIF below illustrates the approach of selecting different parts of the results.
Code explanation
Declare constants from the cheerio and axios libraries:
const cheerio = require("cheerio");
const axios = require("axios");
Next, we write the request options: HTTP headers with a User-Agent, which makes the request look like a visit from a "real" user (the default axios User-Agent is axios/<axios_version>, so a website can tell that a script sent the request and might block it; check what your user-agent is), and the necessary parameters for making a request:
const AXIOS_OPTIONS = {
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36",
  }, // adding the User-Agent header as one way to prevent the request from being blocked
  params: {
    hl: "en", // parameter defines the language to use for the Google search
    gl: "us", // parameter defines the country to use for the Google search
  },
};
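axios appends the params object to the request URL as a query string. The sketch below reproduces that with the built-in URL and URLSearchParams classes, so you can see the address that ends up being requested. It only approximates axios: its own serializer may encode unusual values differently, but for plain strings like these the result should match.

```javascript
// Same language and country parameters as in AXIOS_OPTIONS.
const params = { hl: "en", gl: "us" };

const url = new URL("https://play.google.com/store/apps");
url.search = new URLSearchParams(params).toString();

console.log(url.href); // https://play.google.com/store/apps?hl=en&gl=us
```

Opening that URL in a browser is a quick way to check what the page looks like for a given language and country before writing selectors.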
Next, we write a function that makes the request and returns the received data. The response from the axios request has a data key, which we destructure and parse with cheerio:
function getMainPageInfo() {
  return axios.get(`https://play.google.com/store/apps`, AXIOS_OPTIONS).then(function ({ data }) {
    let $ = cheerio.load(data);
    ...
  })
}
Next, we need to get all HTML elements matching the ".oVnAB" selector ($() method) and find the closest "section" among each element's ancestors. Then we build an array from the result with the Array.from() method and iterate over it with the reduce() method, which lets us build the results object:
const mainPageInfo = Array.from($(".oVnAB").closest("section")).reduce((result, block) => {
  ...
}, {});
return mainPageInfo;
And finally, we need to get the categoryTitle, and the title, link, rating, thumbnail and appId (we can cut it from link using the slice() and indexOf() methods) of each app from the selected category (using the $(), find(), attr() and text() methods).
On each iteration step we return the previous step's result (using spread syntax) and add the new category under the name from the categoryTitle constant:
📌Note: In this case, we skip the "Top charts" category because it is built on the page dynamically via JavaScript, and getting it requires browser automation (e.g. Puppeteer), which is much slower.
const categoryTitle = $(block).find(".oVnAB").text().trim();
if (categoryTitle !== "Top charts") {
  const apps = Array.from($(block).find(".ULeU3b")).map((app) => {
    const link = `https://play.google.com${$(app).find(".Si6A0c").attr("href")}`;
    const appId = link.slice(link.indexOf("?id=") + 4);
    return {
      title: $(app).find(".Epkrse").text().trim(),
      link,
      rating: parseFloat($(app).find(".vlGucd > div:first-child").attr("aria-label").slice(6, 9)),
      thumbnail: $(app).find(".TjRVLb img").attr("src"),
      appId,
    };
  });
  return {
    ...result,
    [categoryTitle]: apps,
  };
}
return result; // "Top charts" is skipped, but the categories collected so far are kept
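The slice() calls above depend on fixed character positions, so they break if the link gains another query parameter or an app has no rating label. Here is a sketch of two more forgiving helpers. The names extractAppId and parseRating are hypothetical and not part of the original parser. The label format "Rated 4.3 stars out of five stars" is an assumption inferred from slice(6, 9), not something confirmed against the live page.

```javascript
// Hypothetical helpers, not part of the original parser.

// Reads the id query parameter instead of slicing at a fixed offset,
// so extra parameters after the id do not leak into the result.
const extractAppId = (link) => new URL(link).searchParams.get("id");

// Pulls the first number out of a label such as
// "Rated 4.3 stars out of five stars" (assumed format).
// Returns null when the label is missing or contains no number.
const parseRating = (ariaLabel) => {
  const match = /\d+(\.\d+)?/.exec(ariaLabel ?? "");
  return match ? parseFloat(match[0]) : null;
};

console.log(extractAppId("https://play.google.com/store/apps/details?id=com.whatsapp&hl=en"));
console.log(parseRating("Rated 4.3 stars out of five stars"));
console.log(parseRating(undefined));
```

In the parser, these would be called as extractAppId(link) and parseRating($(app).find(".vlGucd > div:first-child").attr("aria-label")), so apps without a rating end up with null instead of throwing.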
Now we can launch our parser:
$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file
Output
{
   "Popular apps":[
      {
         "title":"WhatsApp Messenger",
         "link":"https://play.google.com/store/apps/details?id=com.whatsapp",
         "rating":4.3,
         "thumbnail":"https://play-lh.googleusercontent.com/bYtqbOcTYOlgc6gqZ2rwb8lptHuwlNE75zYJu6Bn076-hTmvd96HH-6v7S0YUAAJXoJN=s256-rw",
         "appId":"com.whatsapp"
      },
      ... and other results
   ],
   "Recommended for you":[
      {
         "title":"Google Earth",
         "link":"https://play.google.com/store/apps/details?id=com.google.earth",
         "rating":4.3,
         "thumbnail":"https://play-lh.googleusercontent.com/9ORDOmn8l9dh-j4Sg3_S7CLcy0RRAI_wWt5jZtJOPztwnEkQ4y7mmGgoSYqbFR5jTc3m=s256-rw",
         "appId":"com.google.earth"
      },
      ... and other results
   ]
   ... and other categories
}
Links
If you want other functionality added to this blog post (e.g. extracting additional categories) or if you want to see some projects made with SerpApi, write me a message.
Add a Feature Request💫 or a Bug🐞