Web Scraping Google Books Ngram Viewer with Node.js
Intro
Currently, we don't have an API that supports extracting data from the Google Books Ngram Viewer page.
This blog post shows how you can do it yourself with the DIY solution below while we work on releasing a proper API.
The solution is intended for personal use: it doesn't include the Legal US Shield that we offer with our paid Production and above plans, and it has limitations, such as the need to bypass blocks (for example, CAPTCHA).
You can track the progress of this API on our public roadmap.
What will be scraped
Comparison with the chart built from the scraped data:
Full code
const axios = require("axios");
const fs = require("fs");
const { ChartJSNodeCanvas } = require("chartjs-node-canvas");

const searchString = "Albert Einstein,Sherlock Holmes,Frankenstein,Steve Jobs,Taras Shevchenko,William Shakespeare"; // what we want to get
const startYear = 1800; // the start year of the search
const endYear = 2019; // the end year of the search

const AXIOS_OPTIONS = {
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36",
  }, // adding the User-Agent header as one way to prevent the request from being blocked
  params: {
    content: searchString, // what we want to search
    year_start: startYear, // parameter defines the start year of the search
    year_end: endYear, // parameter defines the end year of the search
  },
};

async function saveChart(chartData) {
  const width = 1920; // chart width in pixels
  const height = 1080; // chart height in pixels
  const backgroundColour = "white"; // Uses https://www.w3schools.com/tags/canvas_fillstyle.asp
  const chartJSNodeCanvas = new ChartJSNodeCanvas({ width, height, backgroundColour });
  const labels = new Array(endYear - startYear + 1).fill(startYear).map((el, i) => (el += i));
  const configuration = {
    type: "line", // for line chart
    data: {
      labels,
      datasets: chartData?.map((el) => {
        const data = el.timeseries.map((el) => el * 100);
        return {
          label: el.ngram,
          data,
          borderColor: [`rgb(${parseInt(Math.random() * 255)}, ${parseInt(Math.random() * 255)}, ${parseInt(Math.random() * 255)})`],
        };
      }),
    },
    options: {
      scales: {
        y: {
          title: {
            display: true,
            text: "%",
          },
        },
      },
    },
  };
  const base64Image = await chartJSNodeCanvas.renderToDataURL(configuration);
  const base64Data = base64Image.replace(/^data:image\/png;base64,/, "");
  fs.writeFile("chart.png", base64Data, "base64", function (err) {
    if (err) {
      console.log(err);
    }
  });
}

function getChart() {
  return axios.get(`https://books.google.com/ngrams/json`, AXIOS_OPTIONS).then(({ data }) => data);
}

getChart().then(saveChart);
Preparation
First, we need to create a Node.js* project and add the npm packages axios to make requests to a website, chart.js to build a chart from the received data, and chartjs-node-canvas to render the chart with Chart.js using canvas.
To do this, in the directory with our project, open the command line and enter:
$ npm init -y
And then:
$ npm i axios chart.js chartjs-node-canvas
*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.
Process
We'll receive the Books Ngram data in JSON format, so we only need to handle the received data and create our own chart (if needed):
Request:
axios.get(`https://books.google.com/ngrams/json`, AXIOS_OPTIONS).then(({ data }) => data);
Response JSON:
[
  {
    "ngram": "Albert Einstein",
    "parent": "",
    "type": "NGRAM",
    "timeseries": [
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9.077474010561153e-10, 9.077474010561153e-10, 9.077474010561153e-10,
      ...and other chart data
    ]
  },
  {
    "ngram": "Sherlock Holmes",
    "parent": "",
    "type": "NGRAM",
    "timeseries": [
      4.731798064483428e-9, 3.785438451586742e-9, 3.154532042988952e-9, 2.7038846082762446e-9, 0, 2.47730296593878e-10,
      ...and other chart data
    ]
  },
  ...and other Books Ngram data
]
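Because each entry carries its own timeseries array, the data can also be inspected without drawing a chart. Below is a minimal sketch that finds the peak year of each ngram. It assumes that the first timeseries value belongs to year_start and every following value to the next year (the same assumption the chart labels rely on). The names sampleResponse and peakYear are hypothetical, and the numbers are made up for illustration.

```javascript
// Made-up sample shaped like the response above (three years only).
const sampleResponse = [
  { ngram: "first phrase", parent: "", type: "NGRAM", timeseries: [1e-9, 4e-9, 2e-9] },
  { ngram: "second phrase", parent: "", type: "NGRAM", timeseries: [0, 0, 5e-10] },
];

// Assumption: timeseries[i] belongs to the year firstYear + i.
function peakYear(entry, firstYear) {
  let bestIndex = 0;
  entry.timeseries.forEach((value, i) => {
    if (value > entry.timeseries[bestIndex]) bestIndex = i;
  });
  return {
    ngram: entry.ngram,
    year: firstYear + bestIndex,
    value: entry.timeseries[bestIndex],
  };
}

const peaks = sampleResponse.map((entry) => peakYear(entry, 2000));
console.log(peaks);
```

With real data, the second argument would be the startYear used in the request.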
Code explanation
Declare constants from the axios, fs (the fs library allows you to work with the file system on your computer) and chartjs-node-canvas libraries:
const axios = require("axios");
const fs = require("fs");
const { ChartJSNodeCanvas } = require("chartjs-node-canvas");
Next, we define what we want to search for, along with the start year and the end year:
const searchString = "Albert Einstein,Sherlock Holmes,Frankenstein,Steve Jobs,Taras Shevchenko,William Shakespeare";
const startYear = 1800;
const endYear = 2019;
Next, we write the request options: HTTP headers with a User-Agent, which makes the request look like a "real" user visit, and the parameters needed for the request.
The default axios User-Agent is axios/<axios_version>, so websites can tell that the request comes from a script and might block it. Check what your user-agent is:
const AXIOS_OPTIONS = {
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36",
  }, // adding the User-Agent header as one way to prevent the request from being blocked
  params: {
    content: searchString, // what we want to search
    year_start: startYear, // parameter defines the start year of the search
    year_end: endYear, // parameter defines the end year of the search
  },
};
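To see roughly what these params turn into, here is a small sketch that builds the query string by hand with Node's built-in URL class instead of axios. It is only an illustration: the exact escaping axios applies may differ slightly (for example, for commas), and the shortened search string is just a sample.

```javascript
// Same parameter names as in AXIOS_OPTIONS.params, with a shortened sample query.
const params = {
  content: "Albert Einstein,Sherlock Holmes",
  year_start: 1800,
  year_end: 2019,
};

// Append every parameter to the endpoint used in this post.
const url = new URL("https://books.google.com/ngrams/json");
for (const [key, value] of Object.entries(params)) {
  url.searchParams.set(key, String(value));
}

const requestUrl = url.toString();
console.log(requestUrl);
```

Printing the URL this way can help when debugging a request in the browser before running the full script.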
Next, we write a function that processes the received data and saves it to a ".png" file:
async function saveChart(chartData) {
  ...
}
In this function, we need to declare the canvas width, height and backgroundColour, then build it using chartjs-node-canvas:
const width = 1920; //chart width in pixels
const height = 1080; //chart height in pixels
const backgroundColour = "white"; // Uses https://www.w3schools.com/tags/canvas_fillstyle.asp
const chartJSNodeCanvas = new ChartJSNodeCanvas({ width, height, backgroundColour });
Then, we need to define and create the "x" axis labels. To do this, we create a new array whose length equals the number of years from startYear to endYear (we add 1 so that both the start and the end year are included).
Then we fill the array with startYear and add the element position (i) to each value (using the map() method):
const labels = new Array(endYear - startYear + 1)
  .fill(startYear)
  .map((el, i) => (el += i));
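The same labels can be produced in a single step with Array.from, which avoids reassigning the callback parameter. This is just an alternative sketch, not a required change:

```javascript
const startYear = 1800;
const endYear = 2019;

// One label per year, both ends included.
const labels = Array.from(
  { length: endYear - startYear + 1 },
  (_, i) => startYear + i
);

console.log(labels.length, labels[0], labels[labels.length - 1]); // 220 1800 2019
```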
Next, we need to create the configuration object for the chart.js library. In this object, we define the chart type, data, and options.
In the chart data, we define the main axis labels and build datasets from the received chartData, setting a label, data, and a random color for each line (using the Math.random() and parseInt() methods). Each timeseries value is multiplied by 100 so that it is shown as a percentage.
In the chart options, we set the 'y' axis title and enable its display (display property):
const configuration = {
  type: "line", // for line chart
  data: {
    labels,
    datasets: chartData?.map((el) => {
      const data = el.timeseries.map((el) => el * 100);
      return {
        label: el.ngram,
        data,
        borderColor: [`rgb(${parseInt(Math.random() * 255)}, ${parseInt(Math.random() * 255)}, ${parseInt(Math.random() * 255)})`],
      };
    }),
  },
  options: {
    scales: {
      y: {
        title: {
          display: true,
          text: "%",
        },
      },
    },
  },
};
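Random colors may occasionally come out very similar to each other or too pale against the white background. If that becomes a problem, one possible alternative is to space the hues evenly, as in the sketch below. The helper name distinctColors and the saturation and lightness values are assumptions chosen for illustration; Chart.js accepts CSS color strings such as hsl().

```javascript
// Hypothetical helper: returns `count` colors with evenly spaced hues.
function distinctColors(count) {
  return Array.from(
    { length: count },
    (_, i) => `hsl(${Math.round((360 * i) / count)}, 70%, 45%)`
  );
}

// Six search terms are used in this post, so six colors are needed.
const colors = distinctColors(6);
console.log(colors);
```

Each dataset could then take its color by index (for example, borderColor: colors[i] inside the map() callback) instead of calling Math.random().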
Next, we wait for the chart to be rendered as a base64-encoded data URL, strip the data type prefix from the base64 string (replace() method), and save the "chart.png" file with the writeFile() method:
const base64Image = await chartJSNodeCanvas.renderToDataURL(configuration);
const base64Data = base64Image.replace(/^data:image\/png;base64,/, "");
fs.writeFile("chart.png", base64Data, "base64", function (err) {
  if (err) {
    console.log(err);
  }
});
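Note that the callback form of writeFile() is not awaited, so saveChart() resolves before the file is actually written. If the caller needs to know when the file is ready, a promise-based variant is one option. The sketch below uses only Node's built-in modules; dataUrlToBuffer and saveDataUrl are hypothetical helper names, and the demo data URL does not contain a real image.

```javascript
const fs = require("fs/promises");
const os = require("os");
const path = require("path");

// Strip the data URL prefix and decode the remaining base64 payload.
function dataUrlToBuffer(dataUrl) {
  const base64Data = dataUrl.replace(/^data:image\/png;base64,/, "");
  return Buffer.from(base64Data, "base64");
}

// Resolves only after the file has been written.
async function saveDataUrl(dataUrl, filePath) {
  await fs.writeFile(filePath, dataUrlToBuffer(dataUrl));
  return filePath;
}

// Demo with placeholder bytes instead of a rendered chart.
const demoUrl = "data:image/png;base64," + Buffer.from("not a real image").toString("base64");
const demoPath = path.join(os.tmpdir(), "chart-demo.png");

saveDataUrl(demoUrl, demoPath)
  .then((savedPath) => console.log(`Saved ${savedPath}`))
  .catch((err) => console.error(err));
```

Inside saveChart(), the same idea would amount to awaiting saveDataUrl(base64Image, "chart.png") in place of the callback call.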
Then, we write a function that makes the request and returns the received data. The response from the axios request has a data key, which we destructure and return:
function getChart() {
  return axios
    .get(`https://books.google.com/ngrams/json`, AXIOS_OPTIONS)
    .then(({ data }) => data);
}
And finally, we need to run our functions:
getChart().then(saveChart);
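The chain above has no rejection handler and no check on the payload, so a blocked request or an unexpected response would surface as an unhandled error. A minimal sketch of a sanity check is shown below. The helper findDataProblems is hypothetical, and it assumes that every timeseries holds one value per year between the start and end years.

```javascript
// Hypothetical helper: returns a list of problems found in the response,
// or an empty array when the data looks usable for the chart.
function findDataProblems(data, firstYear, lastYear) {
  if (!Array.isArray(data) || data.length === 0) {
    return ["response is empty or not an array"];
  }
  // Assumption: one timeseries value per year, both ends included.
  const expectedLength = lastYear - firstYear + 1;
  return data
    .filter((entry) => !Array.isArray(entry.timeseries) || entry.timeseries.length !== expectedLength)
    .map((entry) => `unexpected timeseries length for "${entry.ngram}"`);
}

// Made-up data covering 2000-2002 (three values expected per entry).
const goodData = [{ ngram: "first phrase", timeseries: [0, 1e-9, 2e-9] }];
const badData = [{ ngram: "second phrase", timeseries: [0, 1e-9] }];

console.log(findDataProblems(goodData, 2000, 2002));
console.log(findDataProblems(badData, 2000, 2002));
console.log(findDataProblems(undefined, 2000, 2002));

// Possible wiring with the functions from this post (not executed here):
// getChart()
//   .then((data) => {
//     const problems = findDataProblems(data, startYear, endYear);
//     if (problems.length) throw new Error(problems.join("; "));
//     return saveChart(data);
//   })
//   .catch((err) => console.error(`Scraping failed: ${err.message}`));
```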
Now we can launch our parser:
$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file
Saved file
If you want to see some projects made with SerpApi, write me a message.
Add a Feature Request or a Bug