Web Scraping YouTube Autocomplete with Node.js
What will be scraped
📌Note: For now, we don't have an API that supports extracting autocomplete data.
This blog post shows how you can do it yourself in the meantime, while we work on releasing a proper API. We'll announce it on our Twitter once the API is released.
Full code
If you don't need an explanation, have a look at the full code example in the online IDE
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const queries = ["javascript", "node", "web scraping"];
const URL = "https://www.youtube.com";

async function getYoutubeAutocomplete() {
  const browser = await puppeteer.launch({
    headless: false,
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();
  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);
  await page.waitForSelector("#contents");

  const autocompleteResults = [];

  // `const` keeps `query` scoped to the loop instead of creating an implicit global
  for (const query of queries) {
    await page.click("#search-input");
    await page.keyboard.type(query);
    await page.waitForTimeout(5000);
    const results = {
      query,
      autocompleteResults: await page.evaluate(() => {
        return Array.from(document.querySelectorAll(".sbdd_a li"))
          .map((el) => el.querySelector(".sbqs_c")?.textContent.trim())
          .filter((el) => el);
      }),
    };
    autocompleteResults.push(results);
    await page.click("#search-clear-button");
    await page.waitForTimeout(2000);
  }

  await browser.close();

  return autocompleteResults;
}

getYoutubeAutocomplete().then(console.log);
Preparation
First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth to control Chromium (or Chrome, or Firefox, but here we work only with Chromium, which is used by default) over the DevTools Protocol in headless or non-headless mode.
To do this, open the command line in the project directory and enter npm init -y, and then npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth.
*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.
📌Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth to prevent websites from detecting that you are using headless Chromium or a web driver. You can check this on the Chrome headless tests website. The screenshot below shows the difference.
Process
The SelectorGadget Chrome extension was used to grab CSS selectors by clicking on the desired element in the browser. If you have trouble understanding this, we have a dedicated Web Scraping with CSS Selectors blog post at SerpApi.
The GIF below illustrates the approach of selecting different parts of the results.
Code explanation
Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to prevent websites from detecting that you are using a web driver:
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
Next, we tell puppeteer to use StealthPlugin, and define the search queries and the YouTube URL:
puppeteer.use(StealthPlugin());
const queries = ["javascript", "node", "web scraping"];
const URL = "https://www.youtube.com";
Next, write a function to control the browser, and get information:
async function getYoutubeAutocomplete() {
...
}
In this function, we first define browser using the puppeteer.launch({options}) method with the options headless: false and args: ["--no-sandbox", "--disable-setuid-sandbox"].
With headless: false the browser launches with a visible window (non-headless mode), and the args array allows the browser process to launch in the online IDE. Then we open a new page:
const browser = await puppeteer.launch({
  headless: false,
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});
const page = await browser.newPage();
Next, we use the .setDefaultNavigationTimeout() method to raise the default navigation timeout from 30 seconds to 60000 ms (1 min) for slow internet connections, go to URL with the .goto() method, and use the .waitForSelector() method to wait until the #contents selector appears on the page:
await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);
await page.waitForSelector("#contents");
Then, we define an array for the results, called autocompleteResults, and start a for...of loop to iterate over all queries:
const autocompleteResults = [];
for (const query of queries) {
  ...
}
Next, in the loop we click on #search-input (.click() method), type the current query with the page.keyboard.type(query) method and wait 5 seconds using the .waitForTimeout(5000) method:
await page.click("#search-input");
await page.keyboard.type(query);
await page.waitForTimeout(5000);
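The fixed 5-second pause works, but it always waits the full time even when the suggestions show up sooner. Here is a minimal sketch of an alternative, assuming the suggestion items still match the .sbdd_a li selector used in this post. waitForSuggestions and delay are hypothetical helper names, not part of Puppeteer; page.waitForSelector() and its timeout option are.

```javascript
// Hypothetical helper: resolves to true once an element matching `selector`
// appears, or false if that does not happen within `timeout` milliseconds.
// `page` is assumed to be a Puppeteer Page (or anything exposing a compatible
// waitForSelector(selector, { timeout }) method).
async function waitForSuggestions(page, selector = ".sbdd_a li", timeout = 5000) {
  try {
    await page.waitForSelector(selector, { timeout });
    return true;
  } catch (error) {
    // Any rejection (typically a timeout when nothing matches in time)
    // is treated as "no suggestions for this query".
    return false;
  }
}

// Plain-JavaScript pause for places where a fixed delay is still wanted.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
```

Inside the loop this could replace the pause, for example const found = await waitForSuggestions(page);. If suggestions from a previous query stay in the DOM after clearing the input, passing { visible: true } to waitForSelector() may be needed; that depends on the page markup and is not verified here.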
Then, we build the results object, which has query and autocompleteResults keys. We get autocompleteResults using the page.evaluate() method, which runs the function passed to it in the browser context.
There we use the .querySelectorAll() method, which returns a static NodeList of the document's elements that match the CSS selector in the brackets, and convert the result to an array with the Array.from() method so that we can call .map() and .filter() on it.
After that, for each of the .sbdd_a li elements we find the element with the class name sbqs_c (.querySelector() method), get its raw text (textContent property) and remove whitespace from both ends of the string with the .trim() method. Because some items at the end of the list are empty, the ?. operator gives undefined for them instead of throwing, and we then filter the array to keep only truthy values (.filter((el) => el)):
const results = {
  query,
  autocompleteResults: await page.evaluate(() => {
    return Array.from(document.querySelectorAll(".sbdd_a li"))
      .map((el) => el.querySelector(".sbqs_c")?.textContent.trim())
      .filter((el) => el);
  }),
};
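To see what the map and filter steps do without opening a browser, here is a small sketch that runs the same chain over plain objects. The rows array is made-up stand-in data, not real YouTube markup: each object mimics an element whose querySelector() returns either an element-like object or null. fakeRow is a hypothetical helper used only for this illustration.

```javascript
// Stand-in for the <li> elements: querySelector() returns an element-like
// object with textContent, or null when the row has no .sbqs_c child.
const fakeRow = (text) => ({
  querySelector: () => (text === null ? null : { textContent: text }),
});

const rows = [
  fakeRow("  javascript tutorial "),
  fakeRow(null), // empty row, like the ones at the end of the list
  fakeRow("node js"),
  fakeRow("   "), // whitespace only
];

const suggestions = rows
  // ?. yields undefined instead of throwing when querySelector() returns null
  .map((el) => el.querySelector(".sbqs_c")?.textContent.trim())
  // drops undefined and empty strings, which are both falsy
  .filter((el) => el);

console.log(suggestions); // [ 'javascript tutorial', 'node js' ]
```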
Next, we push the results object from the current iteration to the autocompleteResults array, click #search-clear-button to clear the search input and wait 2 seconds before the next iteration:
autocompleteResults.push(results);
await page.click("#search-clear-button");
await page.waitForTimeout(2000);
And finally, we close the browser and return received data:
await browser.close();
return autocompleteResults;
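One caveat with closing the browser on the last line: if anything before it throws (a selector that never appears, for instance), browser.close() is never reached and the Chromium process keeps running. Below is a minimal sketch of a guard using try...finally. withBrowser is a hypothetical helper name, and launch stands for any function that returns a browser-like object, such as () => puppeteer.launch(options).

```javascript
// Hypothetical helper: runs `task(browser)` and always closes the browser,
// whether the task resolves or rejects.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}

// Possible usage with the code from this post (not executed here):
// withBrowser(
//   () => puppeteer.launch({ headless: false, args: ["--no-sandbox", "--disable-setuid-sandbox"] }),
//   async (browser) => {
//     const page = await browser.newPage();
//     // ...scraping steps from getYoutubeAutocomplete()...
//   }
// ).then(console.log);
```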
Now we can launch our parser:
$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file
Output
[
{
"query":"javascript",
"autocompleteResults":[
"javascript",
"javascript tutorial for beginners",
"javascript full course",
"javascript tutorial",
"javascript dom",
"javascript mastery",
"javascript course",
"javascript interview questions and answers",
"javascript for beginners",
"javascript с нуля",
"javascript project",
"javascript ninja",
"javascript game",
"javascript interview"
]
},
{
"query":"node",
"autocompleteResults":[
"node js",
"node js tutorial",
"node",
"node js project",
"node js express",
"node js interview",
"node video tutorial",
"node video",
"node js interview questions",
"node js event loop",
"node js уроки",
"nodemailer",
"node red",
"nodemcu"
]
},
{
"query":"web scraping",
"autocompleteResults":[
"web scraping weather data python",
"web scraping",
"web scraping python",
"web scraping javascript",
"web scraping amazon product",
"web scraping amazon price",
"web scraping amazon",
"web scraping amazon reviews",
"web scraping amazon reviews python",
"web scraping indeed",
"web scraping flight prices",
"web scraping using python",
"web scraping tutorial"
]
}
]
Extract suggestions from Google Autocomplete Client
The previous example was the "hard" way. You can also get the data from the following URL, which returns a text file:
"https://clients1.google.com/complete/search?client=youtube&hl=en&q=minecraft"
If you want to see some projects made with SerpApi, please write me a message.