In my previous blog post, I already shared how we handle a SerpApi request. Different search engines have different ways of requesting data and extracting information.
In this blog post, I'll discuss the available methods for extracting data from search engines. Each method has advantages and disadvantages, and choosing the right one is what makes your scraper stand out from the crowd.
This is a general survey of the data extraction methods used in the scraping industry. At SerpApi, we spent a lot of time experimenting with these methods and benchmarking them before putting any of them into production.
Let's get started with the most famous method, also the easiest one: the browser.
Browser
We're all familiar with browsers like Chrome, Firefox, and Edge. When you visit a search engine results page, all the available information is there.
The job is to select the correct elements, and you're done! Let's do a simple exercise:
- Open the browser with: https://google.com/search?q=coffee
- Open the developer tools (Command + Option + I in Chrome on macOS) and switch to the Console tab
- Run this script:
document.querySelectorAll(".LC20lb.MBeuO.DKV0Md").forEach(node => console.info(node.innerText))
You will see the titles of all organic results printed in the console.
FYI: In reality, Google has many layouts for its search results, so the CSS selectors are far more complicated than in this example.
When you visit the page, the browser fetches the HTML, JavaScript, and stylesheets.
The stylesheets only make the page look nice, so we can ignore that UI part when focusing on extracting information.
As the page loads, the browser executes the JavaScript, which changes parts of the HTML, and only then do you get the final result.
That's why using a browser makes the job super easy. Everything is done by the browser; you just need to write the script that extracts the data.
The complexity of the extraction script depends on how many search result layouts you want to support.
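One way to cope with several layouts is to keep an ordered list of selectors and use the first one that matches. The sketch below is only an illustration: `TITLE_SELECTORS` and `extractTitles` are hypothetical names, and the second selector (`h3`) is an assumed generic fallback, not something Google guarantees.

```javascript
// Ordered list of candidate selectors, most specific first.
// The first entry comes from the exercise above; "h3" is an assumed fallback.
const TITLE_SELECTORS = [".LC20lb.MBeuO.DKV0Md", "h3"];

// Return the titles found by the first selector that matches anything.
// `root` is any object with querySelectorAll, e.g. `document` in the browser.
function extractTitles(root, selectors = TITLE_SELECTORS) {
  for (const selector of selectors) {
    const nodes = Array.from(root.querySelectorAll(selector));
    if (nodes.length > 0) {
      return nodes.map((node) => node.innerText.trim());
    }
  }
  return []; // no known layout matched
}
```

In the browser console, you could call it as `extractTitles(document)`. Supporting a new layout then means adding a selector to the list rather than rewriting the script.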
This manual method does not scale to scraping a large number of pages. We need a web driver to control the browser programmatically and run scripts automatically. Popular libraries for this include Playwright, Selenium, and Puppeteer.
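As a rough sketch, the manual exercise above could be automated with Playwright as follows. This assumes the `playwright` npm package is installed. `buildSearchUrl` and `scrapeTitles` are hypothetical helper names, and the selector is the simplified one from the exercise, so real pages may need a different one.

```javascript
// Build the search URL used in the exercise above for a given query.
function buildSearchUrl(query) {
  const url = new URL("https://google.com/search");
  url.searchParams.set("q", query);
  return url.toString();
}

// Open a browser (headless by default), load the results page,
// and collect the title text of every matching element.
async function scrapeTitles(query, selector = ".LC20lb.MBeuO.DKV0Md") {
  // Loaded lazily so buildSearchUrl works even without Playwright installed.
  const { chromium } = await import("playwright");
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(buildSearchUrl(query), { waitUntil: "domcontentloaded" });
    // The callback runs inside the page, just like the console snippet.
    return await page.$$eval(selector, (nodes) =>
      nodes.map((node) => node.innerText)
    );
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
}
```

A call such as `scrapeTitles("coffee").then(console.log)` would print whatever titles the selector matches on the loaded page.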
Pros
- Easy to build and maintain
- Large community
- Well documented
Cons
- Slow. We must wait for the page to load completely before executing the scripts.
- Unexpected errors while loading pages, such as new popups, notifications, or other new UI elements
The big problem with this approach is speed. Scraping a large number of pages takes a lot of time. To address this, we can look at a newer technology: Browserless
Browserless
Browserless is a promising technology for scraping web content. As its name suggests, it offers full browser features without rendering a UI, which makes it much faster than driving a full Chrome or Firefox. We ran many benchmarks and experiments with Browserless, and it's fast enough for most use cases.
It has a nice API for navigating pages and running scripts. You can try it yourself.
Pros
- Easy to build and maintain
- Well documented
Cons
- You need to pay for the hosted service or for a self-hosting license
- With the self-hosted option, you need to maintain the infrastructure
- Unexpected errors or unfamiliar new UI elements
Raw HTML
Unlike the approaches above, fetching the raw HTML and extracting information directly from it is the fastest and most reliable method. You don't need a browser to load the full page. Just like making an API request, we send a request to Google and get back only the HTML file; no extra CSS or JavaScript files are needed. All the information should be there.
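To make the idea concrete, here is a minimal sketch of the raw HTML approach. It assumes Node 18+ for the built-in `fetch`. `extractH3Titles` and `fetchTitles` are hypothetical names, and the regular expression over `<h3>` tags is a deliberately naive stand-in for a real HTML parser and real per-layout selectors.

```javascript
// Pull the text of every <h3> element out of an HTML string.
// Naive on purpose: a regex is not a robust HTML parser.
function extractH3Titles(html) {
  const titles = [];
  const h3Pattern = /<h3\b[^>]*>([\s\S]*?)<\/h3>/gi;
  for (const match of html.matchAll(h3Pattern)) {
    const text = match[1].replace(/<[^>]+>/g, "").trim(); // drop nested tags
    if (text) titles.push(text);
  }
  return titles;
}

// Request only the HTML document; no CSS or JavaScript is downloaded or run.
async function fetchTitles(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`Request failed: ${response.status}`);
  return extractH3Titles(await response.text());
}
```

Because nothing is rendered or executed, the cost is a single HTTP request plus string processing. The hard part is everything the sketch leaves out: working out, for each layout, where the data lives in the markup.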
Pros
- Super fast and reliable.
Cons
- Hard to build and maintain
- No documentation and no guidelines. Different SERPs require different ways of extracting data
Extracting data with AI
AI is eating the world. Some products use AI to extract data. At SerpApi, we have also run some experiments on scraping data with AI. Using AI is pretty straightforward. You just need to define the expected JSON schema and let the AI figure out how to return data in exactly that shape. We also tested it on some difficult layouts, such as local pack results, and it returned correct results.
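The schema-driven workflow might look roughly like the sketch below. Everything here is illustrative: the schema shape, `buildPrompt`, and `parseOrganicResults` are hypothetical, and the actual model call is left out because it depends on the provider you choose.

```javascript
// The shape we expect back: a list of organic results.
// This schema is an assumed example, not SerpApi's actual format.
const organicResultsSchema = {
  type: "array",
  items: {
    type: "object",
    required: ["position", "title", "link"],
    properties: {
      position: { type: "number" },
      title: { type: "string" },
      link: { type: "string" },
    },
  },
};

// Combine the schema and the page HTML into one instruction for the model.
function buildPrompt(html, schema = organicResultsSchema) {
  return [
    "Extract the organic search results from the HTML below.",
    "Reply with JSON only, matching this JSON schema:",
    JSON.stringify(schema),
    "HTML:",
    html,
  ].join("\n");
}

// Parse the model's reply and check the required fields and their types,
// since a model can return malformed or incomplete JSON.
function parseOrganicResults(reply, schema = organicResultsSchema) {
  const data = JSON.parse(reply);
  if (!Array.isArray(data)) throw new Error("Expected a JSON array");
  const { required, properties } = schema.items;
  for (const item of data) {
    for (const key of required) {
      if (typeof item[key] !== properties[key].type) {
        throw new Error(`Invalid or missing field: ${key}`);
      }
    }
  }
  return data;
}
```

You would send the string from `buildPrompt(html)` to the model of your choice and pass its reply to `parseOrganicResults`. Validating the reply matters here because the extraction logic lives inside the model rather than in code you control.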
Pros
- Very easy to build and get correct results
- Well documented
Cons
- Slow
- Very pricey
Conclusion
After this survey, you can probably guess which approach SerpApi uses. We are famous for our high-quality and fast SERP APIs, and we extract data from raw HTML files. We spent years developing our APIs to support many layouts and result types. We take our time to save yours.
If you have any questions, feel free to send me an email: andy@serpapi.com