How to scrape Google Local Results with Artificial Intelligence?
There is a great deal of hype around the words used to describe cutting-edge technologies, and around the phrase "cutting-edge technology" itself. As with most things surrounded by hype, it is caused by a series of miscommunications between the different parties involved with the subject.
The people who theorize about a technology are most often separated from the people who implement it as a solution, just as the implementers are separated from the people who use it. In time, because certain words are easy to recognize and convey the might of an idea, you end up surrounded by people who talk about the subject without any specifics, expecting to be understood.
SerpApi does not claim to be the only company to have a say in Artificial Intelligence. Yet, we are driven to give it an honest effort, and I will be sharing the journey, explicitly, every Wednesday as the implementation evolves.
The Starting Point – The Problem
“Exceptio probat regulam de rebus non exceptis.”
Like many people, we didn't know where to start. What we did have was a problem that might be resolved by Artificial Intelligence, more specifically, by a machine learning model. SerpApi's Google Local Results Scraper API was one of the most exhausting parts of our service: for the engineers developing it and finding fixes for recent bugs, for the customer service specialists going back and forth between engineering and customers countless times, and for the customers themselves.
Constant changes in the page structure and class names, along with the many elements a parser could confuse with one another, were and still are causing breakages in the code.
When it comes to using new technology to solve old problems, it is crucial to evolve the implementation around the problem itself.
For example, in the end, this solution should meet a few requirements:
- It should do what the traditional parser does, and do it more accurately.
- It should work at a reasonable speed (preferably faster than the traditional parser).
- It should be easy for someone with minimal knowledge to pick up and update (preferably more easily than the traditional parser).
- It should not cost more in fees than the current setup.
- There should be a decent way to test it.
- It should be reusable for other problems with only a little tweaking.
- It should be in line with models across the rest of the internet, so it can be compared and contrasted with ease.
Breaking Up the Problem
“Iucundum est narrare sua mala.”
The implementation needs to be broken into self-sufficient parts that address different parts of the problem. In our case, the first problem it needs to solve is classifying the different parsed values without needing any conditions within the parser itself.
Why does it need to be broken into parts? Why don’t we just create a model that takes HTML and gives the resulting JSON with all the values in it? The reasons for that are simple but important. Such a model would take too much processing power if not optimized carefully.
This means it would need third-party cloud services such as Azure Machine Learning, Amazon SageMaker, or Google AutoML to train it. Don't get me wrong: these services, alongside many others, are great tools for building and training models.
However, solving one problem with machine learning should not come at the cost of setting up, renting, and maintaining new servers. If we were solving many problems with machine learning, adopting such services would be feasible. Plus, our aim of keeping everything in a standard form, in line with the rest of the internet, should make it easy to use these services in the future.
Aside from that, a model that addresses the entirety of the problem all at once cannot be expected to be fixed by someone with minimal knowledge. If a problem occurs in one function of the model, that function should be self-contained.
I hope I have conveyed the idea behind breaking the problem into parts to be solved by complementary models. Now, let's talk about why classifying the different parsed values is the starting point of this tapestry.
One of the main problems we face in SerpApi's Google Local Results Scraper API is values ending up under the wrong keys within the JSON. A parsed address might be served as the title of the place, a phone number could be served as working hours, etc.
Note that we already have traditional solutions within the parser to check the correctness of these values. However, as Google Local Results evolves at great speed, and with our diverse engine pool, it is getting harder to fix every problem that occurs.
For these reasons, it is best to start with a frequent problem that could be solved by a classifier machine learning model.
Game Plan for SerpApi’s Google Local Results Scraper API
“Malum consilium quod mutari non potest”
We have an ideal plan for this effort. However, everything is subject to change as the implementation grows. That doesn't mean we will start from scratch each time to implement everything. Let me explain this while walking through the complete set of models that will resolve the problem.
Ideal Model
1. Take the HTML and break it into different parts, each concerning a different field to fill.
   Example: The part that contains all local_results is separated from the rest of the HTML, including the extractable details hidden within it.
2. Take the part that contains all local_results related HTML and turn it into an HTML document that doesn't contain unrelated parts.
   Example: If there are ads, they will not be included in this new HTML.
3. Break the different local_results apart into individual fields to be parsed.
   Example: The main element that concerns an individual local_result, along with the deeper information about it hidden within the HTML, is extracted.
4. Extract every value and make an array out of them.
   Example: An array containing the different fields to be served is created from each individual block.
5. Classify each extracted value.
   Example: Addresses are classified as addresses, phone numbers as phone numbers.
6. Take each classified value and run it through a model that decides whether it is correctly classified, or whether it is a new kind of value.
   Example: If Google starts serving a metaverse equivalent of a restaurant, the model should be able to pick it up and trigger.
7. Create a new key name based on the title of the individual part, or on the rest of SerpApi's JSON keys.
   Example: A model should be responsible for naming the key of a new field. In this example, it should take in title_name, or different fields throughout SerpApi, and come up with the key metaverse.
As you can see, step 5 carries significant importance. Step 6 and step 7 are only operational when there is a new field to be served. So they can be implemented later. Everything up to step 5 already has a solution within the traditional parser. This is why it is important to start implementing from step 5.
Next in line is any of the steps before step 5. They don't need to follow a particular order, since each could be implemented part by part into the traditional parser.
Step 6 and step 7 can be resolved by retraining with a new dataset that includes the new field, and the field key could easily be named manually by the engineer updating the model. That's why they should be resolved at a later stage.
Technical Challenges – Machine Learning on Rails
“Sunt facta verbis difficiliora”
All of the above considerations don't have any meaning without a working implementation. Implementing it on Rails is another hard problem because of the limited number of machine learning libraries available for Ruby.
But first, let's be clear about which model to use for classifying the different fields. We will be creating a character-level Recurrent Neural Network (RNN) model to solve our problem.
An RNN is a neural network that processes a sequence one step at a time while carrying a hidden state between steps. In a plain feedforward neural network, nodes do not form a cycle; an RNN adds a loop by feeding the hidden state of one step back in at the next step. In simpler words, we will feed in a value character by character, the model will forward each character together with the current hidden state to the hidden layer, the hidden layer will forward it to the softmax layer, and the softmax layer will produce the output.
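To make that structure concrete, here is a minimal sketch of such a character-level classifier. It follows the classic character-level RNN classification recipe and assumes a Ruby machine learning library with a PyTorch-like API (the Libtorch bindings discussed later in this post); names such as Torch::NN::Module, Torch::NN::Linear, Torch.cat, and Torch::NN::F.log_softmax are assumptions based on that mirroring, not code from our parser.

require "torch" # Ruby bindings for Libtorch; API assumed to mirror PyTorch

# Sketch of a character-level RNN classifier (illustrative, not production code).
class CharRNN < Torch::NN::Module
  def initialize(input_size, hidden_size, output_size)
    super()
    @hidden_size = hidden_size
    # Both layers see the current character and the previous hidden state.
    @input_to_hidden = Torch::NN::Linear.new(input_size + hidden_size, hidden_size)
    @input_to_output = Torch::NN::Linear.new(input_size + hidden_size, output_size)
  end

  # One step: the current character tensor plus the previous hidden state go in,
  # log-softmax scores over the keys plus the new hidden state come out.
  def forward(input, hidden)
    combined = Torch.cat([input, hidden], 1)
    hidden = @input_to_hidden.call(combined)
    output = Torch::NN::F.log_softmax(@input_to_output.call(combined), 1)
    [output, hidden]
  end

  def init_hidden
    Torch.zeros(1, @hidden_size)
  end
end

Feeding a word means calling forward once per character and keeping the last output; for instance, CharRNN.new(5, 128, 3) would fit the 5-letter alphabet, an arbitrary 128 hidden units, and the 3 output keys used in the examples below.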
What will the input be? The input will be a tensor built from the letters of a word, with each letter encoded as a one-hot vector. Here's how: if we have an alphabet of 5 letters, one letter will constitute a 1x5 matrix whose single non-zero entry depends on the letter's index in the alphabet.
Example:
alphabet = [a,b,c,d,e]
* a = Tensor([1,0,0,0,0])
* b = Tensor([0,1,0,0,0])
* c = Tensor([0,0,1,0,0])
…
Each word will constitute a 10x1x5 tensor. Why 10? It is the maximum number of letters we allow in a word.
Example:
dead = Tensor( [0,0,0,1,0], #d
[0,0,0,0,1], #e
[1,0,0,0,0], #a
[0,0,0,1,0], #d
[0,0,0,0,0], #null
[0,0,0,0,0], #null
[0,0,0,0,0], #null
[0,0,0,0,0], #null
[0,0,0,0,0], #null
[0,0,0,0,0]) #null
Notice that we use zero vectors to pad the rest of the space assigned in the matrix.
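As a small illustration, that encoding can be built in plain Ruby before handing it to the model; the constants and helper names below are made up for this example.

ALPHABET = %w[a b c d e]
MAX_LETTERS = 10 # maximum number of letters we allow per word

# One letter becomes a one-hot vector of size ALPHABET.size.
def letter_to_vector(letter)
  vector = Array.new(ALPHABET.size, 0)
  index = ALPHABET.index(letter)
  vector[index] = 1 if index
  vector
end

# A word becomes MAX_LETTERS rows, padded with zero vectors at the end.
def word_to_matrix(word)
  rows = word.chars.first(MAX_LETTERS).map { |letter| letter_to_vector(letter) }
  rows << Array.new(ALPHABET.size, 0) while rows.size < MAX_LETTERS
  rows
end

word_to_matrix("dead")
# => [[0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [0, 0, 0, 1, 0],
#     plus six zero rows of padding]

Wrapping each row in its own array (or reshaping after passing the result to Torch.tensor) gives the 10x1x5 shape described above.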
This way, we can turn a word into a mathematical object we can play around with, and then feed it to the model to get softmax outputs. What will the softmax output be? It will be a probability distribution over the possible keys for a word (or a sentence read as one word). Let's assume we have 3 keys, namely address, phone, and title.
Example: 5540 N Lamar Blvd #12 => model => [address: 0.80, phone: 0.05, title: 0.15]
From the softmax distribution, we take the maximum, and the key corresponding to that maximum will be our output. So, in the end, we'll feed in 5540 N Lamar Blvd #12 and it will give us address.
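Purely as an illustration, picking the winning key from such a distribution is a one-liner once the scores are in a hash; the numbers are the hypothetical ones from the example above.

# Hypothetical softmax output for "5540 N Lamar Blvd #12"
scores = { address: 0.80, phone: 0.05, title: 0.15 }

predicted_key, _probability = scores.max_by { |_key, probability| probability }
predicted_key # => :address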
For the creation of the training dataset, we will be creating a JSON file filled with classified values, and then we will check manually (for now) whether any value doesn't belong to its class.
Thankfully, SerpApi's Google Local Results Scraper API's parser isn't giving us any major problems these days, so building the dataset will be possible with a simple Rake command.
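To give an idea of what such a dataset could look like, here is a hypothetical shape for that JSON file; the keys mirror the ones discussed above, and the values and file path are illustrative, not pulled from real results.

require "json"

# Hypothetical training dataset: each key maps to values the parser classified
# under it, to be checked manually before training.
dataset = {
  "address" => ["5540 N Lamar Blvd #12"],
  "phone"   => ["(512) 555-0100"],     # illustrative number
  "title"   => ["Example Pizza Place"] # illustrative title
}

File.write("ml/dataset.json", JSON.pretty_generate(dataset)) # illustrative path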
There aren't many libraries in Ruby that could address our problem. So we will be creating the models using Libtorch and a Ruby gem that translates its C++ API to Ruby, which I will cover in the next blog post.
Anyone who creates and trains a model will place the necessary files within Rails in a folder that is excluded from CI. Here's the reason: Libtorch with CUDA enabled (which lets the user utilize GPUs for training) is around 2.5 GB in size. Deploying it on every server would be costly.
So, for now, any engineer who trains a model will set up Libtorch and configure it on their local system, following step-by-step instructions for each OS. Upon training, a PTH file will be created. This file will contain the model we trained.
Here's another challenge for Rails: we cannot read these files without Libtorch, and Libtorch only lives on the engineer's local machine. So, we thought about setting up a separate server to host Libtorch.
But, again, we don't want any extra expenses for a problem that is this narrow at this point. So, instead, we will be converting the PTH file to an ONNX file, which is a generally accepted format. Finally, the gems that depend on Libtorch should be commented out to avoid CI failures, and the folder containing the files that work with Libtorch should also be kept out of the scope of CI.
Training the model and converting it to ONNX format will also be made possible via a simple Rake command.
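As an illustration of how the parser could consume the exported model without Libtorch, here is a minimal sketch; it assumes the onnxruntime gem is used as the gem responsible for reading ONNX files, and the file path and input name are placeholders rather than settled decisions.

require "onnxruntime"

# Load the exported model once; no Libtorch is needed at parse time.
model = OnnxRuntime::Model.new("ml/classifier.onnx") # illustrative path

# A 10x1x5 one-hot encoding of a word, as plain Ruby arrays
# (see the word_to_matrix sketch earlier); zeros here are only a placeholder.
encoded_word = Array.new(10) { [Array.new(5, 0.0)] }

# The input name "input" is an assumption; it depends on how the model is exported.
outputs = model.predict("input" => encoded_word)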
Testing the model will be done in RSpec by calling it numerous times and comparing its outputs to the training dataset to come up with a percentage of correctness. Note that this percentage should be better than what our current parser achieves.
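A test along these lines could look like the following sketch; the Classifier wrapper, the file path, and the 90% threshold are placeholders, not the actual spec.

# spec/ml/classifier_spec.rb (illustrative)
require "json"

RSpec.describe "Local results value classifier" do
  it "classifies the training dataset above an accuracy threshold" do
    dataset = JSON.parse(File.read("ml/dataset.json"))

    # Classifier.predict is a hypothetical wrapper around the ONNX model.
    results = dataset.flat_map do |expected_key, values|
      values.map { |value| Classifier.predict(value) == expected_key }
    end

    accuracy = results.count(true).fdiv(results.size)
    expect(accuracy).to be > 0.9 # placeholder threshold
  end
end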
So, to give step-by-step instructions for a general working standard:
- Set up Libtorch, and relevant libraries
- Create a Model
- Create a Dataset
- Manually Check Dataset for errors
- Train the model and convert it to ONNX with a simple Rake command
- Test the model with Rspec
- Implement the model within the traditional parser (by feeding it the ONNX file)
- Comment out the Ruby gems related to Libtorch (excluding the one responsible for reading ONNX files)
- Push the PR
- Explain briefly in the PR description how it is more effective than the traditional parser
- Merge it and inform the customers with a blog post.
Conclusion
“Vi veri universum vivus vici”
SerpApi values honesty in its work. So, as part of that principle, it is crucial to point out that this adventure may or may not bear fruit. However, I hope I was able to share some of the nuances of implementing Artificial Intelligence.
I'd like to thank our smart, talented, and passionate team for all their support in the creation of this blog post and for encouraging this implementation. Next week, we'll be covering more in-depth details of the implementation.