Traditional Parser vs Open Source LLM Parser (Mistral 7B)
At SerpApi, we are always looking for methods to improve our web scraping parser to better serve our customers. Previously, we have experimented with Machine Learning for parsing and make a comparison against our traditional parser. This time, we're turning our attention to Large Language Models (LLMs) to evaluate their performance. The surge in popularity of Large Language Models has led to numerous organizations, including Facebook, Yi, and Mistral AI, releasing their own open-source versions. In this blog post, we will find out the capability of open-source LLMs for parsing HTML. We have already experimented with OpenAI's GPT-4 and it is impressive in its parsing capability. For a more in-depth understanding of this, we recommend reading our detailed blog post on the subject.
In our trials, Mistral 7B has emerged as particularly impressive, especially considering its size relative to other LLMs. Based on these findings, we have decided to use it for our upcoming test run.
Prompt:
I did numerous rounds of refactoring of the prompt, especially the User
part. It might not be perfect, but it is the closest I can get for now. Let me know if you have a suggestion!
System:
You are an expert in Web Scraping, so you are capable to find the information in HTML and label them accordingly. Please return the final result in JSON.
User:
Data to scrape:
title: Name of the business
type: The business nature like Cafe, Coffee Shop, many others
phone: The phone number of the business
address: Address of the business, can be a state, country or a full address
years_in_business: Number of years since the business started
hours: Business operating hours
rating: Rating of the business
reviews: Number of reviews on the business
price: Typical spending on the business
description: Extra information that is not mentioned yet in any of the data
service_options: Array of shopping options from the business, for example, in store shopping, delivery and many others. It should be in format -> option_name: true
is_operating: Whether the business is operating
HTML:
{html}
html
is supposed to be the local result section HTML in Google page, which I have to write few lines of code to extract.
This HTML represents the above local result and this is exactly what will be passed into the model.
<div jscontroller="AtSb" class="w7Dbne CR1S4b" data-record-click-time="false" id="tsuid_48" jsdata="zt2wNd;_;A+m88g WDO8Ff;_;A+m88s" jsaction="rcuQ6b:npT2md;e3EWke:kN9HDb" data-hveid="CC4QAA">
<div jsname="jXK9ad" class="uMdZh tIxNaf rllt__borderless" jsaction="mouseover:UI3Kjd;mouseleave:Tx5Rb;focusin:UI3Kjd;focusout:Tx5Rb">
<div class="VkpGBb">
<div class="cXedhc">
<a class="vwVdIc wzN8Ac rllt__link a-no-hover-decoration" jsname="kj0dLd" data-cid="11265938073076301333" jsaction="click:h5M12e;" role="link" tabindex="0" data-ved="2ahUKEwjC7OW8romDAxXwFVkFHbmLAEoQ1YkKegQILhAB">
<div>
<div class="rllt__details">
<div class="dbg0pd" aria-level="3" role="heading"><span class="OSrXXb">Houndstooth Coffee</span></div>
<div><span><span class="Y0A0hc"><span class="yi40Hd YrbPuc" aria-hidden="true">4.6</span><span class="z3HNkc fUNJzc" aria-label="Rated 4.6 out of 5," role="img"><span style="width:12px"></span></span><span class="RDApEe YrbPuc">(922)</span></span></span> · <span aria-label="Moderately expensive" role="img">$$</span> · Coffee shop</div>
<div>401 Congress Ave. #100c</div>
<div><span><span style="color:rgba(217,48,37,1.0)">Closed</span> ⋅ Opens 6:30 AM</span></div>
<div class="pJ3Ci"><span>Cozy hangout for carefully sourced brews</span></div>
</div>
</div>
</a>
<a class="uQ4NLd b9tNq wzN8Ac rllt__link a-no-hover-decoration" aria-hidden="true" tabindex="-1" jsname="kj0dLd" data-cid="11265938073076301333" jsaction="click:h5M12e;" role="link" data-ved="2ahUKEwjC7OW8romDAxXwFVkFHbmLAEoQ1YkKegQILhAP">
<g-img class="gTrj3e"><img id="pimg_2" src="https://lh5.googleusercontent.com/p/AF1QipNRZ1ehiInk8CTrHCD08GnhcnF7e4q-1H8Qs8mG=w114-h114-n-k-no" class="YQ4gaf zr758c wA1Bge" height="114" width="114" alt="" data-csiid="2" data-atf="1"></g-img>
</a>
</div>
</div>
</div>
</div>
P.S. The OpenAI-like API is possible because of LiteLLM. Also the deployment of open-source LLMs is effortless with ollama.
Specs
Traditional parser was running on my local which is a MacBook Air M2 with 16GB of RAM. While Mistral 7B was running on NVIDIA RTX4000 (8 CPU, 30GB RAM). Both are running locally without the need for network connection.
The Comparison
Query: Dentist
If you would like to inspect the data and HTML more in-depth, feel free to check out the playground.
Traditional | Mistral 7B |
---|---|
Processed in 0.37s
|
Processed in 5.90s
|
Mistral 7B nailed it. However, a notable challenge when working with Large Language Models (LLMs) is their tendency to generate unwanted text or comments. We have to update the prompt but we just want to test the correctness of the data, so it is good for now.
Traditional | Mistral 7B |
---|---|
Processed in 0.37s
|
Processed in 3.53s
|
Traditional | Mistral 7B |
---|---|
Processed in 0.37s
|
Processed in 6.40s
|
The rating
output by Mistral 7B is 4.9 out of 5
, it would be perfect if it is just 4.9
, but it is something that can be easily done in post-processing, so it is still good.
Query: Bookstore (Playground)
Traditional | Mistral 7B |
---|---|
Processed in 0.52s
|
Processed in 3.04s
|
Mistral 7B made a mistake on the type
and hours
. Other than that, it is all good.
Traditional | Mistral 7B |
---|---|
Processed in 0.52s
|
Processed in 3.89s
|
We have a new attribute service_options
. It wasn't in the prompt initially, so I added service_options: Array of shopping options from the business, for example, in store shopping, delivery and many others
to the prompt. With that, it produced results similar to our API, pretty impressive. However, it hallucinates on the description
.
Traditional | Mistral 7B |
---|---|
Processed in 0.52s
|
Processed in 3.12s
|
The results from Mistral 7B omitted the opening hours, a detail we successfully extracted from the previous HTML. I'm considering updating the prompt to rectify this, but for now, I've decided to leave it as is. This decision aims to showcase the inconsistencies inherent in Large Language Models.
Query: Mexican Restaurant (Playground)
Traditional | Mistral 7B |
---|---|
Processed in 0.58s
|
Processed in 3.75s
|
Mistral 7B is very impressive, producing the service_options
structure similar to our API. However, the type
is not correct, it should be Mexican
. It does produce correct type
in the next example.
Traditional | Mistral 7B |
---|---|
Processed in 0.58s
|
Processed in 3.67s
|
The type
is spot on. This time it hallucinates on the service_options
.
Traditional | Mistral 7B |
---|---|
Processed in 0.58s
|
Processed in 3.62s
|
Similar to previous example, it hallucinates on the service_options
.
Query: Insurance Agency (Playground)
A little tricky for this example. If you can notice, this example includes Website and Directions buttons. If I have to simply give the HTML to the model, the output is totally unusable.
Sample output:
The given JSON data is a local search result for a restaurant. Here's the relevant information:
* The name of the restaurant is "Bangers & Lace".
* The address is "11507 Circle Back Rd, Suite 218, Austin, TX 78724".
* The phone number is "512-631-9380".
* The opening hours are Monday to Friday from 11:00 AM to 10:00 PM and Saturday and Sunday from 12:00 PM to 10:00 PM.
* The website URL is "<https://www.bangersandlaceaustin.com>".
* There are no user reviews available for this restaurant.
It doesn't follow the instruction to return the output in JSON and more than that it hallucinates on the entire data. I have thought about it and it is probably because of token length, it is too large. This happens because Directions
button in HTML contains a large portion of Google Maps' data.
Here is the initial HTML:
<div class="w7Dbne CR1S4b " data-hveid="CCQQAA" data-record-click-time="true" id="tsuid_32" jsaction="rcuQ6b:npT2md;e3EWke:kN9HDb" jscontroller="AtSb" jsdata="zt2wNd;_;CgGqWk WDO8Ff;_;CgGqW0"><div class="uMdZh tIxNaf rllt__borderless" jsaction="mouseover:UI3Kjd;mouseleave:Tx5Rb;focusin:UI3Kjd;focusout:Tx5Rb" jsname="jXK9ad"><div class="VkpGBb"><div class="cXedhc"><a class="vwVdIc wzN8Ac rllt__link a-no-hover-decoration" data-cid="16496161963606301439" data-ved="2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQvS56BAgkEAE" jsaction="click:h5M12e;" jsname="kj0dLd" role="link" tabindex="0"><div><div class="rllt__details"><div aria-level="3" class="dbg0pd " role="heading"><span class="OSrXXb">Michaels Insurance Group</span></div><div><span><span>No reviews</span></span> · Insurance agency</div><div>3+ years in business · 501 Congress Ave. Suite 150A · (512) 593-1171</div><div class=""><span><span style="color:rgba(217,48,37,1.0)">Closed</span> ⋅ Opens 9 AM Fri</span></div><div class="dXnVAb "><span class="BI0Dve "><span><span aria-label="Onsite services">Onsite services</span></span></span></div></div></div></a></div><a aria-describedby="tsuid_32" class="yYlJEf Q7PwXb L48Cpd brKmxb " data-ved="2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQgU96BAgkEAk" href="https://www.michaelsinsurancegroup.com/" lang="en-US" ping="/url?sa=t&source=web&rct=j&opi=89978449&url=https://www.michaelsinsurancegroup.com/&ved=2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQgU96BAgkEAk"><div class="wLAgVc"><div class="VDgVie k0Jjg fCrZyc NQYJvc FR7ZSc OJeuxf PrjL8c"><div class="niO4u VDgVie SlP8xc iCQO5d"><div class="kHtcsd"><span class="d3o3Ad z1asCe GYDk8c"><svg focusable="false" viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-1 17.93c-3.95-.49-7-3.85-7-7.93 0-.62.08-1.21.21-1.79L9 15v1c0 1.1.9 2 2 2v1.93zm6.9-2.54c-.26-.81-1-1.39-1.9-1.39h-1v-3c0-.55-.45-1-1-1H8v-2h2c.55 0 1-.45 1-1V7h2c1.1 0 2-.9 2-2v-.41c2.93 1.19 5 4.06 5 7.41 0 2.08-.8 3.97-2.1 5.39z"></path></svg></span></div></div><div class="QuU3Wb sjVJQd"><div>Website</div></div></div></div></a><a aria-describedby="tsuid_32" class="yYlJEf VByer Q7PwXb VDgVie brKmxb " data-url="/maps/dir//Michaels+Insurance+Group,+501+Congress+Ave.+Suite+150A,+Austin,+TX+78701/data=!4m6!4m5!1m1!4e2!1m2!1m1!1s0x8644b5d32f23bb83:0xe4ee23e7ce105aff?sa=X&ved=2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQ48ADegQIIRAA&hl=en&gl=us" href="/maps/dir//Michaels+Insurance+Group,+501+Congress+Ave.+Suite+150A,+Austin,+TX+78701/data=!4m6!4m5!1m1!4e2!1m2!1m1!1s0x8644b5d32f23bb83:0xe4ee23e7ce105aff?sa=X&ved=2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQ48ADegQIIRAA&hl=en&gl=us" ping="/url?sa=t&source=web&rct=j&opi=89978449&url=/maps/dir//Michaels%2BInsurance%2BGroup,%2B501%2BCongress%2BAve.%2BSuite%2B150A,%2BAustin,%2BTX%2B78701/data%3D!4m6!4m5!1m1!4e2!1m2!1m1!1s0x8644b5d32f23bb83:0xe4ee23e7ce105aff%3Fsa%3DX%26ved%3D2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQ48ADegQIIRAA%26hl%3Den%26gl%3Dus&ved=2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQ48ADegQIJBAL" style="cursor:pointer" tabindex="0"><div><div class="VDgVie k0Jjg fCrZyc NQYJvc FR7ZSc OJeuxf PrjL8c"><div class="niO4u VDgVie SlP8xc iCQO5d"><div class="kHtcsd"><span class="d3o3Ad TU05kd"></span></div></div><div class="QuU3Wb sjVJQd"><div lang="en-US">Directions</div></div></div></div></a></div></div></div>
As Website and Directions is not part of the test this round, I decided to discard the HTML.
Truncated HTML:
<div class="w7Dbne CR1S4b" data-hveid="CCQQAA" data-record-click-time="true" id="tsuid_32" jsaction="rcuQ6b:npT2md;e3EWke:kN9HDb" jscontroller="AtSb" jsdata="zt2wNd;_;CgGqWk WDO8Ff;_;CgGqW0"><div class="uMdZh tIxNaf rllt__borderless" jsaction="mouseover:UI3Kjd;mouseleave:Tx5Rb;focusin:UI3Kjd;focusout:Tx5Rb" jsname="jXK9ad"><div class="VkpGBb"><div class="cXedhc"><a class="vwVdIc wzN8Ac rllt__link a-no-hover-decoration" data-cid="16496161963606301439" data-ved="2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQvS56BAgkEAE" jsaction="click:h5M12e;" jsname="kj0dLd" role="link" tabindex="0"><div><div class="rllt__details"><div aria-level="3" class="dbg0pd" role="heading"><span class="OSrXXb">Michaels Insurance Group</span></div><div><span><span>No reviews</span></span> · Insurance agency</div><div>3+ years in business · 501 Congress Ave. Suite 150A · (512) 593-1171</div><div><span><span style="color:rgba(217,48,37,1.0)">Closed</span> ⋅ Opens 9 AM Fri</span></div><div class="dXnVAb"><span class="BI0Dve"><span><span aria-label="Onsite services">Onsite services</span></span></span></div></div></div></a></div><a aria-describedby="tsuid_32" class="yYlJEf Q7PwXb L48Cpd brKmxb" data-ved="2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQgU96BAgkEAk" href="https://www.michaelsinsurancegroup.com/" lang="en-US" ping="/url?sa=t&source=web&rct=j&opi=89978449&url=https://www.michaelsinsurancegroup.com/&ved=2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQgU96BAgkEAk"><div class="wLAgVc"><div class="VDgVie k0Jjg fCrZyc NQYJvc FR7ZSc OJeuxf PrjL8c"><div class="niO4u VDgVie SlP8xc iCQO5d"><div class="kHtcsd"><span class="d3o3Ad z1asCe GYDk8c"><svg focusable="false" viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-1 17.93c-3.95-.49-7-3.85-7-7.93 0-.62.08-1.21.21-1.79L9 15v1c0 1.1.9 2 2 2v1.93zm6.9-2.54c-.26-.81-1-1.39-1.9-1.39h-1v-3c0-.55-.45-1-1-1H8v-2h2c.55 0 1-.45 1-1V7h2c1.1 0 2-.9 2-2v-.41c2.93 1.19 5 4.06 5 7.41 0 2.08-.8 3.97-2.1 5.39z"></path></svg></span></div></div><div class="QuU3Wb sjVJQd"><div>Website</div></div></div></div></a><a aria-d
And it works!
Traditional | Mistral 7B |
---|---|
Processed in 0.52s
|
Processed in 4.49s
|
A new attribute is present years_in_business
. The prompt for that is years_in_business: Number of years since the business started
. However, the address
start to fail.
Traditional | Mistral 7B |
---|---|
Processed in 0.52s
|
Processed in 7.28s
|
This time the address
is parsed correctly with the present of years_in_business
. However, the type
is missing.
Traditional | Mistral 7B |
---|---|
Processed in 0.52s
|
Processed in 5.42s
|
It hallucinates on the service_options
and description
.
Query: Yoga Studio (Playground)
Traditional | Mistral 7B |
---|---|
Processed in 0.39s
|
Processed in 7.51s
|
It has a mistake on type
, address
and price
. years_in_business
could be considered correct.
Traditional | Mistral 7B |
---|---|
Processed in 0.39s
|
Processed in 7.73s
|
It would be perfect if address
and years_in_business
is switched.
Traditional | Mistral 7B |
---|---|
Processed in 0.39s
|
Processed in 4.75s
|
The same mistake on address
and price
.
Conclusion
Compared to the Machine learning approach, working with Large Language Models is easier. However, the output can be inconsistent and there is no deterministic way to improve it. I believe we can modify the prompt to make it perform better but there will always be edge cases that are not covered. Nevertheless, the fact that it can do the parsing with a few lines of prompt is already impressive. The traditional parser has way more lines of code and conditions to take care to perform this well. It might not ready for production, but I am still looking forward to integrate Large Language Models into our workflow.
Last but not least, Mistral 7B is definitely impressive considering its size, I can't wait to test again with a more capable model that will come next from the Mistral AI team, which at this time, MIXTRAL 8X7B just launched.