At SerpApi, we are always looking for methods to improve our web scraping parser to better serve our customers. Previously, we have experimented with Machine Learning for parsing and make a comparison against our traditional parser. This time, we're turning our attention to Large Language Models (LLMs) to evaluate their performance. The surge in popularity of Large Language Models has led to numerous organizations, including Facebook, Yi, and Mistral AI, releasing their own open-source versions. In this blog post, we will find out the capability of open-source LLMs for parsing HTML. We have already experimented with OpenAI's GPT-4 and it is impressive in its parsing capability. For a more in-depth understanding of this, we recommend reading our detailed blog post on the subject.

In our trials, Mistral 7B has emerged as particularly impressive, especially considering its size relative to other LLMs. Based on these findings, we have decided to use it for our upcoming test run.

Prompt:

I did numerous rounds of refactoring of the prompt, especially the User part. It might not be perfect, but it is the closest I can get for now. Let me know if you have a suggestion!

System:

You are an expert in Web Scraping, so you are capable to find the information in HTML and label them accordingly. Please return the final result in JSON.

User:

Data to scrape: 
title: Name of the business
type: The business nature like Cafe, Coffee Shop, many others
phone: The phone number of the business
address: Address of the business, can be a state, country or a full address
years_in_business: Number of years since the business started
hours: Business operating hours
rating: Rating of the business
reviews: Number of reviews on the business
price: Typical spending on the business
description: Extra information that is not mentioned yet in any of the data
service_options: Array of shopping options from the business, for example, in store shopping, delivery and many others. It should be in format -> option_name: true
is_operating: Whether the business is operating

HTML: 
{html}

html is supposed to be the local result section HTML in Google page, which I have to write few lines of code to extract.

This HTML represents the above local result and this is exactly what will be passed into the model.

<div jscontroller="AtSb" class="w7Dbne CR1S4b" data-record-click-time="false" id="tsuid_48" jsdata="zt2wNd;_;A+m88g WDO8Ff;_;A+m88s" jsaction="rcuQ6b:npT2md;e3EWke:kN9HDb" data-hveid="CC4QAA">
  <div jsname="jXK9ad" class="uMdZh tIxNaf rllt__borderless" jsaction="mouseover:UI3Kjd;mouseleave:Tx5Rb;focusin:UI3Kjd;focusout:Tx5Rb">
    <div class="VkpGBb">
      <div class="cXedhc">
        <a class="vwVdIc wzN8Ac rllt__link a-no-hover-decoration" jsname="kj0dLd" data-cid="11265938073076301333" jsaction="click:h5M12e;" role="link" tabindex="0" data-ved="2ahUKEwjC7OW8romDAxXwFVkFHbmLAEoQ1YkKegQILhAB">
          <div>
            <div class="rllt__details">
              <div class="dbg0pd" aria-level="3" role="heading"><span class="OSrXXb">Houndstooth Coffee</span></div>
              <div><span><span class="Y0A0hc"><span class="yi40Hd YrbPuc" aria-hidden="true">4.6</span><span class="z3HNkc fUNJzc" aria-label="Rated 4.6 out of 5," role="img"><span style="width:12px"></span></span><span class="RDApEe YrbPuc">(922)</span></span></span> · <span aria-label="Moderately expensive" role="img">$$</span> · Coffee shop</div>
              <div>401 Congress Ave. #100c</div>
              <div><span><span style="color:rgba(217,48,37,1.0)">Closed</span> ⋅ Opens 6:30 AM</span></div>
              <div class="pJ3Ci"><span>Cozy hangout for carefully sourced brews</span></div>
            </div>
          </div>
        </a>
        <a class="uQ4NLd b9tNq wzN8Ac rllt__link a-no-hover-decoration" aria-hidden="true" tabindex="-1" jsname="kj0dLd" data-cid="11265938073076301333" jsaction="click:h5M12e;" role="link" data-ved="2ahUKEwjC7OW8romDAxXwFVkFHbmLAEoQ1YkKegQILhAP">
          <g-img class="gTrj3e"><img id="pimg_2" src="https://lh5.googleusercontent.com/p/AF1QipNRZ1ehiInk8CTrHCD08GnhcnF7e4q-1H8Qs8mG=w114-h114-n-k-no" class="YQ4gaf zr758c wA1Bge" height="114" width="114" alt="" data-csiid="2" data-atf="1"></g-img>
        </a>
      </div>
    </div>
  </div>
</div>

P.S. The OpenAI-like API is possible because of LiteLLM. Also the deployment of open-source LLMs is effortless with ollama.

Specs

Traditional parser was running on my local which is a MacBook Air M2 with 16GB of RAM. While Mistral 7B was running on NVIDIA RTX4000 (8 CPU, 30GB RAM). Both are running locally without the need for network connection.

The Comparison

Query: Dentist

If you would like to inspect the data and HTML more in-depth, feel free to check out the playground.

Query: Dentist (1st Result)
TraditionalMistral 7B
        {
          "title": "Dental Resource Management",
          "type": "Dentist",
          "address": "Austin, TX",
          "hours": "Closed ⋅ Opens 6 AM Fri"
          "reviews_original": "No reviews"
        }
      
Processed in 0.37s
        {
          "title": "Dental Resource Management",
          "type": "Dentist",
          "address": "Austin, TX",
          "hours": "Closed ⋅ Opens 6 AM Fri",
          "rating": null, // Not available in this HTML code snippet
          "reviews": null, // Not available in this HTML code snippet
          "price": null, // Not available in this HTML code snippet
          "description": null // Not available in this HTML code snippet
        }
      
Processed in 5.90s

Mistral 7B nailed it. However, a notable challenge when working with Large Language Models (LLMs) is their tendency to generate unwanted text or comments. We have to update the prompt but we just want to test the correctness of the data, so it is good for now.

Query: Dentist (2nd Result)
TraditionalMistral 7B
        {
            "title": "Swish Dental Downtown",
            "type": "Dentist",
            "address": "201 W 5th St Ste 175",
            "hours": "Closed ⋅ Opens 8 AM Fri"
            "rating": 4.7,
            "reviews_original": "(306)",
            "reviews": 306,
            "description": "\"Wonderful dentist office!\""
        }
      
Processed in 0.37s
        {
            "title": "Swish Dental Downtown",
            "type": "Dentist",
            "address": "201 W 5th St Ste 175",
            "hours": "Closed ⋅ Opens 8 AM Fri",
            "rating": "4.7",
            "reviews": "(306)",
            "price": null,
            "description": "Wonderful dentist office!"
        }
      
Processed in 3.53s
Query: Dentist (3rd Result)
TraditionalMistral 7B
        {
          "title": "Austin Cosmetic Dentistry by Tejas Patel, DDS",
          "type": "Cosmetic dentist",
          "address": "221 W 6th St #940",
          "hours": "Closed ⋅ Opens 8 AM Mon",
          "rating": 4.9,
          "reviews_original": "(275)",
          "reviews": 275,
          "description": "Has online care"
        }
      
Processed in 0.37s
        {
           "title": "Austin Cosmetic Dentistry by Tejas Patel, DDS",
           "type": "Cosmetic dentist",
           "address": "221 W 6th St #940",
           "hours": "Closed ⋅ Opens 8 AM Mon",
           "rating": "4.9 out of 5",
           "reviews": "(275)",
           "price": "",
           "description": "Has online care"
       }
      
Processed in 6.40s

The rating output by Mistral 7B is 4.9 out of 5, it would be perfect if it is just 4.9, but it is something that can be easily done in post-processing, so it is still good.


Query: Bookstore (Playground)

Query: Bookstore (1st Result)
TraditionalMistral 7B
        {
          "title": "BookPeople",
          "type": "Book store",
          "phone": "(512) 472-5050",
          "address": "603 N Lamar Blvd",
          "hours": "Closed ⋅ Opens 9 AM Fri",
          "rating": 4.7,
          "reviews_original": "(3.4K)",
          "reviews": 3400,
          "description": "Independent bookseller with local flavor",
        }
      
Processed in 0.52s
        {
          "title": "BookPeople",
          "type": "Independent Bookstore",
          "phone": "(512) 472-5050",
          "address": "603 N Lamar Blvd",
          "hours": "Closed",
          "rating": "4.7",
          "reviews": "(3.4K)",
          "price": null,
          "description": "Independent bookseller with local flavor"
        }
      
Processed in 3.04s

Mistral 7B made a mistake on the type and hours. Other than that, it is all good.

Query: Bookstore (2nd Result)
TraditionalMistral 7B
        {
          "title": "Wallis Booksellers",
          "type": "Book store",
          "address": "The Alden at Cedar Park, 801 C-Bar Ranch Trl, Cedar Park, No. 1008, 801 Barton Springs Rd",
          "reviews_original": "No reviews",
          "service_options": {
            "in_store_shopping": true
          }
        }
      
Processed in 0.52s
        {
          "title": "Wallis Booksellers",
          "type": "Book store",
          "phone": "",
          "address": "The Alden at Cedar Park, 801 C-Bar Ranch Trl, Cedar Park, No. 1008, 801 Barton Springs Rd",
          "hours": "",
          "rating": "",
          "reviews": "No reviews",
          "price": "",
          "description": "Wallis Booksellers is a book store located in Cedar Park, Texas.",
          "service_options": [
            {
              "in_store_shopping": true
            }
          ]
        }
      
Processed in 3.89s

We have a new attribute service_options. It wasn't in the prompt initially, so I added service_options: Array of shopping options from the business, for example, in store shopping, delivery and many others to the prompt. With that, it produced results similar to our API, pretty impressive. However, it hallucinates on the description.

Query: Bookstore (3rd Result)
TraditionalMistral 7B
        {
          "title": "Half Price Books",
          "type": "Book store",
          "phone": "(512) 443-3138",
          "address": "2929 S Lamar Blvd",
          "hours": "Closed ⋅ Opens 10 AM Fri",
          "rating": 4.7,
          "reviews_original": "(2K)",
          "reviews": 2000,
          "description": "Chain for used books & entertainment"
        }
      
Processed in 0.52s
        {
          "title": "Half Price Books",
          "type": "Book store",
          "phone": "(512) 443-3138",
          "address": "2929 S Lamar Blvd",
          "hours": "Closed",
          "rating": "4.7",
          "reviews": "2K",
          "price": null,
          "description": "Chain for used books & entertainment",
          "service_options": []
        }
      
Processed in 3.12s

The results from Mistral 7B omitted the opening hours, a detail we successfully extracted from the previous HTML. I'm considering updating the prompt to rectify this, but for now, I've decided to leave it as is. This decision aims to showcase the inconsistencies inherent in Large Language Models.


Query: Mexican Restaurant (Playground)

Query: Mexican Restaurant (1st Result)
TraditionalMistral 7B
        {
          "title": "Velvet Taco",
          "type": "Mexican",
          "address": "522 Congress Ave. Suite 100",
          "rating": 4.6,
          "reviews_original": "(1.4K)",
          "reviews": 1400,
          "price": "$$",
          "service_options": {
            "dine_in": true,
            "takeout": true,
            "no_contact_delivery": true
          }
        }
      
Processed in 0.58s
        {
          "title": "Velvet Taco",
          "type": "Cafe",
          "phone": "",
          "address": "522 Congress Ave. Suite 100",
          "hours": "",
          "rating": "4.6",
          "reviews": "(1.4K)",
          "price": "$$",
          "description": "",
          "service_options": {
            "dine_in": true,
            "takeout": true,
            "no_contact_delivery": true
          }
       }
      
Processed in 3.75s

Mistral 7B is very impressive, producing the service_options structure similar to our API. However, the type is not correct, it should be Mexican. It does produce correct type in the next example.

Query: Mexican Restaurant (2nd Result)
TraditionalMistral 7B
        {
          "title": "Taquero Mucho",
          "type": "Mexican",
          "address": "508 West Ave",
          "hours": "Closes soon ⋅ 11 PM",
          "rating": 4,
          "reviews_original": "(1.7K)",
          "reviews": 1700,
          "description": "Trendy Mexican eatery with cocktails"
        }
      
Processed in 0.58s
        {
          "title": "Taquero Mucho",
          "type": "Mexican",
          "phone": null,
          "address": "508 West Ave",
          "hours": "Closes soon ⋅ 11 PM",
          "rating": "4.0",
          "reviews": "(1.7K)",
          "price": null,
          "description": "Trendy Mexican eatery with cocktails",
          "service_options": {
            "in store shopping": true,
            "delivery": false
          }
        }
      
Processed in 3.67s

The type is spot on. This time it hallucinates on the service_options.

Query: Mexican Restaurant (3rd Result)
TraditionalMistral 7B
        {
          "title": "Gabrielas Downtown",
          "type": "Mexican",
          "address": "900 E 7th St",
          "rating": 4.1,
          "reviews_original": "(1.6K)",
          "reviews": 1600,
          "price": "$$",
          "description": "Modern Mexican cuisine & cocktails"
        }
      
Processed in 0.58s
        {
          "title": "Gabrielas Downtown",
          "type": "Mexican",
          "phone": "",
          "address": "900 E 7th St",
          "hours": "",
          "rating": "4.1",
          "reviews": "(1.6K)",
          "price": "$$",
          "description": "Modern Mexican cuisine & cocktails",
          "service_options": {
            "in_store_shopping": true,
            "delivery": true
          }
        }
      
Processed in 3.62s

Similar to previous example, it hallucinates on the service_options.


Query: Insurance Agency (Playground)

Query: Insurance Agency (1st Result)

A little tricky for this example. If you can notice, this example includes Website and Directions buttons. If I have to simply give the HTML to the model, the output is totally unusable.

Sample output:

The given JSON data is a local search result for a restaurant. Here's the relevant information:

* The name of the restaurant is "Bangers &amp; Lace".
* The address is "11507 Circle Back Rd, Suite 218, Austin, TX 78724".
* The phone number is "512-631-9380".
* The opening hours are Monday to Friday from 11:00 AM to 10:00 PM and Saturday and Sunday from 12:00 PM to 10:00 PM.
* The website URL is "<https://www.bangersandlaceaustin.com>".
* There are no user reviews available for this restaurant.

It doesn't follow the instruction to return the output in JSON and more than that it hallucinates on the entire data. I have thought about it and it is probably because of token length, it is too large. This happens because Directions button in HTML contains a large portion of Google Maps' data.

Here is the initial HTML:

<div class="w7Dbne CR1S4b " data-hveid="CCQQAA" data-record-click-time="true" id="tsuid_32" jsaction="rcuQ6b:npT2md;e3EWke:kN9HDb" jscontroller="AtSb" jsdata="zt2wNd;_;CgGqWk WDO8Ff;_;CgGqW0"><div class="uMdZh tIxNaf rllt__borderless" jsaction="mouseover:UI3Kjd;mouseleave:Tx5Rb;focusin:UI3Kjd;focusout:Tx5Rb" jsname="jXK9ad"><div class="VkpGBb"><div class="cXedhc"><a class="vwVdIc wzN8Ac rllt__link a-no-hover-decoration" data-cid="16496161963606301439" data-ved="2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQvS56BAgkEAE" jsaction="click:h5M12e;" jsname="kj0dLd" role="link" tabindex="0"><div><div class="rllt__details"><div aria-level="3" class="dbg0pd " role="heading"><span class="OSrXXb">Michaels Insurance Group</span></div><div><span><span>No reviews</span></span> · Insurance agency</div><div>3+ years in business · 501 Congress Ave. Suite 150A · (512) 593-1171</div><div class=""><span><span style="color:rgba(217,48,37,1.0)">Closed</span> ⋅ Opens 9 AM Fri</span></div><div class="dXnVAb "><span class="BI0Dve "><span><span aria-label="Onsite services">Onsite services</span></span></span></div></div></div></a></div><a aria-describedby="tsuid_32" class="yYlJEf Q7PwXb L48Cpd brKmxb " data-ved="2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQgU96BAgkEAk" href="https://www.michaelsinsurancegroup.com/" lang="en-US" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;opi=89978449&amp;url=https://www.michaelsinsurancegroup.com/&amp;ved=2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQgU96BAgkEAk"><div class="wLAgVc"><div class="VDgVie k0Jjg fCrZyc NQYJvc FR7ZSc OJeuxf PrjL8c"><div class="niO4u VDgVie SlP8xc iCQO5d"><div class="kHtcsd"><span class="d3o3Ad z1asCe GYDk8c"><svg focusable="false" viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-1 17.93c-3.95-.49-7-3.85-7-7.93 0-.62.08-1.21.21-1.79L9 15v1c0 1.1.9 2 2 2v1.93zm6.9-2.54c-.26-.81-1-1.39-1.9-1.39h-1v-3c0-.55-.45-1-1-1H8v-2h2c.55 0 1-.45 1-1V7h2c1.1 0 2-.9 2-2v-.41c2.93 1.19 5 4.06 5 7.41 0 2.08-.8 3.97-2.1 5.39z"></path></svg></span></div></div><div class="QuU3Wb sjVJQd"><div>Website</div></div></div></div></a><a aria-describedby="tsuid_32" class="yYlJEf VByer Q7PwXb VDgVie brKmxb " data-url="/maps/dir//Michaels+Insurance+Group,+501+Congress+Ave.+Suite+150A,+Austin,+TX+78701/data=!4m6!4m5!1m1!4e2!1m2!1m1!1s0x8644b5d32f23bb83:0xe4ee23e7ce105aff?sa=X&amp;ved=2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQ48ADegQIIRAA&amp;hl=en&amp;gl=us" href="/maps/dir//Michaels+Insurance+Group,+501+Congress+Ave.+Suite+150A,+Austin,+TX+78701/data=!4m6!4m5!1m1!4e2!1m2!1m1!1s0x8644b5d32f23bb83:0xe4ee23e7ce105aff?sa=X&amp;ved=2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQ48ADegQIIRAA&amp;hl=en&amp;gl=us" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;opi=89978449&amp;url=/maps/dir//Michaels%2BInsurance%2BGroup,%2B501%2BCongress%2BAve.%2BSuite%2B150A,%2BAustin,%2BTX%2B78701/data%3D!4m6!4m5!1m1!4e2!1m2!1m1!1s0x8644b5d32f23bb83:0xe4ee23e7ce105aff%3Fsa%3DX%26ved%3D2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQ48ADegQIIRAA%26hl%3Den%26gl%3Dus&amp;ved=2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQ48ADegQIJBAL" style="cursor:pointer" tabindex="0"><div><div class="VDgVie k0Jjg fCrZyc NQYJvc FR7ZSc OJeuxf PrjL8c"><div class="niO4u VDgVie SlP8xc iCQO5d"><div class="kHtcsd"><span class="d3o3Ad TU05kd"></span></div></div><div class="QuU3Wb sjVJQd"><div lang="en-US">Directions</div></div></div></div></a></div></div></div>

As Website and Directions is not part of the test this round, I decided to discard the HTML.

Truncated HTML:

<div class="w7Dbne CR1S4b" data-hveid="CCQQAA" data-record-click-time="true" id="tsuid_32" jsaction="rcuQ6b:npT2md;e3EWke:kN9HDb" jscontroller="AtSb" jsdata="zt2wNd;_;CgGqWk WDO8Ff;_;CgGqW0"><div class="uMdZh tIxNaf rllt__borderless" jsaction="mouseover:UI3Kjd;mouseleave:Tx5Rb;focusin:UI3Kjd;focusout:Tx5Rb" jsname="jXK9ad"><div class="VkpGBb"><div class="cXedhc"><a class="vwVdIc wzN8Ac rllt__link a-no-hover-decoration" data-cid="16496161963606301439" data-ved="2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQvS56BAgkEAE" jsaction="click:h5M12e;" jsname="kj0dLd" role="link" tabindex="0"><div><div class="rllt__details"><div aria-level="3" class="dbg0pd" role="heading"><span class="OSrXXb">Michaels Insurance Group</span></div><div><span><span>No reviews</span></span> · Insurance agency</div><div>3+ years in business · 501 Congress Ave. Suite 150A · (512) 593-1171</div><div><span><span style="color:rgba(217,48,37,1.0)">Closed</span> ⋅ Opens 9 AM Fri</span></div><div class="dXnVAb"><span class="BI0Dve"><span><span aria-label="Onsite services">Onsite services</span></span></span></div></div></div></a></div><a aria-describedby="tsuid_32" class="yYlJEf Q7PwXb L48Cpd brKmxb" data-ved="2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQgU96BAgkEAk" href="https://www.michaelsinsurancegroup.com/" lang="en-US" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;opi=89978449&amp;url=https://www.michaelsinsurancegroup.com/&amp;ved=2ahUKEwiNo_2yhv-CAxXcF1kFHZSWAFMQgU96BAgkEAk"><div class="wLAgVc"><div class="VDgVie k0Jjg fCrZyc NQYJvc FR7ZSc OJeuxf PrjL8c"><div class="niO4u VDgVie SlP8xc iCQO5d"><div class="kHtcsd"><span class="d3o3Ad z1asCe GYDk8c"><svg focusable="false" viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-1 17.93c-3.95-.49-7-3.85-7-7.93 0-.62.08-1.21.21-1.79L9 15v1c0 1.1.9 2 2 2v1.93zm6.9-2.54c-.26-.81-1-1.39-1.9-1.39h-1v-3c0-.55-.45-1-1-1H8v-2h2c.55 0 1-.45 1-1V7h2c1.1 0 2-.9 2-2v-.41c2.93 1.19 5 4.06 5 7.41 0 2.08-.8 3.97-2.1 5.39z"></path></svg></span></div></div><div class="QuU3Wb sjVJQd"><div>Website</div></div></div></div></a><a aria-d

And it works!

TraditionalMistral 7B
        {
          "title": "Michaels Insurance Group",
          "type": "Insurance agency",
          "phone": "(512) 593-1171",
          "address": "501 Congress Ave. Suite 150A",
          "years_in_business": "3+ years in business",
          "hours": "Closed ⋅ Opens 9 AM Fri",
          "reviews_original": "No reviews",
          "service_options": {
            "onsite_services": true
          }
        }
      
Processed in 0.52s
        {
          "title": "Michaels Insurance Group",
          "type": "Insurance agency",
          "phone": "(512) 593-1171",
          "address": "3+ years in business · 501 Congress Ave. Suite 150A",
          "years_in_business": "3+ years in business",
          "hours": "Closed · Opens 9 AM Fri",
          "rating": "",
          "reviews": "No reviews",
          "price": "",
          "description": "Onsite services",
          "service_options": {
            "onsite_services": true
          }
        }
      
Processed in 4.49s

A new attribute is present years_in_business. The prompt for that is years_in_business: Number of years since the business started. However, the address start to fail.

Query: Insurance Agency (2nd Result)
TraditionalMistral 7B
        {
          "title": "advisor brokerage services",
          "type": "Insurance agency",
          "phone": "(512) 481-2776",
          "address": "515 Congress Ave.",
          "years_in_business": "5+ years in business",
          "reviews_original": "No reviews"
        }
      
Processed in 0.52s
        {
          "title": "advisor brokerage services",
          "type": "",
          "phone": "(512) 481-2776",
          "address": "515 Congress Ave.",
          "years_in_business": "5+ years in business",
          "hours": "",
          "rating": "",
          "reviews": "No reviews",
          "price": "",
          "description": "",
          "service_options": {
            "in store shopping": true
          }
        }
      
Processed in 7.28s

This time the address is parsed correctly with the present of years_in_business. However, the type is missing.

Query: Insurance Agency (3rd Result)
TraditionalMistral 7B
        {
          "title": "The Keane Insurance Group, Inc",
          "reviews_original": "No reviews",
          "type": "Insurance agency",
          "phone": "(800) 966-7731",
          "address": "106 E 6th St",
          "years_in_business": "10+ years in business"
        }
      
Processed in 0.52s
        {
          "title": "The Keane Insurance Group, Inc",
          "type": "Insurance agency",
          "phone": "(800) 966-7731",
          "address": "106 E 6th St, Austin, TX 78701",
          "years_in_business": "10+ years",
          "hours": "",
          "rating": "",
          "reviews": "No reviews",
          "price": "",
          "description": "Website: http://www.keanegroup.com/",
          "service_options": {
            "in_store_shopping": true,
            "delivery": true
          }
        }
      
Processed in 5.42s

It hallucinates on the service_options and description.


Query: Yoga Studio (Playground)

Query: Yoga Studio (1st Result)
TraditionalMistral 7B
        {
          "title": "YTX Yoga, Strength, & HIIT Training",
          "type": "Yoga studio",
          "phone": "(512) 502-5183",
          "address": "206 E 4th St",
          "years_in_business": "10+ years in business",
          "hours": "Closed ⋅ Opens 6:30 AM Fri",
          "rating": 4.8,
          "reviews_original": "(293)",
          "reviews": 293,
          "service_options": {
            "onsite_services": true,
            "online_classes": true
          }
        }
      
Processed in 0.39s
        {
          "title": "YTX Yoga, Strength, & HIIT Training",
          "type": "Studio",
          "phone": "(512) 502-5183",
          "address": "10+ years in business · 206 E 4th St",
          "years_in_business": "10+",
          "hours": "Closed ⋅ Opens 6:30 AM Fri",
          "rating": "4.8",
          "reviews": "(293)",
          "price": "Typical spending on the business",
          "description": "",
          "service_options": {
            "onsite_services": true,
            "online_classes": true
          }
        }
      
Processed in 7.51s

It has a mistake on type, address and price. years_in_business could be considered correct.

Query: Yoga Studio (2nd Result)
TraditionalMistral 7B
        {
          "title": "Practice Yoga Austin",
          "type": "Yoga studio",
          "phone": "(512) 730-1638",
          "address": "1103 E 6th St",
          "years_in_business": "10+ years in business",
          "hours": "Closed ⋅ Opens 6:45 AM Fri",
          "rating": 5,
          "reviews_original": "(91)",
          "reviews": 91,
          "description": "\"Practice Yoga is a great non fussy studio right in the heart of east side.\""
        }
      
Processed in 0.39s
        {
          "title": "Practice Yoga Austin",
          "type": "Yoga studio",
          "phone": "(512) 730-1638",
          "address": "10+ years in business",
          "years_in_business": 1103 E 6th St,
          "hours": "Closed ⋅ Opens 6:45 AM Fri",
          "rating": "Rated 5.0 out of 5,",
          "reviews": "(91)",
          "price": "Typical spending on the business",
          "description": "\"Practice yoga studio is very unique.\""
        }
      
Processed in 7.73s

It would be perfect if address and years_in_business is switched.

Query: Yoga Studio (3rd Result)
TraditionalMistral 7B
        {
          "title": "CorePower Yoga",
          "type": "Yoga studio",
          "phone": "(833) 448-2561",
          "address": "801 W 5th St Retail Unit I",
          "years_in_business": "10+ years in business",
          "hours": "Closed ⋅ Opens 6 AM Fri",
          "rating": 4.6,
          "reviews_original": "(139)",
          "reviews": 139,
          "service_options": {
            "onsite_services": true,
            "online_classes": true
          }
        }
      
Processed in 0.39s
        {
          "title": "CorePower Yoga",
          "type": "Yoga studio",
          "phone": "(833) 448-2561",
          "address": "10+ years in business",
          "years_in_business": "10+ years",
          "hours": "Closed",
          "rating": "4.6 out of 5, (139 reviews)",
          "reviews": "139",
          "price": "Typical spending on the business is not mentioned",
          "description": "",
          "service_options": {
            "Onsite services": true,
            "Online classes": true
          }
        }
      
Processed in 4.75s

The same mistake on address and price.

Conclusion

Compared to the Machine learning approach, working with Large Language Models is easier. However, the output can be inconsistent and there is no deterministic way to improve it. I believe we can modify the prompt to make it perform better but there will always be edge cases that are not covered. Nevertheless, the fact that it can do the parsing with a few lines of prompt is already impressive. The traditional parser has way more lines of code and conditions to take care to perform this well. It might not ready for production, but I am still looking forward to integrate Large Language Models into our workflow.

Last but not least, Mistral 7B is definitely impressive considering its size, I can't wait to test again with a more capable model that will come next from the Mistral AI team, which at this time, MIXTRAL 8X7B just launched.