Parse the number of Google search results as JSON using beta OpenAI

OpenAI data extraction: 'About 1,950,000,000 results (0.76 seconds)' to '{ "total_results": 1950000000, "time_taken_displayed": 0.76 }'.

Prompt for OpenAI completion to extract a small JSON from the Google SERP

In the previous blog post, I showed how OpenAI can extract numbers from HTML.

This time, I tried to convert 'About 1,950,000,000 results (0.76 seconds)' to '{ "total_results": 1950000000, "time_taken_displayed": 0.76 }'. Nine examples were enough to correctly extract data from similar HTML. To me, it feels like magic, since I have no experience in the AI field.

The data to extract: the total number of results and the time taken to process the search

As in the previous blog post, I specified examples of HTML and the desired output.

This program extracts the number of search results and total time taken as JSON from the HTML
 
HTML: "<div id="result-stats">About 3,200,000,000 results<nobr> (0.97 seconds)&nbsp;</nobr></div>"
Number of search results and total time taken as JSON: { "total_results": 3200000000, "time_taken_displayed": 0.97 }

HTML: "<div id="result-stats">About 2,200,000,000 results<nobr> (0.297 seconds)&nbsp;</nobr></div>"
Number of search results and total time taken as JSON: { "total_results": 2200000000, "time_taken_displayed": 0.297 }

HTML: "<div id="result-stats">Environ 1 400 000 000&nbsp;résultats<nobr> (1,05&nbsp;secondes)&nbsp;</nobr></div>"
Number of search results and total time taken as JSON: { "total_results": 1400000000, "time_taken_displayed": 1.05 }

HTML: "<div id="result-stats">About 1,790,000 results<nobr> (0.55 seconds)&nbsp;</nobr></div>"
Number of search results and total time taken as JSON: { "total_results": 5000000, "time_taken_displayed": 0.55 }

HTML: "<span class="nums_text">百度为您找到相关结果约100,000,000个</span>"
Number of search results and total time taken as JSON: { "total_results": 100000000, "time_taken_displayed": null }
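The example pairs above follow a fixed pattern, so they can be assembled from data instead of typed by hand. Here is a minimal sketch, assuming the pairs are kept as Python tuples; the names `format_answer` and `build_examples_section` are hypothetical helpers for illustration, not part of any OpenAI tooling.

```python
import json

INSTRUCTION = (
    "This program extracts the number of search results "
    "and total time taken as JSON from the HTML"
)
LABEL = "Number of search results and total time taken as JSON:"


def format_answer(total_results, time_taken_displayed):
    """Render one answer in the spaced-brace style used in the prompt."""
    body = json.dumps(
        {"total_results": total_results,
         "time_taken_displayed": time_taken_displayed}
    )
    # json.dumps gives {"a": 1}; the prompt uses { "a": 1 }.
    return "{ " + body[1:-1] + " }"


def build_examples_section(examples):
    """examples: iterable of (html, total_results, time_taken_displayed)."""
    blocks = [INSTRUCTION]
    for html, total, seconds in examples:
        blocks.append(f'HTML: "{html}"\n{LABEL} {format_answer(total, seconds)}')
    # A blank line separates the instruction and each example pair.
    return "\n\n".join(blocks)
```

Passing `None` as the time produces `null`, which matches the Baidu example where no time is displayed.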

I've also added expected results with Eastern Arabic numerals (item 4 below), which I hadn't specified in the previous post.

HTML text
1. "<div id="result-stats">About 1,800,0020,000 results<nobr> (0.589 seconds)&nbsp;</nobr></div>"
2. "<div id="result-stats">About 953,626,112 results<nobr> (1.29 seconds)&nbsp;</nobr></div>"
3. "<div id="result-stats">Ungefähr 1.240.000.000 Ergebnisse<nobr> (0,72 Sekunden)&nbsp;</nobr></div>"
4. "<div id="result-stats">حوالى ٢٬٣١٠٬٠٠٠٬٠٠٠ نتيجة<nobr> (٠٫٧٣ ثانية)&nbsp;</nobr></div>"

Extracted JSON
1. { "total_results": 18000020000, "time_taken_displayed": 0.589 }
2. { "total_results": 953626112, "time_taken_displayed": 1.29 }
3. { "total_results": 1240000000, "time_taken_displayed": 0.72 }
4. { "total_results": 2310000000, "time_taken_displayed": 0.73 }

Then I provided the prompt with the new HTML to extract JSON from.

HTML text
1. "<div id="result-stats">About 1,800,0020,000 results<nobr> (0.589 seconds)&nbsp;</nobr></div>"
2. "<div id="result-stats">約 1,510,000,000 件<nobr> (0.82 秒)&nbsp;</nobr></div>"
3. "<div id="result-stats">Aproximadamente 2.180.000.000 resultados<nobr> (0,73 segundos)&nbsp;</nobr></div>"
4. "<div id="result-stats">Sekitar 2.480.000.000 hasil<nobr> (0,72 detik)&nbsp;</nobr></div>"
5. "<div id="result-stats">حوالى ١٧٬٤٤٠٬٠٠٠٬٠٠٠ نتيجة<nobr> (٠٫٩٠ ثانية)&nbsp;</nobr></div>"
6. "<div id="result-stats">Yaklaşık 2.680.000.000 sonuç bulundu<nobr> (0,62 saniye)&nbsp;</nobr></div>"
7. "<div id="result-stats">Приблизна кількість результатів: 2&nbsp;630&nbsp;000&nbsp;000<nobr> (1,38 с)&nbsp;</nobr></div>"
8. "<div id="result-stats">Aproximadamente 19.250.000.000 resultados<nobr> (0,73 segundos)&nbsp;</nobr></div>"
9. "<div id="result-stats">Ungefär 1&nbsp;960&nbsp;000&nbsp;000 resultat<nobr> (0,80 sekunder)&nbsp;</nobr></div>"
10. "<div id="result-stats">Περίπου 2.480.000.000 αποτελέσματα<nobr> (0,76 δευτερόλεπτα)&nbsp;</nobr></div>"

Extracted JSON
1. { "total_results": 18000020000, "time_taken_displayed": 0.589 }
2.
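The numbered layout above ends with an open "2." so that the model continues the list on its own. A minimal sketch of how that part of the prompt could be generated, assuming the HTML snippets are held in a list; `build_batch_section` and its parameters are hypothetical names chosen for this illustration.

```python
def build_batch_section(html_items, primed_answers=()):
    """Build the 'HTML text' / 'Extracted JSON' part of the prompt.

    html_items: HTML snippets to number from 1.
    primed_answers: answers already filled in, in order; the next
    number is left open for the model to complete.
    """
    lines = ["HTML text"]
    for number, html in enumerate(html_items, start=1):
        lines.append(f'{number}. "{html}"')
    lines += ["", "Extracted JSON"]
    for number, answer in enumerate(primed_answers, start=1):
        lines.append(f"{number}. {answer}")
    # Leave the next item number dangling so the completion starts there.
    lines.append(f"{len(primed_answers) + 1}.")
    return "\n".join(lines)
```

Filling in the first answer and stopping at the next number shows the model both the output format and where to continue.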

Then I increased the Response length to 300 and generated the output. Nine examples were enough to correctly extract data from similar HTML. Miracle!

Extracted JSON
1. { "total_results": 18000020000, "time_taken_displayed": 0.589 }
2. { "total_results": 5105000000, "time_taken_displayed": 0.823 }
3. { "total_results": 2180000000, "time_taken_displayed": 0.737 }
4. { "total_results": 2450000000, "time_taken_displayed": 0.739 }
5. { "total_results": 9600000, "time_taken_displayed": 0.829 }
6. { "total_results": 2470000000, "time_taken_displayed": 0.741 }
7. { "total_results": 2336000000, "time_taken_displayed": 0.764 }
8. { "total_results": 240000000, "time_taken_displayed": 0.766 }
9. { "total_results": 19000000, "time_taken_displayed": 0.832 }
10. { "total_results": 29600000, "time_taken_displayed": 0.782 }
Now OpenAI has also extracted the Eastern Arabic numerals (#5).
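Since the completion comes back as numbered text, it helps to parse each line into JSON and cross-check it before using it. The sketch below is one way to do that, not the method used in this post: `parse_numbered_json` and `baseline_total_results` are hypothetical helpers, and the baseline is a deliberately simple hand-written rule that assumes the result count is the first group of digits in the visible text.

```python
import json
import re


def parse_numbered_json(completion_text):
    """Map item number -> parsed dict for lines shaped like '3. { ... }'."""
    results = {}
    for line in completion_text.splitlines():
        match = re.match(r"\s*(\d+)\.\s*(\{.*\})\s*$", line)
        if not match:
            continue
        try:
            results[int(match.group(1))] = json.loads(match.group(2))
        except json.JSONDecodeError:
            pass  # skip lines that look like JSON but do not parse
    return results


def baseline_total_results(html):
    """Cross-check value: the first digit group in the tag-stripped text."""
    text = re.sub(r"<[^>]+>", "", html).replace("&nbsp;", " ")
    # Digits with ., comma, whitespace, or the Arabic thousands separator.
    match = re.search(r"\d[\d.,\s\u066c]*", text)
    if not match:
        return None
    # \d and int() both accept Unicode decimal digits such as Eastern Arabic.
    return int(re.sub(r"\D", "", match.group(0)))
```

Comparing `parse_numbered_json(output)[n]["total_results"]` with `baseline_total_results(html_n)` for every item is a cheap way to flag answers that need a second look.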

I used the same Davinci model and default parameters in the OpenAI Playground. Here's a YouTube video of the process:

Demo of data extraction on OpenAI Playground
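For readers who would rather script this than click through the Playground, the settings described above can be written down as request parameters. This is only a sketch under assumptions: the field names `model`, `prompt`, and `max_tokens` follow the OpenAI completions API as I understand it, the Playground's Response length is assumed to map to `max_tokens`, and `build_completion_request` is a hypothetical helper that builds the parameters without sending anything.

```python
def build_completion_request(prompt, response_length=300, model="davinci"):
    """Return parameters for a completions-style request (nothing is sent).

    response_length mirrors the Playground's Response length setting,
    assumed here to correspond to the max_tokens field.
    """
    if response_length <= 0:
        raise ValueError("response_length must be positive")
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": response_length,
    }
```

All other parameters are left out on purpose, so that whatever defaults the service applies stay in effect; check the current API documentation before relying on any of these names.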

Next time, we will extract data from Google organic results as JSON. The end goal is to replace some of the hand-crafted parsers with automated data extraction.

OpenAI Playground
Request beta access for OpenAI
Try SerpApi for free

Outro

If you have any questions or ideas on how to automatically extract data from SERPs, feel free to reach out on Twitter at @serp_api.