In the previous blog post, I showed how OpenAI can extract numbers from HTML.
This time, I tried to convert About 1,950,000,000 results (0.76 seconds)
to { "total_results": 1950000000, "time_taken_displayed": 0.76 }.
Nine examples were enough to correctly extract data from similar HTML. For me, it feels like magic since I have no experience in the AI field.
As in the previous blog post, I specified examples of HTML and the desired output.
This program extracts the number of search results and total time taken as JSON from the HTML
HTML: "<div id="result-stats">About 3,200,000,000 results<nobr> (0.97 seconds) </nobr></div>"
Number of search results and total time taken as JSON: { "total_results": 3200000000, "time_taken_displayed": 0.97 }
HTML: "<div id="result-stats">About 2,200,000,000 results<nobr> (0.297 seconds) </nobr></div>"
Number of search results and total time taken as JSON: { "total_results": 2200000000, "time_taken_displayed": 0.297 }
HTML: "<div id="result-stats">Environ 1 400 000 000 résultats<nobr> (1,05 secondes) </nobr></div>"
Number of search results and total time taken as JSON: { "total_results": 1400000000, "time_taken_displayed": 1.05 }
HTML: "<div id="result-stats">About 1,790,000 results<nobr> (0.55 seconds) </nobr></div>"
Number of search results and total time taken as JSON: { "total_results": 5000000, "time_taken_displayed": 0.55 }
HTML: "<span class="nums_text">百度为您找到相关结果约100,000,000个</span>"
Number of search results and total time taken as JSON: { "total_results": 100000000, "time_taken_displayed": null }
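The example pairs above follow a fixed pattern, so this part of the prompt could also be assembled programmatically. Here is a minimal Python sketch of that idea. The names `EXAMPLES`, `format_json`, and `build_examples` are hypothetical and not part of the original experiment, which was typed by hand into the Playground.

```python
import json

HEADER = ("This program extracts the number of search results "
          "and total time taken as JSON from the HTML")
LABEL = "Number of search results and total time taken as JSON"

# Two of the example pairs from the prompt above (HTML, desired output).
EXAMPLES = [
    ('<div id="result-stats">About 3,200,000,000 results<nobr> (0.97 seconds) </nobr></div>',
     {"total_results": 3200000000, "time_taken_displayed": 0.97}),
    ('<span class="nums_text">百度为您找到相关结果约100,000,000个</span>',
     {"total_results": 100000000, "time_taken_displayed": None}),
]

def format_json(data):
    # Match the spacing used in the prompt: { "key": value, ... }
    body = json.dumps(data, ensure_ascii=False)
    return "{ " + body[1:-1] + " }"

def build_examples(pairs):
    # One "HTML:" line and one answer line per example, after the header.
    lines = [HEADER]
    for html, data in pairs:
        lines.append(f'HTML: "{html}"')
        lines.append(f"{LABEL}: {format_json(data)}")
    return "\n".join(lines)

print(build_examples(EXAMPLES))
```

Keeping the examples as data makes it easy to add or swap a pair without retyping the whole prompt.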
I also added expected results for Eastern Arabic numerals (for example ٢٬٣١٠٬٠٠٠٬٠٠٠), which I hadn't specified in the previous post.
HTML text
1. "<div id="result-stats">About 1,800,0020,000 results<nobr> (0.589 seconds) </nobr></div>"
2. "<div id="result-stats">About 953,626,112 results<nobr> (1.29 seconds) </nobr></div>"
3. "<div id="result-stats">Ungefähr 1.240.000.000 Ergebnisse<nobr> (0,72 Sekunden) </nobr></div>"
4. "<div id="result-stats">حوالى ٢٬٣١٠٬٠٠٠٬٠٠٠ نتيجة<nobr> (٠٫٧٣ ثانية) </nobr></div>"
Extracted JSON
1. { "total_results": 18000020000, "time_taken_displayed": 0.589 }
2. { "total_results": 953626112, "time_taken_displayed": 1.29 }
3. { "total_results": 1240000000, "time_taken_displayed": 0.72 }
4. { "total_results": 2310000000, "time_taken_displayed": 0.73 }
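To double-check expected values like the ones above, a small hand-written baseline can help. The following Python sketch is an assumption of mine, not something used in the experiment: it strips the tags, keeps every Unicode decimal digit before the `<nobr>` block, and joins them into one integer. The function name `expected_total_results` is hypothetical.

```python
import re
import unicodedata

def expected_total_results(html):
    """Return the result count as an int, or None if no digits are found."""
    # Keep only the part before the timing block, then drop the tags.
    head = html.split("<nobr>")[0]
    text = re.sub(r"<[^>]+>", "", head)
    # unicodedata.decimal maps any Unicode decimal digit to its value
    # and returns the default (None) for every other character.
    digits = [unicodedata.decimal(ch, None) for ch in text]
    number = "".join(str(d) for d in digits if d is not None)
    return int(number) if number else None

samples = [
    '<div id="result-stats">About 953,626,112 results<nobr> (1.29 seconds) </nobr></div>',
    '<div id="result-stats">Ungefähr 1.240.000.000 Ergebnisse<nobr> (0,72 Sekunden) </nobr></div>',
    '<div id="result-stats">حوالى ٢٬٣١٠٬٠٠٠٬٠٠٠ نتيجة<nobr> (٠٫٧٣ ثانية) </nobr></div>',
]
for html in samples:
    print(expected_total_results(html))
```

This only works while the count is the sole number in the text before `<nobr>`. Any other digits in that text would be merged into the result, which is the kind of fragility that makes hand-crafted parsers costly to maintain.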
Then I provided the prompt to extract JSON. The prompt ends at "2.", leaving the rest of the numbered list for the model to complete.
HTML text
1. "<div id="result-stats">About 1,800,0020,000 results<nobr> (0.589 seconds) </nobr></div>"
2. "<div id="result-stats">約 1,510,000,000 件<nobr> (0.82 秒) </nobr></div>"
3. "<div id="result-stats">Aproximadamente 2.180.000.000 resultados<nobr> (0,73 segundos) </nobr></div>"
4. "<div id="result-stats">Sekitar 2.480.000.000 hasil<nobr> (0,72 detik) </nobr></div>"
5. "<div id="result-stats">حوالى ١٧٬٤٤٠٬٠٠٠٬٠٠٠ نتيجة<nobr> (٠٫٩٠ ثانية) </nobr></div>"
6. "<div id="result-stats">Yaklaşık 2.680.000.000 sonuç bulundu<nobr> (0,62 saniye) </nobr></div>"
7. "<div id="result-stats">Приблизна кількість результатів: 2 630 000 000<nobr> (1,38 с) </nobr></div>"
8. "<div id="result-stats">Aproximadamente 19.250.000.000 resultados<nobr> (0,73 segundos) </nobr></div>"
9. "<div id="result-stats">Ungefär 1 960 000 000 resultat<nobr> (0,80 sekunder) </nobr></div>"
10. "<div id="result-stats">Περίπου 2.480.000.000 αποτελέσματα<nobr> (0,76 δευτερόλεπτα) </nobr></div>"
Extracted JSON
1. { "total_results": 18000020000, "time_taken_displayed": 0.589 }
2.
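The final section of the prompt has a regular shape too: a numbered list of HTML snippets, the first answer filled in, and a dangling "2." for the model to continue. A possible way to generate it in Python is sketched below. `build_batch` is a hypothetical helper, not part of the original experiment.

```python
import json

def build_batch(html_items, first_answer):
    # Numbered list of HTML snippets to extract from.
    lines = ["HTML text"]
    for i, html in enumerate(html_items, start=1):
        lines.append(f'{i}. "{html}"')
    # The first answer is given so the model sees the expected format.
    lines.append("Extracted JSON")
    body = json.dumps(first_answer)
    lines.append("1. { " + body[1:-1] + " }")
    # Ending on "2." leaves the remaining items for the model to fill in.
    lines.append("2.")
    return "\n".join(lines)

items = [
    '<div id="result-stats">About 1,800,0020,000 results<nobr> (0.589 seconds) </nobr></div>',
    '<div id="result-stats">約 1,510,000,000 件<nobr> (0.82 秒) </nobr></div>',
]
print(build_batch(items, {"total_results": 18000020000, "time_taken_displayed": 0.589}))
```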
Then I increased the Response length to 300 and generated the output. Nine examples were enough to correctly extract data from similar HTML. Miracle!
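Once the output is generated, it still has to be turned into data. The sketch below shows one way to parse a completion of this shape in Python. It assumes the completion continues right after the "2." that ends the prompt, and it skips any line that is not complete JSON, such as a last line cut off by the Response length. `parse_numbered_json` is a hypothetical name, and the sample text is illustrative, written in the format of the expected results rather than copied from real model output.

```python
import json
import re

# A numbered line such as: 3. { "total_results": 1, "time_taken_displayed": 0.5 }
LINE = re.compile(r"^\s*(\d+)\.\s*(\{.*\})\s*$")

def parse_numbered_json(completion, first_index=2):
    # The prompt already ends with "2.", so put that prefix back first.
    text = f"{first_index}.{completion}"
    results = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # not a complete numbered JSON line
        try:
            results[int(match.group(1))] = json.loads(match.group(2))
        except json.JSONDecodeError:
            continue  # braces present but the content is not valid JSON
    return results

sample = (
    ' { "total_results": 953626112, "time_taken_displayed": 1.29 }\n'
    '3. { "total_results": 1240000000, "time_taken_displayed": 0.72 }\n'
    '4. { "total_results": 23'
)
print(parse_numbered_json(sample))
```

Keying the results by item number makes it straightforward to match each extracted object back to the HTML snippet with the same number.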
I used the same Davinci model in the OpenAI Playground and, apart from the Response length, the default parameters. Here's a YouTube video of using OpenAI:
Next time, we will extract data from Google organic results as JSON. The end goal is to replace some of the hand-crafted parsers with automated data extraction.
Links
OpenAI Playground • Request beta access for OpenAI • Try SerpApi for free
Outro
If you have any questions or ideas on how to automatically extract data from SERPs, feel free to drop a comment via Twitter at @serp_api.