AI gets scarily better every day. OpenAI now offers a Vision API, which lets you extract information from an image.

We'll learn how to use OpenAI's Vision API, first on a simple image and then to extract data from a more complex one.

OpenAI vision API to scrape data from an image

We previously experimented with parsing raw HTML data with AI. Feel free to read the blog post below:

Web scraping experiment with AI (Parsing HTML with GPT-4)
Parsing data from web scraping results can often be cumbersome. But what if there’s a way to turn this painstaking process into a breeze? Let’s experiment with new AI model by OpenAI

Vision API tutorial step-by-step

Let's start by setting up a project to test the Vision API. I'll be using JavaScript (Node.js) in this sample, but feel free to use any language you're comfortable with.

Preparation
Create a new directory and initialize NPM

mkdir openai-vision-api && cd openai-vision-api
npm init -y                         # Initialize the project
npm install openai dotenv --save    # Install the openai and dotenv packages

Add API Key
Get your API key from the OpenAI dashboard and put it in the .env file. Create the .env file in the project directory if it doesn't exist yet.

OPENAI_API_KEY=YOUR_API_KEY

Basic code setup
Create a new index.js file, import the required packages, and create a new OpenAI client instance.

require("dotenv").config();
const OpenAI = require('openai');

const { OPENAI_API_KEY } = process.env;

const openai = new OpenAI({
  apiKey: OPENAI_API_KEY,
});

Add vision API method
Here is how to call the Vision API in your code:

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What’s in this image?" },
          {
            type: "image_url",
            image_url: {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            },
          },
        ],
      },
    ]
  });
  console.log(response.choices[0].message.content);
}
main();

Now run the program with

node index.js 

Here is the result:

Simple example of Vision API

Parsing data from complex image with Vision API

We saw it worked with a simple image. Now, let's try a complex one. I'm going to take a screenshot from Google Shopping results.

I'll upload this image so that we can pass its public URL to the Vision API.
Google shopping results screenshot for coffee

I need to update two things. First, the max_tokens parameter, since the response will be longer. Second, the prompt, to tell the AI exactly what I want.

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Please share the detail information of each item on this product on a nice structure JSON" },
          {
            type: "image_url",
            image_url: {
              "url": "https://i.ibb.co/F8nGWk5/Clean-Shot-2024-01-17-at-13-46-43.png",
            },
          },
        ],
      },
    ],
    max_tokens: 1000 // Allow a longer response
  });
  console.log(response.choices[0].message.content);
}
main();

Here is the result:

Vision API result for a complex image

The result is very good! But here is the catch:
- The response is not always consistent (structure-wise). I believe we can solve this by adjusting our prompt.
- Parsing this particular image takes roughly 10 to 20+ seconds. (That's just the parsing time, not the scraping time.)
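
Besides tuning the prompt, one way to cope with the inconsistent structure is to parse the reply defensively. The following is a minimal sketch, assuming the model returns its JSON either inside a Markdown code fence or surrounded by plain text. `extractJson` is a hypothetical helper written for this post, not part of the openai package.

```javascript
// Hypothetical helper: pull the first JSON object or array out of a free-form model reply.
const FENCE = "`".repeat(3); // three backticks, built at runtime to keep this snippet easy to embed
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

function extractJson(reply) {
  // Prefer the contents of a fenced block if the reply contains one.
  const fenced = reply.match(FENCED_BLOCK);
  const candidate = fenced ? fenced[1] : reply;

  // Take the span from the first { or [ to the last } or ].
  const start = candidate.search(/[{[]/);
  const end = Math.max(candidate.lastIndexOf("}"), candidate.lastIndexOf("]"));
  if (start === -1 || end < start) {
    throw new Error("No JSON object or array found in the reply");
  }

  // JSON.parse still throws if the extracted span is not valid JSON.
  return JSON.parse(candidate.slice(start, end + 1));
}
```

You could call it as `extractJson(response.choices[0].message.content)` and retry the request when it throws. This only makes the reply parseable; it does not guarantee that the keys stay the same between runs.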

Can we use this as a web scraping solution?

As you might know, parsing data is just one part of web scraping. There are other things involved, like proxy rotation, solving CAPTCHAs, and so on. So we can't say that the Vision API is a web scraping solution on its own.

Here is an idea, though, for how to use it as part of a web scraping solution:
- Create a scraping solution, for example using Puppeteer in JavaScript, to take a screenshot.
- Upload the image to a public URL or get its base64 encoding.
- Pass this image to the Vision API call in the image_url parameter, as in the examples above.
- Return the results in a nice structured way.
- (Bonus) If you want to have a consistent data structure, you might want to learn about function calling by OpenAI.
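
The steps above can be sketched roughly as follows. This is an untested outline under a few assumptions: Puppeteer is installed (`npm install puppeteer`), the API key is in .env as before, and the screenshot is sent as a base64 data URL instead of being uploaded. The names `toDataUrl`, `buildVisionMessages`, `screenshotAsDataUrl`, and `scrapeWithVision` are hypothetical helpers made up for this sketch.

```javascript
// Wrap a base64-encoded image in a data URL, which the image_url field accepts.
function toDataUrl(base64, mimeType = "image/png") {
  return `data:${mimeType};base64,${base64}`;
}

// Build the same message shape used in the examples above.
function buildVisionMessages(prompt, imageUrl) {
  return [
    {
      role: "user",
      content: [
        { type: "text", text: prompt },
        { type: "image_url", image_url: { url: imageUrl } },
      ],
    },
  ];
}

// Step 1: take a full-page screenshot with Puppeteer and return it as a data URL.
async function screenshotAsDataUrl(pageUrl) {
  const puppeteer = require("puppeteer"); // loaded lazily so the helpers above work without it
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(pageUrl, { waitUntil: "networkidle2" });
    // Puppeteer screenshots are PNG by default; encoding "base64" returns a string.
    const base64 = await page.screenshot({ encoding: "base64", fullPage: true });
    return toDataUrl(base64);
  } finally {
    await browser.close();
  }
}

// Steps 2 to 4: send the screenshot to the Vision API and return the model's reply.
async function scrapeWithVision(pageUrl, prompt) {
  require("dotenv").config();
  const OpenAI = require("openai");
  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  const imageUrl = await screenshotAsDataUrl(pageUrl);
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: buildVisionMessages(prompt, imageUrl),
    max_tokens: 1000,
  });
  return response.choices[0].message.content;
}
```

A call such as `scrapeWithVision("https://example.com", "List every product in this page as JSON").then(console.log).catch(console.error)` would run the whole flow. Proxy rotation and CAPTCHA handling are deliberately left out, so this covers the parsing idea only.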

Summary

It's fun to experiment with OpenAI features like the Vision API and see how they could help us with web scraping and parsing.

The example above, where we parse a Google Shopping results page, is still far from production-ready compared to the Google Shopping API, which takes only 1-3 seconds to scrape the Google Shopping page and return it in a consistent, structured format.

FAQ

How much does vision API cost?
The gpt-4-1106-vision-preview model costs $0.01 / 1K tokens for input and $0.03 / 1K tokens for output.
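
As a rough illustration of what those prices mean per request, here is a small sketch. `estimateCostUsd` is a hypothetical helper, and the token counts in the example are made up. In practice you would read the real counts from `response.usage.prompt_tokens` and `response.usage.completion_tokens`, and the image itself counts toward the input tokens.

```javascript
// Prices quoted above for gpt-4-1106-vision-preview, in USD per 1K tokens.
const INPUT_USD_PER_1K = 0.01;
const OUTPUT_USD_PER_1K = 0.03;

// Hypothetical helper: estimate the cost of one request from its token counts.
function estimateCostUsd(promptTokens, completionTokens) {
  return (
    (promptTokens / 1000) * INPUT_USD_PER_1K +
    (completionTokens / 1000) * OUTPUT_USD_PER_1K
  );
}

// Made-up example: 1,200 input tokens and 800 output tokens.
// 1.2 * 0.01 + 0.8 * 0.03 = 0.012 + 0.024 = 0.036
console.log(estimateCostUsd(1200, 800).toFixed(4)); // prints "0.0360"
```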

Does it support function calling?
Not right now. The gpt-4-1106-vision-preview model doesn't support function calling yet (as of January 17, 2024).

Reference: OpenAI Vision API