Scraping Naver Images results with Python

Today we are going to explore how to scrape image search results from Naver, the Korean search engine.

It can serve as an alternative to Google Images if you need one.

We will go through the process step by step until we have a scraper that returns JSON results identical to what our API provides: Naver Images API documentation

Introduction

By leveraging the power of web scraping and the Requests library, we can extract valuable information from Naver's image search results and save it in a structured format (JSON in our case). Whether you need to gather data for research purposes or build a dataset for machine learning, this script will be handy.

Today we are only going to talk about scraping the results as JSON. In future articles we will discuss how to handle the HTML output.



Understanding and writing the script


To begin with, we import the necessary libraries: requests, json, re, and urllib.parse. The requests library enables us to send HTTP requests to retrieve web pages, while json facilitates working with JSON data. The re module helps us extract the required JSON data from the response, and urllib.parse assists in unquoting URL-encoded strings.

import requests
import json
import re
import urllib.parse

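Many people who want to scrape Naver image results find it difficult because, unlike other search engines, Naver does not embed the image results in the HTML page. Instead, it serves them from a separate endpoint as a JS file, which works like an API. The response wraps the JSON payload in a call to the function named by the _callback parameter (serpapi in the URL below).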

search_url = "https://s.search.naver.com/p/c/image/search.naver?query=coffee&json_type=6&display=50&start=1&_callback=serpapi"
results = get_image_results(search_url)
save_as_json(results)
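
The query string in that URL can also be assembled programmatically rather than typed by hand. The following is a minimal sketch and not part of the original script: build_search_url is a hypothetical helper, and reading display as the number of results per request and start as a 1-based offset is an assumption based on the parameter names, not something confirmed here.

```python
from urllib.parse import urlencode

# Endpoint taken from the search_url used in this article.
BASE_URL = "https://s.search.naver.com/p/c/image/search.naver"


def build_search_url(query, start=1, display=50, callback="serpapi"):
    """Hypothetical helper: rebuild the article's search URL from its parts."""
    params = {
        "query": query,         # search term; urlencode percent-encodes it
        "json_type": 6,         # copied as-is from the article's URL
        "display": display,     # assumed: number of results per request
        "start": start,         # assumed: 1-based offset of the first result
        "_callback": callback,  # name of the function wrapping the JSON
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("coffee"))
```

With the default arguments this reproduces the search_url shown above, and a query containing spaces or Korean characters is percent-encoded automatically.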


Once we have the image results, we proceed to save them as a JSON file. We check if there are results available and, if so, write them to a file named image_results.json using the json.dump method.

def save_as_json(image_results):
    if image_results:
        with open('image_results.json', 'w') as json_file:
            json.dump(image_results, json_file, indent=4)
            print("Image results saved successfully as image_results.json")
    else:
        print("No image results to save.")
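
Naver results often contain Korean text, and json.dump escapes non-ASCII characters as \uXXXX sequences by default. The file is still valid JSON, but it is harder to read. As a possible variant, assuming you prefer readable output, the sketch below writes the file as UTF-8 with ensure_ascii=False. The name save_as_readable_json and the path parameter are hypothetical additions for illustration.

```python
import json


def save_as_readable_json(image_results, path="image_results.json"):
    """Hypothetical variant of save_as_json that keeps non-ASCII text readable."""
    if not image_results:
        print("No image results to save.")
        return False
    # An explicit UTF-8 encoding avoids depending on the platform default.
    with open(path, "w", encoding="utf-8") as json_file:
        # ensure_ascii=False writes characters such as Korean text as-is
        # instead of \uXXXX escape sequences.
        json.dump(image_results, json_file, indent=4, ensure_ascii=False)
    print(f"Image results saved successfully as {path}")
    return True
```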

The get_image_results function takes a URL as input, sends an HTTP GET request to that URL using requests.get, and retrieves the response. It then uses a regular expression (re.search) to extract the JSON data from the response text. The extracted JSON data is converted to a Python dictionary using json.loads.
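
To see the extraction step in isolation, here is a minimal sketch that applies the same idea to a hardcoded string, so it runs without a network request. The sample payload is made up for illustration and is not a real Naver response, and extract_jsonp is a hypothetical helper name.

```python
import json
import re


def extract_jsonp(text, callback="serpapi"):
    """Hypothetical helper: return the JSON object wrapped in callback(...)."""
    # Greedy match with DOTALL: capture from the first "{" after the callback
    # name to the last "}" followed by ")", even across line breaks.
    pattern = re.escape(callback) + r"\((\{.*\})\)"
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


# Made-up sample for illustration only, not a real Naver response.
sample = 'serpapi({"items": [{"title": "Coffee (iced)", "link": "https://example.com"}]});'
data = extract_jsonp(sample)
print(data["items"][0]["title"])
```

Returning None when the pattern is not found lets the caller decide how to handle an unexpected response instead of failing with an AttributeError.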

Next, we initialize an empty list called image_results to store the formatted image results. We iterate over each item in the data dictionary's 'items' list. If an item has a 'title', we extract the required fields from the item and create a dictionary called image_result. We assign the corresponding values to the keys in the image_result dictionary.

def get_image_results(url):
    response = requests.get(url)
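    # The response looks like serpapi({...}); capture the JSON object inside the call
    match = re.search(r'serpapi\((\{.*\})\)', response.text, re.DOTALL)
    if not match:
        # Unexpected response format: return an empty list instead of crashing
        return []
    data = json.loads(match.group(1))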
    image_results = []

    for item in data.get('items', []):
        if 'title' in item:
            image_result = {
                'title': item['title'],
                'source': item['writerTitle'],
                'where': item['source'],
                'is_gif': item['is_gif'],
                'width': item['orgWidth'],
                'height': item['orgHeight'],
                'thumbnail_width': item['thumbWidth'],
                'thumbnail_height': item['thumbHeight'],
                'thumbnail': urllib.parse.unquote(item['thumb']),
                'original': item['originalUrl'],
                'link': item['link']
            }
            image_results.append(image_result)

    return image_results

After extracting the necessary information from each item, we append the image_result dictionary to the image_results list.

Finally, we return the image_results list.


Conclusion

As you can see, scraping image results from Naver is straightforward, and it may even be easier than scraping other search engines. In our case the biggest obstacle was discovering how Naver handles the images in the backend, which turned out to be the JS file. Other sites use different methods. For example, Home Depot loads its data through GraphQL in a similar fashion.

The full and final script:

import requests
import json
import re
from urllib.parse import unquote

def get_image_results(url):
    response = requests.get(url)
    data = response.text

    # Extract the JSON data from the response
    json_data = re.search(r'serpapi\((\{.*\})\)', data, re.DOTALL)
    if json_data:
        json_string = json_data.group(1)
        data_dict = json.loads(json_string)

        # Extract the images results
        images_results = data_dict.get('items', [])

        # Process and format the results
        formatted_results = []
        for result in images_results:
            formatted_result = {
                "title": unquote(result.get("title", "")),
                "source": result.get("writerTitle", ""),
                "where": result.get("source", ""),
                "is_gif": result.get("is_gif", False),
                "width": result.get("orgWidth", 0),
                "height": result.get("orgHeight", 0),
                "thumbnail_width": result.get("thumbWidth", 0),
                "thumbnail_height": result.get("thumbHeight", 0),
                "thumbnail": result.get("thumb", ""),
                "original": result.get("originalUrl", ""),
                "link": result.get("link", "")
            }
            formatted_results.append(formatted_result)

        return formatted_results

    return None

# URL to scrape
url = "https://s.search.naver.com/p/c/image/search.naver?query=coffee&json_type=6&display=50&start=1&_callback=serpapi"

# Scrape the image results
results = get_image_results(url)

# Save results as JSON
if results:
    with open("image_results.json", "w") as json_file:
        json.dump({"images_results": results}, json_file, indent=4)
    print("Image results saved as image_results.json")
else:
    print("No image results found.")



Ending

Check out our Naver Images API, and don't miss our previous blog post, Scrape Naver video results.

If you have any further questions regarding SerpApi, please contact us: contact@serpapi.com