Today we are going to explore together how to scrape image results from Naver, the Korean search engine.
Naver Images can be an alternative to Google Images if you want one.
We will go through the process step by step until we have a scraper that provides JSON results identical to what we provide on our API: Naver Images API documentation
Introduction
By leveraging the power of web scraping and the Requests library, we can extract valuable information from Naver's image search results and save it in a structured format (JSON in our case). Whether you need to gather data for research purposes or build a dataset for machine learning, this script will be handy.
Today we are only going to talk about scraping the results as JSON; in future articles we will discuss how to handle the HTML output.
Understanding and writing the script
To begin with, we import the necessary libraries: requests, json, re, and urllib. The requests library enables us to send HTTP requests to retrieve web pages, while json facilitates working with JSON data. re helps us extract the required JSON data from the response, and urllib assists in unquoting URL-encoded strings.
import requests
import json
import re
import urllib.parse
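Most people who try to scrape Naver image results find it difficult, because Naver does not embed the image results in the HTML page as other search engines do. Instead, it returns them from a separate API endpoint as a JS file: a JSON payload wrapped in a callback function. The main part of the script requests that endpoint directly and passes the results to two functions, which are explained in the next steps: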
search_url = "https://s.search.naver.com/p/c/image/search.naver?query=coffee&json_type=6&display=50&start=1&_callback=serpapi"
results = get_image_results(search_url)
save_as_json(results)
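The query string in that URL can also be assembled programmatically rather than typed by hand. The following is a minimal sketch that uses only the parameters visible in the URL above. `build_search_url` is a hypothetical helper name, and reading `display` as the number of results and `start` as the 1-based index of the first result is an assumption based on the parameter names, not on Naver documentation.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL used in this article.
BASE_URL = "https://s.search.naver.com/p/c/image/search.naver"


def build_search_url(query, start=1, display=50, callback="serpapi"):
    """Assemble a search URL (hypothetical helper, not part of any Naver SDK)."""
    params = {
        "query": query,         # search term; urlencode percent-encodes it
        "json_type": 6,         # value copied as-is from the article's URL
        "display": display,     # assumed: number of results per request
        "start": start,         # assumed: 1-based index of the first result
        "_callback": callback,  # name of the JSONP wrapper function
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("coffee"))
```

Called with "coffee" and the default arguments, this reproduces the URL used above. A non-ASCII query, such as a Korean search term, is percent-encoded automatically by `urlencode`.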
Once we have the image results, we proceed to save them as a JSON file. We check if there are results available and, if so, write them to a file named image_results.json using the json.dump method.
def save_as_json(image_results):
    if image_results:
        with open('image_results.json', 'w') as json_file:
            json.dump(image_results, json_file, indent=4)
        print("Image results saved successfully as image_results.json")
    else:
        print("No image results to save.")
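Because Naver results typically contain Korean text, it helps to know that `json.dump` escapes non-ASCII characters as `\uXXXX` sequences by default. The variant below is a sketch, not the article's original function: `save_as_json_utf8` and its `path` parameter are names introduced here for illustration. It passes `ensure_ascii=False` and opens the file as UTF-8 so that titles stay readable in the saved file.

```python
import json


def save_as_json_utf8(image_results, path="image_results.json"):
    """Write results to `path`, keeping non-ASCII text human-readable.

    Hypothetical variant of save_as_json; returns True if a file was written.
    """
    if not image_results:
        print("No image results to save.")
        return False
    # An explicit encoding avoids depending on the platform default.
    with open(path, "w", encoding="utf-8") as json_file:
        # ensure_ascii=False writes the characters themselves rather than
        # their \uXXXX escape sequences.
        json.dump(image_results, json_file, ensure_ascii=False, indent=4)
    print(f"Image results saved successfully as {path}")
    return True
```

Both forms load back to the same Python objects with `json.load`; the difference is only in how the file looks when opened in an editor.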
This function, get_image_results, takes a URL as input, sends an HTTP GET request to that URL using requests.get, and retrieves the response. It then uses a regular expression pattern (re.search) to extract the JSON data from the response text. The extracted JSON data is converted to a Python dictionary using json.loads.
Next, we initialize an empty list called image_results to store the formatted image results. We iterate over each item in the data dictionary's 'items' list. If an item has a 'title', we extract the required fields from the item and create a dictionary called image_result. We assign the corresponding values to the keys in the image_result dictionary.
def get_image_results(url):
    response = requests.get(url)
    json_data = re.search(r'serpapi\(({.*?})\)', response.text).group(1)
    data = json.loads(json_data)
    image_results = []
    for item in data.get('items', []):
        if 'title' in item:
            image_result = {
                'title': item['title'],
                'source': item['writerTitle'],
                'where': item['source'],
                'is_gif': item['is_gif'],
                'width': item['orgWidth'],
                'height': item['orgHeight'],
                'thumbnail_width': item['thumbWidth'],
                'thumbnail_height': item['thumbHeight'],
                'thumbnail': urllib.parse.unquote(item['thumb']),
                'original': item['originalUrl'],
                'link': item['link']
            }
            image_results.append(image_result)
    return image_results
After extracting the necessary information from each item, we append the image_result dictionary to the image_results list. Finally, we return the image_results list.
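The extraction step can be tried offline, without calling Naver. The sketch below runs against a small hand-written sample string, not real Naver output, and `unwrap_jsonp` is a hypothetical helper. It differs from the function above in two ways, both of which are design choices made for this example rather than the article's method: it returns `None` when the wrapper is not found instead of raising `AttributeError`, and it uses a greedy match with `re.DOTALL` so that a `})` sequence or a newline inside the payload does not cut the JSON short.

```python
import json
import re


def unwrap_jsonp(text, callback="serpapi"):
    """Return the JSON payload of `callback({...})` as a dict, or None."""
    # Greedy `.*` runs to the last `})`; DOTALL lets `.` span newlines.
    pattern = re.escape(callback) + r"\((\{.*\})\)"
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


# Hand-written sample for illustration only (not a real Naver response).
sample = 'serpapi({"items": [{"title": "Latte (iced})", "link": "https://example.com"}]});'

data = unwrap_jsonp(sample)
print(data["items"][0]["title"])
print(unwrap_jsonp("<html>blocked</html>"))
```

The sample title deliberately contains `})`: a non-greedy pattern would stop there and hand `json.loads` a truncated string, while the greedy version captures the whole object.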
Conclusion
Scraping image results from Naver is as easy as scraping any other search engine, and may be even easier. In our case, the biggest obstacle was discovering how Naver handles images in the backend, which turned out to be the JS file. Other engines may use different methods; Home Depot, for example, uses GraphQL, which is similar to what Naver Images does.
The full and final script:
import requests
import json
import re
from urllib.parse import unquote

def get_image_results(url):
    response = requests.get(url)
    data = response.text
    # Extract the JSON data from the response
    json_data = re.search(r'serpapi\(({.*?})\)', data)
    if json_data:
        json_string = json_data.group(1)
        data_dict = json.loads(json_string)
        # Extract the images results
        images_results = data_dict.get('items', [])
        # Process and format the results
        formatted_results = []
        for result in images_results:
            formatted_result = {
                "title": unquote(result.get("title", "")),
                "source": result.get("writerTitle", ""),
                "where": result.get("source", ""),
                "is_gif": result.get("is_gif", False),
                "width": result.get("orgWidth", 0),
                "height": result.get("orgHeight", 0),
                "thumbnail_width": result.get("thumbWidth", 0),
                "thumbnail_height": result.get("thumbHeight", 0),
                "thumbnail": result.get("thumb", ""),
                "original": result.get("originalUrl", ""),
                "link": result.get("link", "")
            }
            formatted_results.append(formatted_result)
        return formatted_results
    return None

# URL to scrape
url = "https://s.search.naver.com/p/c/image/search.naver?query=coffee&json_type=6&display=50&start=1&_callback=serpapi"

# Scrape the image results
results = get_image_results(url)

# Save results as JSON
if results:
    with open("image_results.json", "w") as json_file:
        json.dump({"images_results": results}, json_file, indent=4)
    print("Image results saved as image_results.json")
else:
    print("No image results found.")
Ending
Check out our Naver Images API and don't miss our previous blog post, Scrape Naver video results.
- SerpApi Status (check the performance of our APIs)
- SerpApi Documentation (browse all of our APIs)
- SerpApi Playground (try out some searches)
If you have any further questions regarding SerpApi please contact us: contact@serpapi.com