Scrape Naver Related Search Results with Python
What will be scraped
Prerequisites
Basic knowledge of scraping with CSS selectors
If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
CSS selectors declare which part of the markup a style applies to, thus allowing you to extract data from matching tags and attributes.
Separate virtual environment
If you haven't worked with a virtual environment before, have a look at my dedicated blog post, Python virtual environments tutorial using Virtualenv and Poetry, to get familiar.
In short, a virtual environment is an isolated set of installed libraries, optionally tied to its own Python version, that can coexist with other environments on the same system, thus preventing library or Python version conflicts.
Note: This is not a strict requirement for this blog post.
Install libraries
pip install requests parsel
Reduce the chance of being blocked
There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web-scraping; it covers eleven methods to bypass blocks from most websites.
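As one illustration of the simpler methods, the sketch below sets a browser-like User-Agent and retries transient failures with exponential backoff. The `build_session` helper name, the retry count, and the status code list are assumptions chosen for this example, not settings taken from the linked post.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(user_agent: str, retries: int = 3) -> requests.Session:
    """Return a Session with a browser-like User-Agent and automatic retries.

    Hypothetical helper for illustration; tune the values for your own use case.
    """
    session = requests.Session()
    # Sent with every request made through this session.
    session.headers.update({"User-Agent": user_agent})

    retry = Retry(
        total=retries,                               # give up after this many retries
        backoff_factor=1,                            # wait longer between each retry
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these HTTP statuses
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session


session = build_session(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
)
# No request is sent here. To fetch a page, call for example:
# session.get("https://search.naver.com/search.naver", params={...}, timeout=30)
```

A shared session also reuses the underlying connection between requests, which keeps repeated calls to the same host lighter.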
Using Naver Related results API
You can get the same data with the Naver Related results API from SerpApi. It is a paid API with a free plan.
It's almost the same as the DIY solution below, except you don't need to create the parser from scratch, maintain it, bypass blocks from Naver or other search engines, figure out which proxy/CAPTCHA providers are reliable, or work out how to scale it.
from serpapi import NaverSearch
import json

params = {
    "api_key": "...",      # https://serpapi.com/manage-api-key
    "engine": "naver",     # search engine to parse results from
    "query": "minecraft",  # search query
    "where": "web"         # web results
}

search = NaverSearch(params)  # where data extraction happens
results = search.get_dict()   # JSON -> Python dictionary

related_results = []

# iterate over "related_results" and extract position, title and link
for related_result in results["related_results"]:
    related_results.append({
        "position": related_result["position"],
        "title": related_result["title"],
        "link": related_result["link"]
    })

print(json.dumps(related_results, indent=2, ensure_ascii=False))
Output:
[
  {
    "position": 1,
    "title": "마인크래프트",
    "link": "https://search.naver.com?where=nexearch&query=%EB%A7%88%EC%9D%B8%ED%81%AC%EB%9E%98%ED%94%84%ED%8A%B8&ie=utf8&sm=tab_she&qdt=0"
  },
  {
    "position": 2,
    "title": "minecraft 뜻",
    "link": "https://search.naver.com?where=nexearch&query=minecraft+%EB%9C%BB&ie=utf8&sm=tab_she&qdt=0"
  },
  {
    "position": 3,
    "title": "craft",
    "link": "https://search.naver.com?where=nexearch&query=craft&ie=utf8&sm=tab_she&qdt=0"
  },
  {
    "position": 4,
    "title": "mine",
    "link": "https://search.naver.com?where=nexearch&query=mine&ie=utf8&sm=tab_she&qdt=0"
  },
  {
    "position": 5,
    "title": "mojang",
    "link": "https://search.naver.com?where=nexearch&query=mojang&ie=utf8&sm=tab_she&qdt=0"
  }
]
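If a response ever comes back without a "related_results" key (an assumption about edge cases, not something shown in the output above), indexing `results["related_results"]` directly would raise a `KeyError`. A minimal defensive sketch; the `extract_related` helper and the sample dictionary are hypothetical and only mimic the response shape shown above:

```python
def extract_related(results: dict) -> list:
    """Collect position, title and link, tolerating a missing "related_results" key."""
    related = []
    # .get() with a default avoids a KeyError when the block is absent
    for item in results.get("related_results", []):
        related.append({
            "position": item.get("position"),
            "title": item.get("title"),
            "link": item.get("link"),
        })
    return related


# Hypothetical sample that mimics the response shape, with one extra field
sample = {
    "related_results": [
        {
            "position": 1,
            "title": "craft",
            "link": "https://search.naver.com?where=nexearch&query=craft&ie=utf8&sm=tab_she&qdt=0",
            "extra": "ignored",
        }
    ]
}

print(extract_related(sample))
print(extract_related({}))
```

With this helper, the loop in the script above could be replaced by `related_results = extract_related(results)`.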
DIY Code
import json

import requests
from parsel import Selector  # https://parsel.readthedocs.io/

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "query": "minecraft",  # search query
    "where": "web"         # web results. works with nexearch as well
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

related_results = []

# https://www.programiz.com/python-programming/methods/built-in/enumerate
for index, related_result in enumerate(selector.css(".related_srch .keyword"), start=1):
    keyword = related_result.css(".tit::text").get().strip()
    # the href is relative (it starts with "?"), so prepend the search endpoint
    link = f'https://search.naver.com/search.naver{related_result.css("a::attr(href)").get()}'

    related_results.append({
        "position": index,  # 1,2,3..
        "title": keyword,
        "link": link
    })

print(json.dumps(related_results, indent=2, ensure_ascii=False))
Output:
[
  {
    "position": 1,
    "title": "마인크래프트",
    "link": "https://search.naver.com/search.naver?where=nexearch&query=%EB%A7%88%EC%9D%B8%ED%81%AC%EB%9E%98%ED%94%84%ED%8A%B8&ie=utf8&sm=tab_she&qdt=0"
  },
  {
    "position": 2,
    "title": "minecraft 뜻",
    "link": "https://search.naver.com/search.naver?where=nexearch&query=minecraft+%EB%9C%BB&ie=utf8&sm=tab_she&qdt=0"
  },
  {
    "position": 3,
    "title": "craft",
    "link": "https://search.naver.com/search.naver?where=nexearch&query=craft&ie=utf8&sm=tab_she&qdt=0"
  },
  {
    "position": 4,
    "title": "mine",
    "link": "https://search.naver.com/search.naver?where=nexearch&query=mine&ie=utf8&sm=tab_she&qdt=0"
  },
  {
    "position": 5,
    "title": "mojang",
    "link": "https://search.naver.com/search.naver?where=nexearch&query=mojang&ie=utf8&sm=tab_she&qdt=0"
  }
]
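The `query` value in each link is percent-encoded UTF-8, which is why the Korean keywords look unreadable inside the URLs. Below is a small standard-library sketch that joins a relative href onto the search endpoint and decodes the keyword again. The `href` value is reconstructed from the first result above on the assumption that Naver returns it as a query-only relative link, and `urljoin` is a swapped-in alternative to the f-string used in the DIY code, not the original approach.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

BASE_URL = "https://search.naver.com/search.naver"

# Assumed relative href, reconstructed from the first link in the output above
href = (
    "?where=nexearch"
    "&query=%EB%A7%88%EC%9D%B8%ED%81%AC%EB%9E%98%ED%94%84%ED%8A%B8"
    "&ie=utf8&sm=tab_she&qdt=0"
)

# urljoin keeps the base path and swaps in the query string from the relative href
link = urljoin(BASE_URL, href)

# parse_qs decodes percent-encoded values as UTF-8 by default
keyword = parse_qs(urlsplit(link).query)["query"][0]

print(link)
print(keyword)
```

Unlike plain string concatenation, `urljoin` also gives a sensible result if the href turns out to be an absolute URL.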
Links
Add a Feature Request or a Bug