How to Scrape Naver Organic Results with Python
What will be scraped
Prerequisites and Imports
pip install requests
pip install lxml
pip install beautifulsoup4
- Basic knowledge of Python.
- Basic familiarity with the packages mentioned above.
- Basic understanding of CSS selectors, because you'll mostly see usage of the select()/select_one() beautifulsoup methods, which accept CSS selectors.
I wrote a dedicated blog post about web scraping with CSS selectors that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
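If CSS selectors are new to you, here's a minimal, self-contained sketch showing the difference between select() and select_one(). The HTML below is made up, but it mimics the class names you'll see later in this post:

from bs4 import BeautifulSoup

# made-up HTML, only meant to illustrate how CSS selectors are used
html = """
<div class="total_wrap">
  <a class="link_tit" href="https://brucelee.com/">Bruce Lee</a>
  <p class="dsc_txt">Hong Kong and American martial artist...</p>
</div>
"""

soup = BeautifulSoup(html, "lxml")

# select() returns a list of ALL elements that match the CSS selector
print(soup.select(".total_wrap"))

# select_one() returns only the FIRST match (or None if nothing matches)
print(soup.select_one(".dsc_txt").text)  # Hong Kong and American martial artist...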
Naver Web Organic Results API
You can achieve the same results as in the DIY solution below by using our API.
The difference is that there's no need to create the parser from scratch, maintain it, figure out how to bypass blocks from Naver or other search engines, or understand how to scale it. Check out the playground.
Install SerpApi library:
pip install google-search-results
Example code to integrate:
from serpapi import GoogleSearch
import json

def serpapi_get_naver_organic_results():
    params = {
        "api_key": "...",     # https://serpapi.com/manage-api-key
        "engine": "naver",    # search engine (Google, Bing, DuckDuckGo..)
        "query": "Bruce Lee", # search query
        "where": "web"        # organic results
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    data = []

    for result in results["organic_results"]:
        data.append({
            "position": result["position"],
            "title": result["title"],
            "link": result["link"],
            "displayed_link": result["displayed_link"],
            "snippet": result["snippet"]
        })

    print(json.dumps(data, indent=2, ensure_ascii=False))
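To actually run this sketch, just call the function (assuming you've put your own API key into params):

serpapi_get_naver_organic_results()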
Let's see what is happening here
Import the serpapi and json libraries:
from serpapi import GoogleSearch
import json
Pass search parameters as a dictionary ({}):
params = {
    "api_key": "...",     # https://serpapi.com/manage-api-key
    "engine": "naver",    # search engine (Google, Bing, DuckDuckGo..)
    "query": "Bruce Lee", # search query
    "where": "web"        # filter to extract data from organic results
}
Data extraction
This happens under the hood, so you don't have to think too much about these two lines of code.
search = GoogleSearch(params)  # data extraction
results = search.get_dict()    # structured JSON converted to a Python dict that is used below
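If you're curious what else get_dict() returns, results is a plain Python dictionary, so you can inspect it before indexing into it. A quick sketch (the exact keys depend on the engine and query, so treat the output as illustrative):

print(list(results.keys()))  # e.g. ['search_metadata', 'search_parameters', 'organic_results', ...]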
Create a list() to temporarily store the data:
data = []
Iterate over the results and append() the extracted data to the list() as a dictionary ({}):
for result in results["organic_results"]:
    data.append({
        "position": result["position"],
        "title": result["title"],
        "link": result["link"],
        "displayed_link": result["displayed_link"],
        "snippet": result["snippet"]
    })
Print the collected data:
print(json.dumps(data, indent=2, ensure_ascii=False))
# ----------------
# part of the output
'''
[
{
"position": 1,
"title": "Bruce Lee",
"link": "https://brucelee.com/",
"displayed_link": "brucelee.com",
"snippet": "New Podcast Episode: #402 Flowing with Dustin Nguyen Watch + Listen to Episode “Your inspiration continues to guide us toward our personal liberation.” - Bruce Lee - More Podcast Episodes HBO Announces Order For Season 3 of Warrior! WARRIOR Seasons 1 & 2 Streaming Now on HBO & HBO Max “Warrior is still the best show you’re"
}
# other results..
]
'''
DIY Process
If you don't need an explanation, jump to the code section.
We have three steps to complete:
- Save HTML locally to test everything before making a lot of direct requests.
- Pick CSS selectors for all the needed data.
- Extract the data.
Save HTML to test the parser locally (optional)
Saving HTML locally helps prevent your IP address from being blocked or banned, especially when a bunch of requests need to be made to the same website in order to test the code.
A normal user won't make 100+ requests in a very short period of time and won't repeat the exact same actions over and over again (a pattern) the way scripts do, so websites might flag this behavior as unusual and block the IP address for some period (the reason might be written in the response: requests.get("URL").text) or ban it permanently.
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "bruce lee",
    "where": "web"  # there's also a "nexearch" param that will produce different results
}

def save_naver_organic_results():
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # replace every space with an underscore (_) so "bruce lee" becomes "bruce_lee"
    query = params['query'].replace(" ", "_")

    with open(f"{query}_naver_organic_results.html", mode="w") as file:
        file.write(html)
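Calling it once writes the page next to your script; the file name is derived from the query:

save_naver_organic_results()  # creates bruce_lee_naver_organic_results.html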
Now, what's happening here
Import the requests library:
import requests
Add a user-agent and query parameters:
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# query parameters
params = {
    "query": "bruce lee",
    "where": "web"
}
I tend to pass query parameters to requests.get(params=params) instead of leaving them in the URL. I find it more readable. For example, let's look at the exact same URL written both ways:
params = {
    "where": "web",
    "sm": "top_hty",
    "fbm": "1",
    "ie": "utf8",
    "query": "bruce+lee"
}

requests.get("https://search.naver.com/search.naver", params=params)

# VS

requests.get("https://search.naver.com/search.naver?where=web&sm=top_hty&fbm=1&ie=utf8&query=bruce+lee") # Press F.
As for the user-agent, it's needed to make the request look like a "real" user visit; otherwise, the request might be denied. You can read more about it in my other blog post about how to reduce the chance of being blocked while web scraping search engines.
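To see why this matters, note that without an explicit user-agent, requests identifies itself as python-requests/x.y.z, which is trivial for a website to flag. A quick check (assuming httpbin.org is reachable; it simply echoes back what it receives):

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."}  # same idea as the headers above

print(requests.get("https://httpbin.org/user-agent").json())
# {'user-agent': 'python-requests/2.x.x'} <- the default, easy to detect

print(requests.get("https://httpbin.org/user-agent", headers=headers).json())
# {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'}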
Pick and test CSS selectors
Selecting the container (the CSS selector that wraps all the needed data), title, link, displayed link, and a snippet.
The GIF above translates to this code snippet:
for result in soup.select(".total_wrap"):
    title = result.select_one(".total_tit").text.strip()
    link = result.select_one(".total_tit .link_tit")["href"]
    displayed_link = result.select_one(".total_source").text.strip()
    snippet = result.select_one(".dsc_txt").text
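One thing to keep in mind: select_one() returns None when a selector doesn't match anything inside a particular block, and calling .text on None raises an AttributeError. If you run into that, a defensive variant (purely illustrative) looks like this:

for result in soup.select(".total_wrap"):
    title_element = result.select_one(".total_tit")

    # skip blocks that don't contain a title (e.g. widget or ad blocks)
    if title_element is None:
        continue

    title = title_element.text.strip()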
Extract data
import lxml, json
from bs4 import BeautifulSoup

def extract_local_html_naver_organic_results():
    with open("bruce_lee_naver_organic_results.html", mode="r") as html_file:
        html = html_file.read()

    soup = BeautifulSoup(html, "lxml")

    data = []

    for index, result in enumerate(soup.select(".total_wrap")):
        title = result.select_one(".total_tit").text.strip()
        link = result.select_one(".total_tit .link_tit")["href"]
        displayed_link = result.select_one(".total_source").text.strip()
        snippet = result.select_one(".dsc_txt").text

        data.append({
            "position": index + 1,  # starts from 1, not from 0
            "title": title,
            "link": link,
            "displayed_link": displayed_link,
            "snippet": snippet
        })

    print(json.dumps(data, indent=2, ensure_ascii=False))
Now let's break down the extraction part
Import the bs4, lxml, and json libraries:
import lxml, json
from bs4 import BeautifulSoup
Open the saved HTML file, read it, pass it to a BeautifulSoup() object, and assign lxml as the HTML parser:
with open("bruce_lee_naver_organic_results.html", mode="r") as html_file:
    html = html_file.read()

soup = BeautifulSoup(html, "lxml")
Create a temporary list() to store the extracted data:
data = []
Iterate and append the extracted data as a dictionary to the temporary list().
Since we also need to get an index (rank position), we can use the enumerate() function, which adds a counter to an iterable and returns it. More examples.
Example:
grocery = ["bread", "milk", "butter"]  # iterable

for index, item in enumerate(grocery):
    print(f"{index} {item}\n")

'''
0 bread
1 milk
2 butter
'''
Actual code:
# in our case the iterable is soup.select() since it returns an iterable as well
for index, result in enumerate(soup.select(".total_wrap")):
    title = result.select_one(".total_tit").text.strip()
    link = result.select_one(".total_tit .link_tit")["href"]
    displayed_link = result.select_one(".total_source").text.strip()
    snippet = result.select_one(".dsc_txt").text

    data.append({
        "position": index + 1,  # starts from 1, not from 0
        "title": title,
        "link": link,
        "displayed_link": displayed_link,
        "snippet": snippet
    })
Full DIY Code
Now, combining all the functions together, we get four (4) functions:
- The first function saves HTML locally.
- The second function opens local HTML and calls a parser function.
- The third function makes an actual request and calls a parser function.
- The fourth function is a parser that's being called by the second and third functions.
Note: the first and second functions could be skipped if you don't want to save HTML locally, but keep in mind the possible consequences mentioned above.
import requests
import lxml, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "bruce lee",  # search query
    "where": "web"         # "nexearch" will produce different results
}
# function that saves HTML locally
def save_naver_organic_results():
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # replace every space with an underscore (_) so "bruce lee" becomes "bruce_lee"
    query = params['query'].replace(" ", "_")

    with open(f"{query}_naver_organic_results.html", mode="w") as file:
        file.write(html)
# function that opens local HTML and calls a parser function
def extract_naver_organic_results_from_html():
    with open("bruce_lee_naver_organic_results.html", mode="r") as html_file:
        html = html_file.read()

    # calls naver_organic_results_parser() function to parse the page
    data = naver_organic_results_parser(html)

    print(json.dumps(data, indent=2, ensure_ascii=False))
# function that makes an actual request and calls a parser function
def extract_naver_organic_results_from_url():
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # calls naver_organic_results_parser() function to parse the page
    data = naver_organic_results_parser(html)

    print(json.dumps(data, indent=2, ensure_ascii=False))
# parser that's called by the second and third functions
def naver_organic_results_parser(html):
    soup = BeautifulSoup(html, "lxml")

    data = []

    for index, result in enumerate(soup.select(".total_wrap")):
        title = result.select_one(".total_tit").text.strip()
        link = result.select_one(".total_tit .link_tit")["href"]
        displayed_link = result.select_one(".total_source").text.strip()
        snippet = result.select_one(".dsc_txt").text

        data.append({
            "position": index + 1,  # starts from 1, not from 0
            "title": title,
            "link": link,
            "displayed_link": displayed_link,
            "snippet": snippet
        })

    return data
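One possible way to wire these functions together (the commented-out lines show the optional local-HTML workflow; uncomment whichever path you need):

if __name__ == "__main__":
    # 1. cache the page locally first (optional, but saves repeated requests while testing)
    # save_naver_organic_results()

    # 2. parse the cached copy while developing the parser
    # extract_naver_organic_results_from_html()

    # 3. or make a live request and parse it directly
    extract_naver_organic_results_from_url()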