What will be scraped

What will be scraped from Naver Organic Results

Prerequisites and Imports

pip install requests
pip install lxml 
pip install beautifulsoup4
  • Basic knowledge of Python.
  • Basic familiarity of the packages mentioned above.
  • Basic understanding of CSS selectors because you'll see mostly usage of select()/select_one() beautifulsoup methods that accept CSS selectors.

I wrote a dedicated blog about web scraping with CSS selectors to cover what it is, pros and cons, and why they're matter from a web-scraping perspective.


Process

If you don't need an explanation, jump to the code section.

We need to take three steps to make:

  1. Save HTML locally to test everything before making a lot of direct requests.
  2. Pick CSS selectors for all the needed data.
  3. Extract the data.

Save HTML to test the parser locally

Saving HTML locally prevents blocking or banning IP address, especially when a bunch of requests needs to be made to the same website in order to test the code.

A normal user won't do 100+ requests in a very short period of time, and don't do the same thing over and over again (pattern) as scripts do, so websites might tag this behavior as unusual and block IP address for some period (might be written in the response: requests.get("URL").text) or ban permanently.

import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "bruce lee",
    "where": "web"        # theres's also a "nexearch" param that will produce different results
}

def save_naver_organic_results():
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # replacing every space to underline (_) so bruce lee will become bruce_lee 
    query = params['query'].replace(" ", "_")

    with open(f"{query}_naver_organic_results.html", mode="w") as file:
        file.write(html)

Now, what's happening here

Import requests library
import requests

Add user-agent and query parameters

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# query parameters
params = {
    "query": "bruce lee",
    "where": "web"
}

I tend to pass query parameters to requests.get(params=params) instead of leaving them in the URL. I find it more readable, for example, let's look at the exact same URL:

params = {
    "where": "web",
    "sm": "top_hty",
    "fbm": "1",
    "ie": "utf8",
    "query": "bruce+lee"
}
requests.get("https://search.naver.com/search.naver", params=params)

# VS 

requests.get("https://search.naver.com/search.naver?where=web&sm=top_hty&fbm=1&ie=utf8&query=bruce+lee")  # Press F.

What about user-agent, it's needed to act as a "real" user visit otherwise the request might be denied. You can read more about it in my other blog post about how to reduce the chance of being blocked while web scraping search engines.


Pick and test CSS selectors

Selecting container (CSS selector that wraps all needed data), title, link, displayed link, and a snippet.

gif

The GIF above translates to this code snippet:

for result in soup.select(".total_wrap"):
    title = result.select_one(".total_tit").text.strip()
    link = result.select_one(".total_tit .link_tit")["href"]
    displayed_link = result.select_one(".total_source").text.strip()
    snippet = result.select_one(".dsc_txt").text

Extract data

import lxml, json
from bs4 import BeautifulSoup


def extract_local_html_naver_organic_results():
    with open("bruce_lee_naver_organic_results.html", mode="r") as html_file:
        html = html_file.read()
        soup = BeautifulSoup(html, "lxml")

        data = []

        for index, result in enumerate(soup.select(".total_wrap")):
            title = result.select_one(".total_tit").text.strip()
            link = result.select_one(".total_tit .link_tit")["href"]
            displayed_link = result.select_one(".total_source").text.strip()
            snippet = result.select_one(".dsc_txt").text

            data.append({
                "position": index + 1, # starts from 1, not from 0
                "title": title,
                "link": link,
                "displayed_link": displayed_link,
                "snippet": snippet
            })

        print(json.dumps(data, indent=2, ensure_ascii=False))

Now let's break down the extraction part

Import bs4, lxml, json libraries
import lxml, json
from bs4 import BeautifulSoup
Open saved HTML file, read it and pass it to BeautifulSoup() object and assign lxml as an HTML parser
with open("bruce_lee_naver_organic_results.html", mode="r") as html_file:
    html = html_file.read()
    soup = BeautifulSoup(html, "lxml")
Create temporary list() to store extracted data
data = []
Iterate and append as a dictionary to temporary list()

Since we also need to get an index (rank position), we can use enumerate() method which adds a counter to an iterable and returns it. More examples.

Example:

grocery = ["bread", "milk", "butter"]  # iterable

for index, item in enumerate(grocery):
  print(f"{index} {item}\n")
  
'''
0 bread
1 milk
2 butter
'''

Actual code:

# in our case iterable is soup.select() since it returns an iterable as well
for index, result in enumerate(soup.select(".total_wrap")):
    title = result.select_one(".total_tit").text.strip()
    link = result.select_one(".total_tit .link_tit")["href"]
    displayed_link = result.select_one(".total_source").text.strip()
    snippet = result.select_one(".dsc_txt").text

    data.append({
        "position": index + 1,  # starts from 1, not from 0
        "title": title,
        "link": link,
        "displayed_link": displayed_link,
        "snippet": snippet
    })

Full Code

Now when combining all functions together, we'll get four (4) functions:

  • The first function saves HTML locally.
  • The second function opens local HTML and calls a parser function.
  • The third function makes an actual request and calls a parser function.
  • The fourth function is a parser that's being called by the second and third functions.

Note: first and second function could be skipped if you don't really want to do that but take in mind possible consequences that was mentioned above.

import requests
import lxml, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "bruce lee",  # search query
    "where": "web"         # nexearch will produce different results
}


# function that saves HTML locally
def save_naver_organic_results():
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # replacing every spaces so bruce lee will become bruce_lee 
    query = params['query'].replace(" ", "_")

    with open(f"{query}_naver_organic_results.html", mode="w") as file:
        file.write(html)


# fucntion that opens local HTML and calls a parser function
def extract_naver_organic_results_from_html():
    with open("bruce_lee_naver_organic_results.html", mode="r") as html_file:
        html = html_file.read()

        # calls naver_organic_results_parser() function to parse the page
        data = naver_organic_results_parser(html)

        print(json.dumps(data, indent=2, ensure_ascii=False))


# function that make an actual request and calls a parser function
def extract_naver_organic_results_from_url():
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers)

    # calls naver_organic_results_parser() function to parse the page
    data = naver_organic_results_parser(html)

    print(json.dumps(data, indent=2, ensure_ascii=False))


# parser that's being called by 2-3 functions
def naver_organic_results_parser(html):
    soup = BeautifulSoup(html.text, "lxml")

    data = []

    for index, result in enumerate(soup.select(".total_wrap")):
        title = result.select_one(".total_tit").text.strip()
        link = result.select_one(".total_tit .link_tit")["href"]
        displayed_link = result.select_one(".total_source").text.strip()
        snippet = result.select_one(".dsc_txt").text

        data.append({
            "position": index + 1, # starts from 1, not from 0
            "title": title,
            "link": link,
            "displayed_link": displayed_link,
            "snippet": snippet
        })

    return data

Alternatively, you can achieve the same results by using SerpApi which is a paid API with a free plan.

The difference is that there's no need to create the parser from scratch, maintain it, fgiure out how to bypass blocks from Google or other search engines, understanding how to scale it. Check out the playground.

Install SerpApi library:

pip install google-search-results

Example code to integrate:

from serpapi import GoogleSearch
import os, json


def serpapi_get_naver_organic_results():
    params = {
        # https://docs.python.org/3/library/os.html#os.getenv
        "api_key": os.getenv("API_KEY"), # your serpapi api key
        "engine": "naver",     # search engine (Google, Bing, DuckDuckGo..)
        "query": "Bruce Lee",  # search query
        "where": "web"         # organic results
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    data = []

    for result in results["organic_results"]:
        data.append({
            "position": result["position"],
            "title": result["title"],
            "link": result["link"],
            "displayed_link": result["displayed_link"],
            "snippet": result["snippet"]
        })

    print(json.dumps(data, indent=2, ensure_ascii=False))

Let's see what is happening here

Import serpapi, os, json libraries
from serpapi import GoogleSearch
import os, json
Pass search parameters as a dictionary ({})
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "naver",                # search engine (Google, Bing, DuckDuckGo..)
    "query": "Bruce Lee",             # search query
    "where": "web"                    # filter to extract data from organic results
}
Data extraction

This is happening under the hood so you don't have to think about these two lines of code.

search = GoogleSearch(params) # data extraction
results = search.get_dict()   # structured JSON which is being called later
Create a list() to temporary store the data
data = []
Iterate and append() extracted data to a list() as a dictionary ({})
for result in results["organic_results"]:
    data.append({
        "position": result["position"],
        "title": result["title"],
        "link": result["link"],
        "displayed_link": result["displayed_link"],
        "snippet": result["snippet"]
    })
print(json.dumps(data, indent=2, ensure_ascii=False))
    
    
# ----------------
# part of the output
'''
[
  {
    "position": 1,
    "title": "Bruce Lee",
    "link": "https://brucelee.com/",
    "displayed_link": "brucelee.com",
    "snippet": "New Podcast Episode: #402 Flowing with Dustin Nguyen Watch + Listen to Episode “Your inspiration continues to guide us toward our personal liberation.” - Bruce Lee - More Podcast Episodes HBO Announces Order For Season 3 of Warrior! WARRIOR Seasons 1 & 2 Streaming Now on HBO & HBO Max “Warrior is still the best show you’re"
  }
  # other results..
]
'''

If you need more information about the plans, it was explained earlier by SerpApi team member Justin O'Hara in his breakdown of SerpApi’s subscriptions blog post (information is the same except you don't have to login to the SerpApi website).



Join us on Reddit | Twitter | YouTube