Naver is the most widely used search platform in South Korea, where it is used more than Google, according to blog posts from Link Assistant and Croud.

What will be scraped

All News Results from the first page.

what will be scraped from Naver News results

Prerequisites

pip install requests lxml beautifulsoup4 google-search-results
  • Basic knowledge of Python.
  • Basic familiarity with the packages mentioned above.
  • Basic understanding of CSS selectors, since you'll mostly see usage of the select()/select_one() beautifulsoup methods, which accept CSS selectors.

I wrote a dedicated blog post about how to use CSS selectors when web scraping with Python that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
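As a quick illustration of those two methods (using a made-up HTML snippet that only mimics Naver's news-list structure, not its actual markup):

```python
from bs4 import BeautifulSoup

# a tiny, made-up HTML snippet mimicking Naver's news list structure
html = """
<div class="list_news">
  <div class="bx"><a class="news_tit" href="https://example.com/1">First title</a></div>
  <div class="bx"><a class="news_tit" href="https://example.com/2">Second title</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # "lxml" works the same way

# select() returns a list of ALL elements matching the CSS selector
titles = [a.text for a in soup.select(".list_news .news_tit")]

# select_one() returns only the FIRST matching element (or None if nothing matches)
first_link = soup.select_one(".news_tit")["href"]

print(titles)      # ['First title', 'Second title']
print(first_link)  # https://example.com/1
```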

As an alternative, you can achieve the same as in the DIY solution below by using SerpApi. SerpApi is a paid API with a free plan.

The difference is that there's no need to code the parser from scratch and maintain it over time (if something changes in the HTML), figure out which selectors to use, or how to bypass blocks from search engines.

Install SerpApi library

pip install google-search-results

Example code to integrate:

from serpapi import GoogleSearch
import json

params = {
    "api_key": "...",                 # https://serpapi.com/manage-api-key
    "engine": "naver",                # naver search engine
    "query": "Minecraft",             # search query
    "where": "news"                   # news results
}

search = GoogleSearch(params)  # where extraction happens
results = search.get_dict()    # JSON -> Python dict 

news_data = []

for news_result in results["news_results"]:
    title = news_result["title"]
    link = news_result["link"]
    thumbnail = news_result["thumbnail"]
    snippet = news_result["snippet"]
    press_name = news_result["news_info"]["press_name"]
    date_news_posted = news_result["news_info"]["news_date"]

    news_data.append({
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "snippet": snippet,
        "press_name": press_name,
        "news_date": date_news_posted
    })

print(json.dumps(news_data, indent=2, ensure_ascii=False))

Import the serpapi and json libraries

from serpapi import GoogleSearch
import json  # in this case used for pretty printing

Define search parameters

Note that these parameters will differ depending on which "engine" you're using (except "api_key" and "query", which in this case stay the same).

params = {
    "api_key": "...",                  # https://serpapi.com/manage-api-key
    "engine": "naver",                 # search engine
    "query": "Minecraft",              # search query
    "where": "news"                    # news results filter
    # other parameters
}

Create a list() to temporarily store the data

news_data = []
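One step from the full example above is worth repeating here: before iterating, the search itself has to run, which produces the results dictionary used below:

```python
search = GoogleSearch(params)  # where data extraction happens
results = search.get_dict()    # JSON -> Python dict
```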

Iterate over each result in ["news_results"], and store it in the news_data list().

The difference here is that instead of calling CSS selectors, we're extracting data from the dictionary (provided by SerpApi) by key.

for news_result in results["news_results"]:
    title = news_result["title"]
    link = news_result["link"]
    thumbnail = news_result["thumbnail"]
    snippet = news_result["snippet"]
    press_name = news_result["news_info"]["press_name"]
    date_news_posted = news_result["news_info"]["news_date"]

    news_data.append({
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "snippet": snippet,
        "press_name": press_name,
        "news_date": date_news_posted
    })
print(json.dumps(news_data, indent=2, ensure_ascii=False))
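One thing to keep in mind when extracting by key: a result might not contain every field (a result without a thumbnail, for instance — a hypothetical case, not taken from the output below). Indexing with [] raises a KeyError on a missing key, while dict.get() returns None (or a default) instead:

```python
# a made-up result that is missing the "thumbnail" key
news_result = {"title": "Example", "news_info": {"press_name": "Press"}}

thumbnail = news_result.get("thumbnail")                     # None instead of a KeyError
press_name = news_result.get("news_info", {}).get("press_name")

print(thumbnail)   # None
print(press_name)  # Press
```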


Part of the output:
'''
[
  {
    "title": "Xbox, 11월부터 블록버스터 게임 연이어 출시",
    "link": "http://www.gameshot.net/common/con_view.php?code=GA617793ce93c74",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fimgnews.pstatic.net%2Fimage%2Forigin%2F5739%2F2021%2F10%2F26%2F19571.jpg&type=ofullfill264_180_gray&expire=2&refresh=true",
    "snippet": "  마인크래프트(Minecraft) – 11월 3일(한국 시간) 마인크래프트는 11월 3일 Xbox Game Pass PC용에 추가될 예정이며, 새로운 마인크래프트 던전스 시즈널 어드벤처(Minecraft Dungeons Seasonal Adventures), 동굴과... ",
    "press_name": "게임샷",
    "news_date": "6일 전"
  }
  # other results...
]
'''

Full DIY Code

Have a look at the second function, which makes an actual request to Naver search with the passed query parameters. Test it in the online IDE yourself.

import lxml, json, requests
from bs4 import BeautifulSoup


headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
}

params = {
    "query": "minecraft",
    "where": "news",
}


# function that parses content from local copy of html
def extract_news_from_html():
    with open("minecraft_naver_news.html", mode="r", encoding="utf-8") as html_file:
        html = html_file.read()

        # calls naver_parser() function to parse the page
        data = naver_parser(html)

        print(json.dumps(data, indent=2, ensure_ascii=False))


# function that makes an actual request
def extract_naver_news_from_url():
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # calls naver_parser() function to parse the page
    data = naver_parser(html)

    print(json.dumps(data, indent=2, ensure_ascii=False))


# parser that accepts an HTML string from extract_news_from_html() or extract_naver_news_from_url()
def naver_parser(html):
    soup = BeautifulSoup(html, "lxml")

    news_data = []

    for news_result in soup.select(".list_news .bx"):
        title = news_result.select_one(".news_tit").text
        link = news_result.select_one(".news_tit")["href"]
        thumbnail = news_result.select_one(".dsc_thumb img")["src"]
        snippet = news_result.select_one(".news_dsc").text

        press_name = news_result.select_one(".info.press").text
        news_date = news_result.select_one("span.info").text

        news_data.append({
            "title": title,
            "link": link,
            "thumbnail": thumbnail,
            "snippet": snippet,
            "press_name": press_name,
            "news_date": news_date
        })
      
    return news_data

DIY Process

There aren't many steps that need to be done. We need to:

  1. Make a request and save HTML locally (optional).
  2. Find CSS selectors or HTML elements from where to extract data.
  3. Extract data.

Make a request and save HTML locally

Why save locally?

The main point of this is to make sure that your IP won't be banned or blocked for some time, which would delay the script development process.

When requests are sent constantly from the same IP (a regular user wouldn't do that), this can be detected as unusual behavior and the IP blocked or banned to protect the website.

Try to save HTML locally first, test everything you need there, and then start making actual requests.

import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
}

params = {
    "query": "minecraft",
    "where": "news",
}

html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

with open(f"{params['query']}_naver_news.html", mode="w", encoding="utf-8") as file:
    file.write(html)

Import a requests library

import requests

Add user-agent

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
}

Add search query parameters

params = {
    "query": "minecraft",  # search query
    "where": "news",       # news results
}

Pass user-agent and query params

Pass the user-agent to the request headers, and pass the query params while making the request.

You can read more in-depth about it in the article I wrote about why it's a good idea to pass a user-agent to the request headers.

After the request is made, we receive a response, which is decoded via .text.

html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

Save HTML locally

with open(f"{params['query']}_naver_news.html", mode="w", encoding="utf-8") as file:
    file.write(html)

# output file will be minecraft_naver_news.html

Find the correct selectors or HTML elements

Get a CSS selector of the container with all the needed data, such as title, link, etc.

Gif that shows which selectors being used as a container

for news_result in soup.select(".list_news .bx"):
    # further code

Get a CSS selector for the title, link, etc. that will be used in the extracting part

Gif that shows which selectors being used for title, link, snippet, thumbnail and other data

for news_result in soup.select(".list_news .bx"):

    # hey, news_result, grab the TEXT value from the element matching the ".news_tit" selector
    title = news_result.select_one(".news_tit").text

    # hey, news_result, grab the href (link) attribute from the element matching the ".news_tit" selector
    link = news_result.select_one(".news_tit")["href"]
    # other elements..

Extract data

import lxml, json
from bs4 import BeautifulSoup

with open("minecraft_naver_news.html", mode="r", encoding="utf-8") as html_file:
    html = html_file.read()
    soup = BeautifulSoup(html, "lxml")

    news_data = []

    for news_result in soup.select(".list_news .bx"):
        title = news_result.select_one(".news_tit").text
        link = news_result.select_one(".news_tit")["href"]
        thumbnail = news_result.select_one(".dsc_thumb img")["src"]
        snippet = news_result.select_one(".news_dsc").text

        press_name = news_result.select_one(".info.press").text
        news_date = news_result.select_one("span.info").text

        news_data.append({
            "title": title,
            "link": link,
            "thumbnail": thumbnail,
            "snippet": snippet,
            "press_name": press_name,
            "news_date": news_date
        })

    print(json.dumps(news_data, indent=2, ensure_ascii=False))

Print the data using json.dumps(), which in this case is used just for pretty-printing.

print(json.dumps(news_data, indent=2, ensure_ascii=False))

# part of the output
'''
[
  {
    "title": "Xbox, 11월부터 블록버스터 게임 연이어 출시",
    "link": "http://www.gameshot.net/common/con_view.php?code=GA617793ce93c74",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fimgnews.pstatic.net%2Fimage%2Forigin%2F5739%2F2021%2F10%2F26%2F19571.jpg&type=ofullfill264_180_gray&expire=2&refresh=true",
    "snippet": "  마인크래프트(Minecraft) – 11월 3일(한국 시간) 마인크래프트는 11월 3일 Xbox Game Pass PC용에 추가될 예정이며, 새로운 마인크래프트 던전스 시즈널 어드벤처(Minecraft Dungeons Seasonal Adventures), 동굴과... ",
    "press_name": "게임샷",
    "news_date": "6일 전"
  }
  # other results...
]
'''
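The ensure_ascii=False used throughout is what keeps the Korean text in the output readable; with the default ensure_ascii=True, json.dumps() escapes every non-ASCII character:

```python
import json

data = {"press_name": "게임샷"}

# default behavior escapes non-ASCII characters
print(json.dumps(data))                      # {"press_name": "\uac8c\uc784\uc0f7"}

# ensure_ascii=False keeps them as-is
print(json.dumps(data, ensure_ascii=False))  # {"press_name": "게임샷"}
```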

Access the newly extracted data

for news in news_data:
    title = news["title"] 
    # link, snippet, thumbnail.. 
    print(title)
    
    # prints all titles that were appended to the list()

Links

  1. Code in the online IDE
  2. Naver News Results API
  3. SelectorGadget
  4. An introduction to Naver
  5. Google Vs. Naver: Why Can’t Google Dominate Search in Korea?

Join us on Reddit | Twitter | YouTube