Scrape Qwant Organic and Ad Results using Python

What is Qwant Search?

Qwant is a European Paris-based no user tracking for advertising search engine with its independent indexing engine and available in 26 languages with more than 30 million individual monthly users worldwide.

What will be scraped

Prerequisites

Basic knowledge scraping with CSS selectors

If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what it is, pros and cons, and why they matter from a web-scraping perspective.

CSS selectors declare which part of the markup a style applies to, thus allowing you to extract data from matching tags and attributes.

Separate virtual environment

If you didn't work with a virtual environment before, have a look at the dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post of mine to get familiar.

In short, it's a thing that creates an independent set of installed libraries including different Python versions that can coexist with each other at the same system thus preventing libraries or Python version conflicts.

Install libraries:

pip install requests
pip install lxml 
pip install beautifulsoup4

Reduce the chance of being blocked

There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web-scraping, there's eleven methods to bypass blocks from most websites.

Starting code for both organic and ad results:

from bs4 import BeautifulSoup
import requests, lxml, json

headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 10; HD1913) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Mobile Safari/537.36 EdgA/46.1.2.5140"
}

params = {
    "q": "minecraft",
    "t": "web"
}


html = requests.get("https://www.qwant.com/", params=params, headers=headers, timeout=20)
soup = BeautifulSoup(html.text, "lxml")
# further code...

Import libraries:

from bs4 import BeautifulSoup
import requests, lxml, json

Add user-agent and query parameters to request:

headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 10; HD1913) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Mobile Safari/537.36 EdgA/46.1.2.5140"
}

params = {
    "q": "minecraft",  # search query
    "t": "web"         # qwant query argument for displaying web results 
}

Make a request, add timeout argument, create BeautifulSoup() object:

html = requests.get("https://www.qwant.com/", params=params, headers=headers, timeout=20)
soup = BeautifulSoup(html.text, "lxml")

timeout parameter will tell requests to stop waiting for response after an X number of seconds.
BeautifulSoup() is what pulls all the HTML data. lxml is an HTML parser.

Extract Organic Results

def scrape_organic_results():

    organic_results_data = []

    for index, result in enumerate(soup.select("[data-testid=webResult]"), start=1):
        title = result.select_one(".WebResult-module__title___MOBFg").text
        link = result.select_one(".Stack-module__VerticalStack___2NDle.Stack-module__Spacexxs___3wU9G a")["href"]
        snippet = result.select_one(".Box-module__marginTopxxs___RMB_d").text

        try:
            displayed_link = result.select_one(".WebResult-module__permalink___MJGeh").text
            favicon = result.select_one(".WebResult-module__iconBox___3DAv5 img")["src"]
        except:
            displayed_link = None
            favicon = None

        organic_results_data.append({
            "position": index,
            "title": title,
            "link": link,
            "displayed_link": displayed_link,
            "snippet": snippet,
            "favicon": favicon
        })

    print(json.dumps(organic_results_data, indent=2))


scrape_oragnic_results()

Create temporary list() to store extracted data:

organic_results_data = []

Iterate and extract the data:

for index, result in enumerate(soup.select("[data-testid=webResult]"), start=1):
    title = result.select_one(".WebResult-module__title___MOBFg").text
    link = result.select_one(".Stack-module__VerticalStack___2NDle.Stack-module__Spacexxs___3wU9G a")["href"]
    snippet = result.select_one(".Box-module__marginTopxxs___RMB_d").text

    try:
        displayed_link = result.select_one(".WebResult-module__permalink___MJGeh").text
        favicon = result.select_one(".WebResult-module__iconBox___3DAv5 img")["src"]
    except:
        displayed_link = None
        favicon = None

To get the position index, we can use enumerate() function which adds a counter to an iterable and returns it and set start to 1 so the count would start from 1, not from 0.

To handle None values, we can use try/except block so if there's nothing on the Qwant backend, we'll set it to None as well, otherwise it will throw an error saying that there's no such element or attribute.

Append extracted data to temporary list() as a dictionary:

organic_results_data.append({
    "position": index,
    "title": title,
    "link": link,
    "displayed_link": displayed_link,
    "snippet": snippet,
    "favicon": favicon
})

Print the data:

print(json.dumps(organic_results_data, indent=2))


# part of the output:
'''
[
  {
    "position": 1,
    "title": "Minecraft Official Site | Minecraft",
    "link": "https://www.minecraft.net/",
    "displayed_link": "minecraft.net",
    "snippet": "Get all-new items in the Minecraft Master Chief Mash-Up DLC on 12/10, and the Superintendent shirt in Character Creator, free for a limited time! Learn more. Climb high and dig deep. Explore bigger mountains, caves, and biomes along with an increased world height and updated terrain generation in the Caves & Cliffs Update: Part II! Learn more . Play Minecraft games with Game Pass. Get your ...",
    "favicon": "https://s.qwant.com/fav/m/i/www_minecraft_net.ico"
  },
  
  ... other results
  
  {
    "position": 10,
    "title": "Minecraft - download free full version game for PC ...",
    "link": "http://freegamepick.net/en/minecraft/",
    "displayed_link": "freegamepick.net",
    "snippet": "Minecraft Download Game Overview. Minecraft is a game about breaking and placing blocks. It's developed by Mojang. At first, people built structures to protect against nocturnal monsters, but as the game grew players worked together to create wonderful, imaginative things. It can als o be about adventuring with friends or watching the sun rise over a blocky ocean.",
    "favicon": "https://s.qwant.com/fav/f/r/freegamepick_net.ico"
  }
]
'''

Extract Ad Results

def scrape_ad_results():

    ad_results_data = []

    for index, ad_result in enumerate(soup.select("[data-testid=adResult]"), start=1):
        ad_title = ad_result.select_one(".WebResult-module__title___MOBFg").text
        ad_link = ad_result.select_one(".Stack-module__VerticalStack___2NDle a")["href"]
        ad_displayed_link = ad_result.select_one(".WebResult-module__domain___1LJmo").text
        ad_snippet = ad_result.select_one(".Box-module__marginTopxxs___RMB_d").text
        ad_favicon = ad_result.select_one(".WebResult-module__iconBox___3DAv5 img")["src"]

        ad_results_data.append({
            "ad_position": index,
            "ad_title": ad_title,
            "ad_link": ad_link,
            "ad_displayed_link": ad_displayed_link,
            "ad_snippet": ad_snippet,
            "ad_favicon": ad_favicon
        })

    print(json.dumps(ad_results_data, indent=2))
    

scrape_ad_results()

Create temporary list() to store extracted data:

ad_results_data = []

Iterate and extract:

for index, ad_result in enumerate(soup.select("[data-testid=adResult]"), start=1):
    ad_title = ad_result.select_one(".WebResult-module__title___MOBFg").text
    ad_link = ad_result.select_one(".Stack-module__VerticalStack___2NDle a")["href"]
    ad_displayed_link = ad_result.select_one(".WebResult-module__domain___1LJmo").text
    ad_snippet = ad_result.select_one(".Box-module__marginTopxxs___RMB_d").text
    ad_favicon = ad_result.select_one(".WebResult-module__iconBox___3DAv5 img")["src"]

The same approach was used to get the position index. The only difference is different CSS "container" selector [data-testid=adResult] while in organic results it's [data-testid=webResult].

Append extracted data to temporary list() as a dictionary:

ad_results_data.append({
    "ad_position": index,
    "ad_title": ad_title,
    "ad_link": ad_link,
    "ad_displayed_link": ad_displayed_link,
    "ad_snippet": ad_snippet
})