Scrape DuckDuckGo Inline Images with Python
What will be scraped
DuckDuckGo Inline Images API
The first difference you'll notice is that the API returns 30 results, rather than the ~8-10 results from the DIY solution below.
Beyond that, all you need to do is iterate over the structured JSON, without working out how to scrape data without rendering the page, how to bypass blocks, or how to maintain the parser over time.
import json
from serpapi import GoogleSearch

# Search parameters; "engine" selects DuckDuckGo and "kl" sets the region/language
params = {
    "api_key": "...",  # https://serpapi.com/manage-api-key
    "engine": "duckduckgo",
    "q": "elon musk dogecoin",
    "kl": "us-en"
}

search = GoogleSearch(params)
results = search.get_dict()

# Pretty-print only the inline images part of the response
print(json.dumps(results['inline_images'], indent=2, ensure_ascii=False))
----------------------
'''
[
  {
    "position": 1,
    "title": "'Dogefather' Elon Musk Tweets in Support of the ...",
    "link": "https://gadgets.ndtv.com/cryptocurrency/news/elon-musk-dogecoin-price-cryptocurrency-bitcoin-ethereum-ether-twitter-tweet-support-market-gain-2483505",
    "thumbnail": "https://tse1.mm.bing.net/th?id=OIF.ryyLYCT1jVMZDADJDf1LVA&pid=Api",
    "image": "https://i.gadgets360cdn.com/large/elon_musk_reuters_1610084738222.jpg"
  }
  ...
  {
    "position": 20,
    "title": "Beware! Your love for Elon Musk and Dogecoin may land you ...",
    "link": "http://www.businesstelegraph.co.uk/beware-your-love-for-elon-musk-and-dogecoin-may-land-you-in-a-scam-economic-times/",
    "thumbnail": "https://tse1.mm.bing.net/th?id=OIF.Y4geZY10AJX80AvM8EPCjQ&pid=Api",
    "image": "http://www.businesstelegraph.co.uk/wp-content/uploads/2021/07/Beware-Your-love-for-Elon-Musk-and-Dogecoin-may-land.jpg"
  }
]
'''
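Iterating over the returned results can be sketched as follows. This is a minimal illustration rather than part of the original code: `sample_results` is a hand-trimmed stand-in for `search.get_dict()` built from the output above, and `extract_images` is a hypothetical helper name.

```python
# Hand-trimmed stand-in for search.get_dict(), using two entries from the output above
sample_results = {
    "inline_images": [
        {
            "position": 1,
            "title": "'Dogefather' Elon Musk Tweets in Support of the ...",
            "thumbnail": "https://tse1.mm.bing.net/th?id=OIF.ryyLYCT1jVMZDADJDf1LVA&pid=Api",
            "image": "https://i.gadgets360cdn.com/large/elon_musk_reuters_1610084738222.jpg",
        },
        {
            "position": 20,
            "title": "Beware! Your love for Elon Musk and Dogecoin may land you ...",
            "thumbnail": "https://tse1.mm.bing.net/th?id=OIF.Y4geZY10AJX80AvM8EPCjQ&pid=Api",
            "image": "http://www.businesstelegraph.co.uk/wp-content/uploads/2021/07/Beware-Your-love-for-Elon-Musk-and-Dogecoin-may-land.jpg",
        },
    ]
}


def extract_images(results):
    """Return (position, title, image URL) tuples; missing keys become None."""
    return [
        (item.get("position"), item.get("title"), item.get("image"))
        for item in results.get("inline_images", [])
    ]


for position, title, image in extract_images(sample_results):
    print(f"{position}. {title}\n{image}\n")
```

Using `.get()` with a default keeps the loop from raising a `KeyError` if a response happens to contain no `inline_images` key.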
Process
The process closely follows the other DuckDuckGo blog posts in our series.
Select the CSS selectors for the container, title, link, thumbnail, and image URL, then use the .get_attribute() method on them to grab the alt, data-id, src, and href attributes.
The SelectorGadget Chrome extension was used in the GIF above to pick the CSS selectors.
Code
from selenium import webdriver
import re, urllib.parse

# Note: written against Selenium 3. Selenium 4 deprecated (and later removed)
# `executable_path` and the find_element(s)_by_* helpers used below.
driver = webdriver.Chrome(executable_path='path/to/chromedriver.exe')
driver.get('https://duckduckgo.com/?q=elon musk dogecoin&kl=us-en&ia=web')

# Each inline image sits inside a .js-images-link container
for result in driver.find_elements_by_css_selector('.js-images-link'):
    title = result.find_element_by_css_selector('.js-images-link a img').get_attribute('alt')
    link = result.find_element_by_css_selector('.js-images-link a').get_attribute('href')
    thumbnail_encoded = result.find_element_by_css_selector('.js-images-link a img').get_attribute('src')

    # Extract the percent-encoded thumbnail URL from the DuckDuckGo proxy URL
    # https://regex101.com/r/4pgG5m/1
    match_thumbnail_urls = ''.join(re.findall(r'https\:\/\/external\-content\.duckduckgo\.com\/iu\/\?u\=(.*)&f=1', thumbnail_encoded))

    # Decode the URL and drop the height parameter
    # https://www.kite.com/python/answers/how-to-decode-a-utf-8-url-in-python
    thumbnail = urllib.parse.unquote(match_thumbnail_urls).replace('&h=160', '')

    # The full-size image URL is stored in the container's data-id attribute
    image = result.get_attribute('data-id')

    print(f'{title}\n{link}\n{thumbnail}\n{image}\n')

driver.quit()
--------------------------
'''
Dogecoin (DOGE) Price Crash Below Key Support and Even ...
https://duckduckgo.com/?q=elon%20musk%20dogecoin&iax=images&ia=images&iai=https://cdn.coingape.com/wp-content/uploads/2021/07/02195033/dogecoin-elon-musk-snl-memes.jpg&kl=us-en
https://tse1.mm.bing.net/th?id=OIF.UGa1KGFCz%2f5axclMfq0k4w&pid=Api
https://cdn.coingape.com/wp-content/uploads/2021/07/02195033/dogecoin-elon-musk-snl-memes.jpg
...
'''
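As an alternative to the regular expression used in the code above, the thumbnail URL can also be pulled out of the proxied `src` with `urllib.parse`. This is a minimal sketch, assuming the `src` attribute has the form matched by that regex (`https://external-content.duckduckgo.com/iu/?u=<encoded URL>&f=1`). The helper name `original_thumbnail_url` and the sample `src` value are hypothetical; the sample was constructed to decode to the thumbnail shown in the output above.

```python
from urllib.parse import urlsplit, parse_qs


def original_thumbnail_url(proxied_src):
    """Return the upstream thumbnail URL carried in the proxy's 'u' parameter.

    Falls back to the input unchanged when no 'u' parameter is present.
    """
    query = parse_qs(urlsplit(proxied_src).query)
    values = query.get("u")
    if not values:
        return proxied_src
    # parse_qs already percent-decodes values, so no separate unquote() is needed;
    # the height parameter is dropped the same way as in the code above
    return values[0].replace("&h=160", "")


# Hypothetical src value, constructed to match the regex used in the code above
sample_src = (
    "https://external-content.duckduckgo.com/iu/?u="
    "https%3A%2F%2Ftse1.mm.bing.net%2Fth%3Fid%3DOIF.UGa1KGFCz%252f5axclMfq0k4w"
    "%26pid%3DApi%26h%3D160&f=1"
)
print(original_thumbnail_url(sample_src))
```

Parsing the query string avoids depending on the exact order of the `u` and `f` parameters, which the regex assumes.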