Scrape Yahoo Shopping with Python
Through Yahoo Shopping, users can search for a wide array of items, from electronics and clothing to home goods and health products. Its interface offers users the ability to easily compare prices, product specifications, and seller ratings across multiple vendors, making it a convenient tool for online shoppers looking for the best deals.
Additionally, Yahoo Shopping incorporates reviews and ratings, both for individual products and for the vendors selling them, ensuring users can make informed decisions about their purchases.
Introduction
In recent years, Yahoo Shopping has leveraged advanced technologies, such as GraphQL for efficient data retrieval, which has resulted in faster, more responsive user experiences.
This can make Yahoo Shopping harder to scrape, because the product data is returned as JSON from a GraphQL endpoint instead of being embedded in a regular HTML response.
Getting Started
To begin with, let's first understand what we need:
- Python: You should have Python 3.6 or above installed on your machine.
- Libraries: We'll be using the requests library for making HTTP requests and the json library (part of Python's standard library) to work with JSON data.
You can install the requests library using pip:
pip install requests
Overview of Yahoo Shopping's GraphQL API
Yahoo Shopping employs a GraphQL API for its data, which differs from typical REST APIs. GraphQL APIs allow us to specify exactly what data we need, leading to more efficient data retrieval.
We'll use a specific GraphQL query called searchProduct to fetch product data. In this case, we are searching for "coffee."
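To make the idea of requesting only the fields you need more concrete, here is a minimal sketch of assembling a searchProduct request body. The helper name build_search_payload and its parameters are hypothetical and not part of Yahoo's API; the field names and request values are the ones used in the script in this post.

```python
# Field names taken from the query used in this post; which other fields
# the schema exposes is not covered here.
DEFAULT_FIELDS = ("gtin", "title", "price", "salePrice", "image", "vendor")


def build_search_payload(keyword, fields=DEFAULT_FIELDS):
    """Build the JSON body for a searchProduct request (hypothetical helper)."""
    # The selection set lists exactly the item fields we want back.
    selection = " ".join(fields)
    query = (
        "query searchProduct($searchRequest: SearchRequest) "
        "{ search(searchRequest: $searchRequest) "
        "{ totalCount items { " + selection + " } } }"
    )
    return {
        "operationName": "searchProduct",
        "variables": {
            "searchRequest": {
                "keyword": keyword,
                "sourceTypes": ["PRODUCT"],
                "fieldSets": ["ITEMS"],
                "imageSize": "400x400",
                "pageId": "affiliate-shop-srp",
                "siteId": "us-shopping",
                "countryCode": "US",
                "lang": "en",
            }
        },
        "query": query,
    }


# Ask only for titles and prices of "coffee" results.
payload = build_search_payload("coffee", ["title", "price"])
print(payload["query"])
```

Trimming the field list keeps the response small, which is the main practical difference from an endpoint that always returns a fixed shape.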
Scripting
We create a function, shopping_yahoo_gql(), which prepares the GraphQL request and sends it to the Yahoo Shopping API. The request is a dictionary containing information like the operation name (searchProduct), our search keyword (coffee), and other necessary details.
import json
import requests


def shopping_yahoo_gql():
    # Tell the server that the request body is JSON
    headers = {
        'Content-Type': 'application/json',
    }
    # GraphQL request: operation name, variables, and the query itself
    post = {
        "operationName": "searchProduct",
        "variables": {
            "searchRequest": {
                "keyword": "coffee",
                "sourceTypes": ["PRODUCT"],
                "fieldSets": ["ITEMS"],
                "imageSize": '400x400',
                "pageId": 'affiliate-shop-srp',
                "siteId": 'us-shopping',
                "countryCode": 'US',
                "lang": 'en',
            },
        },
        "query": "query searchProduct($searchRequest: SearchRequest) { search(searchRequest: $searchRequest) { totalCount items { provider gtin itemId title price salePrice currency image vendor vendorId } }}"
    }
    url = 'https://shopping.yahoo.com/graphql'
    response = requests.post(url, headers=headers, data=json.dumps(post))
    # Decode the JSON body of the response into a Python dictionary
    return response.json()
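A network call like the one above can hang or come back with an error status, so it may be worth adding a timeout and a status check. The following is a sketch under assumptions, not the function used in this post: the name post_graphql and its post parameter are hypothetical, and the post parameter exists only so the function can be tried without touching the network. The json= and timeout= arguments and raise_for_status() are standard features of the requests library.

```python
def post_graphql(payload, url="https://shopping.yahoo.com/graphql", post=None, timeout=10):
    """Send a GraphQL payload and return the decoded JSON body (hypothetical helper)."""
    if post is None:
        # Imported lazily so that a stand-in callable can be passed instead.
        import requests
        post = requests.post
    # json= serializes the dict and sets the Content-Type: application/json header.
    response = post(url, json=payload, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.json()
```

Called as post_graphql(post_dict), where post_dict is the request dictionary built above, it would behave like the original request with the two safeguards added.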
The function scrape_yahoo_shopping() is where we parse the response from the API. We loop through the items found under data, then search, then items in the response, extract the relevant fields, and collect them in the shopping_results list.
def scrape_yahoo_shopping():
    data_gql = shopping_yahoo_gql()
    if data_gql:
        shopping_results = [{
            "position": index + 1,
            "product_id": item.get('gtin'),
            "link": f"https://shopping.yahoo.com/product/{item.get('gtin')}",
            "title": item.get('title'),
            "seller": item.get('vendor'),
            "price": float(item.get('price', 0)),
            "sale_price": float(item.get('salePrice', 0)) if item.get('salePrice') != item.get('price') else None,
            "thumbnail": item.get('image'),
        } for index, item in enumerate(data_gql.get('data', {}).get('search', {}).get('items', []))]
        print(json.dumps({"shopping_results": shopping_results}, indent=2))
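GraphQL servers generally report failures in a top-level errors list rather than only through the HTTP status, and individual fields can come back as null. Whether Yahoo's endpoint does either for this query is an assumption here, not something verified in this post. As a defensive sketch, the two hypothetical helpers below (extract_items and to_price are illustrative names) show one way to guard the parsing step:

```python
def extract_items(response_json):
    """Return the list at data.search.items, or raise if the server reported errors."""
    if response_json.get("errors"):
        messages = [str(err.get("message", err)) for err in response_json["errors"]]
        raise RuntimeError("GraphQL errors: " + "; ".join(messages))
    # "or {}" also covers keys that are present but set to null.
    data = response_json.get("data") or {}
    search = data.get("search") or {}
    return search.get("items") or []


def to_price(value):
    """Convert a price value to float, returning None when it is missing or invalid."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None
```

With these in place, a result entry could use to_price(item.get('price')) instead of calling float() directly, so that a missing price becomes None rather than raising an exception.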
Finally, we call our scrape_yahoo_shopping() function in the script's main entry point. This will execute our scraping function and print the results in the console:
if __name__ == "__main__":
    scrape_yahoo_shopping()
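The script in this post always searches for "coffee". If you want to pass the keyword on the command line instead, a small sketch using the standard library's argparse module could look like this. The function name parse_keyword is hypothetical, and wiring it in would also require shopping_yahoo_gql() to accept a keyword argument, which the version in this post does not.

```python
import argparse


def parse_keyword(argv=None):
    """Read the search keyword from the command line, defaulting to "coffee"."""
    parser = argparse.ArgumentParser(description="Search Yahoo Shopping")
    # nargs="?" makes the positional argument optional.
    parser.add_argument("keyword", nargs="?", default="coffee", help="search term")
    # When argv is None, argparse reads the arguments from sys.argv.
    return parser.parse_args(argv).keyword
```

Inside the main entry point you would call parse_keyword() and hand the result to the request function.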
Full and final script
import requests
import json


def shopping_yahoo_gql():
    headers = {
        'Content-Type': 'application/json',
    }
    post = {
        "operationName": "searchProduct",
        "variables": {
            "searchRequest": {
                "keyword": "coffee",  # we're setting the search keyword to "coffee"
                "sourceTypes": ["PRODUCT"],
                "fieldSets": ["ITEMS"],
                "imageSize": '400x400',
                "pageId": 'affiliate-shop-srp',
                "siteId": 'us-shopping',
                "countryCode": 'US',
                "lang": 'en',
            },
        },
        "query": "query searchProduct($searchRequest: SearchRequest) { search(searchRequest: $searchRequest) { totalCount items { provider gtin itemId title price salePrice currency image vendor vendorId } }}"
    }
    url = 'https://shopping.yahoo.com/graphql'
    response = requests.post(url, headers=headers, data=json.dumps(post))
    return response.json()


def scrape_yahoo_shopping():
    data_gql = shopping_yahoo_gql()
    if data_gql:
        shopping_results = [{
            "position": index + 1,
            "product_id": item.get('gtin'),
            "link": f"https://shopping.yahoo.com/product/{item.get('gtin')}",
            "title": item.get('title'),
            "seller": item.get('vendor'),
            "price": float(item.get('price', 0)),
            "sale_price": float(item.get('salePrice', 0)) if item.get('salePrice') != item.get('price') else None,
            "thumbnail": item.get('image'),
        } for index, item in enumerate(data_gql.get('data', {}).get('search', {}).get('items', []))]
        print(json.dumps({"shopping_results": shopping_results}, indent=2))


if __name__ == "__main__":
    scrape_yahoo_shopping()
The result of running the script:
The results here are identical to what we provide at SerpApi; check out our Yahoo Shopping API documentation. The difference is that SerpApi is faster, handles CAPTCHAs for you, supports all the filters that Yahoo Shopping has, and offers easily controllable pagination.
Ending
Do you think Yahoo implementing GraphQL could make it a suitable replacement for Google Shopping? Let us know what you think on Twitter at @serp_api
Don't miss the other blog post about Scraping Naver Video Search Results using Python
- You can sign up for SerpApi here: https://serpapi.com/
- You can find the SerpApi user forum here: https://forum.serpapi.com/
- You can find the API documentation here: https://serpapi.com/search-api/
Happy scraping!