Creating a custom database for a machine learning project can be messy and, most of the time, complicated. As a worked example, this blog post covers retrieving images with SerpApi's Google Images Scraper API. There are a number of challenges one faces in creating an image database that can fulfill specific needs. Here are the main concerns about such an undertaking.
(I) Finding images in bulk
a) The source has to have an abundance of images with a clear pathway to retrieve them.
(II) Retrieving images
a) The script to retrieve images shouldn't require much optimization.
b) Accepted image types should be stated explicitly so that they don't cause incompatibility with other packages.
c) The method has to be fast and transparent about the time it takes.
d) The method has to be reliable, without interruption.
e) The method should allow custom queries to be retrieved, not just a fixed set of them.
f) The method should not rely on any particular language, for the sake of reproducibility.
(III) Placing images in the right place
a) Duplicate images must be omitted.
b) Each image has to be in its corresponding place.
c) The overall result should be adaptable to any machine learning library in any language.
For this undertaking, the programming language used is Julia, as it is designed for machine learning and statistics tasks and offers fast runtime performance. However, any other language can be used to achieve the same results.
You can find the GitHub repo for all files mentioned below at custom-image-database-maker.
Here are the libraries to use:
    module DatabaseMaker

    using HTTP
    using JSON3
    using CSV
    using DataFrames

    export make
To give a clear breakdown of why we use these packages in the code:
`HTTP`: SerpApi can be used to retrieve data using HTTP requests.
`JSON3`: SerpApi's response will be in an easy-to-understand JSON body.
`CSV`: This package is used to get the list of queries, and to report the images we retrieved in a CSV table.
`DataFrames`: This package is used to read and manipulate the CSV tables in question.
`export make` exports the main function of interest from this module (`DatabaseMaker`) so that other files can use it. Every other function feeds `make`.
Let's start with the main function.
All the other functions below feed this main function so that it can do its task properly. For an overall view of the process, here's a scheme of operation:
Let's define the first function used to feed `make`:
    function ask_for_query_preference()
        println("Would you like to add queries from terminal or from a local file?")
        println("1) Terminal (first page per each query)")
        println("2) Local File (can define page ranges)")
        println("Enter a number: ")
        answer = readline()
        answer == "1" && read_from_terminal()
        answer == "2" && read_from_file()
    end
This function asks the user how they would like to enter queries. The local file here is a CSV file the user can fill in, whereas terminal entry is a collection of queries separated by commas. If the answer is `1`, the script reads from the terminal. If the answer is `2`, the script reads from the file. The CSV file can specify a range of pages (Ex. `0-11`), whereas terminal entry only retrieves the first page of each query. Here's the output:
    Would you like to add queries from terminal or from a local file?
    1) Terminal (first page per each query)
    2) Local File (can define page ranges)
    Enter a number:
Let's define the function for reading from terminal:
    function read_from_terminal()
        println("Enter queries seperated by commas:")
        queries = readline()
        queries = split(queries, ",")
        global queries = DataFrame(
            q = queries,
            page_range = fill(0, length(queries))
        )
    end
The `read_from_terminal()` function asks for queries and reads the answer. It then separates the queries by commas. For example, `dog,cat,banana,apple` creates a global array (one that is visible to all functions) that consists of 4 elements: `dog`, `cat`, `banana`, and `apple`.
The next line turns it into a dataframe to keep it consistent with reading from a file.
Here's the output:
Enter queries seperated by commas:
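As a small illustration of the splitting step, the sketch below uses only Base Julia. `parse_queries` is a hypothetical helper, not part of the module. Unlike the function above, it also trims the spaces around each query and drops empty entries, which is an assumption about what a user would want.

```julia
# Hypothetical helper: turn one line of terminal input into a clean list of queries.
function parse_queries(line::AbstractString)
    # Split on commas, then trim the whitespace around every piece.
    parts = strip.(split(line, ","))
    # Keep only non-empty pieces and convert them to plain Strings.
    return [String(p) for p in parts if !isempty(p)]
end

queries = parse_queries("dog, cat,banana ,apple,")
println(queries)
```

With this kind of cleanup, an input such as `dog, cat` would not produce a query that starts with a space.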
The next part is to define the function that reads from the file:
    function read_from_file()
        println("Reading from file...")
        types = Dict(
            :q => String,
            :page_range => String,
        )
        queries = DataFrame(CSV.File("load_new_data.csv"; types))
        global queries = coalesce.(queries, 0)
    end
The `read_from_file()` function reads the CSV file called `load_new_data.csv` to fetch the queries with their desired page ranges. Page range here refers to the pagination of SerpApi's Google Images Scraper API.
Here's an example of a CSV structure:
To break it down into its components:
`q` stands for the query.
`page_range` stands for a specific page range, or a specific page.
`Apple` has a page range of `0`. This means the request to SerpApi will be made only for the first page. (The `ijn` parameter starts from `0`, which will be explained in the upcoming parts.)
For the second entry, `Banana`, we ask only for the second page, whose `ijn` value is `1`.
`Cat` has a range of `0-11` (0 and 11 included). This means that it is necessary to make 12 requests to SerpApi with the same `q` value but a different `ijn` value in order to get all pages.
`Dog` doesn't have any page range. By default, it will be considered `0` (first page).
The output is;
Reading from file...
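The file layout described above can be sketched as follows. The CSV contents are an assumption reconstructed from the four entries in the breakdown, and `parse_query_rows` is a hypothetical helper that uses only Base Julia instead of CSV.jl and DataFrames, so that the defaulting of an empty page range to `0` is easy to see.

```julia
# Assumed contents of load_new_data.csv, reconstructed from the description above.
csv_text = """
q,page_range
Apple,0
Banana,1
Cat,0-11
Dog,
"""

# Hypothetical helper: parse each data row into (query, page_range),
# treating an empty page range as "0" (the first page).
function parse_query_rows(text::AbstractString)
    rows = Tuple{String,String}[]
    lines = split(strip(text), '\n')
    for line in lines[2:end]                       # skip the header row
        fields = split(line, ",")
        q = String(strip(fields[1]))
        page_range = length(fields) > 1 ? String(strip(fields[2])) : ""
        push!(rows, (q, isempty(page_range) ? "0" : page_range))
    end
    return rows
end

for (q, page_range) in parse_query_rows(csv_text)
    println(q, " => ", page_range)
end
```

This simplified parser does not handle quoted fields or commas inside a query; CSV.jl takes care of those cases in the actual function.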
The next function we will build is about a nice feature of SerpApi: cache preference.
    function ask_for_cache_preference()
        println("Would you like cached data? (Note that cached searches don't use calls from your account. But if the query is not cached, it'll resort to \"no_cache=true\")")
        println("1) No Cached Searches (true)")
        println("2) Cached Searches (false)")
        println("Enter a number: ")
        answer = readline()
        answer == "1" && global cache_preference = true
        answer == "2" && global cache_preference = false
        println(cache_preference)
    end
If a search has been cached in SerpApi's database, one can call it without spending any credits. However, it might not be real-time data. Hence, it is wise to offer a `no_cache=true` option. This function sets a global variable, `cache_preference`, accordingly.
Here's the output:
    Would you like cached data? (Note that cached searches don't use calls from your account. But if the query is not cached, it'll resort to "no_cache=true")
    1) No Cached Searches (true)
    2) Cached Searches (false)
    Enter a number:
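One way to make this menu easier to test is to separate the decision from the input. The sketch below is an assumption, not the module's code: `parse_cache_answer` is a hypothetical helper that returns `nothing` for any unexpected answer, so that a caller could ask again instead of reaching `println(cache_preference)` with the variable possibly unset.

```julia
# Hypothetical pure helper: map the menu answer to a `no_cache` value.
function parse_cache_answer(answer::AbstractString)
    choice = strip(answer)
    choice == "1" && return true     # no cached searches -> no_cache=true
    choice == "2" && return false    # cached searches allowed -> no_cache=false
    return nothing                   # anything else is treated as invalid
end

for answer in ("1", "2", "maybe")
    preference = parse_cache_answer(answer)
    println(repr(answer), " => ", preference === nothing ? "invalid, ask again" : preference)
end
```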
We also need a function to read the API key of your SerpApi account:
    function get_api_key()
        global api_key = read(Base.getpass("Enter your API key:"), String)
    end
`Base.getpass()` is used so that nothing is echoed to the screen while reading the input from the user. A global variable called `api_key` is declared within the function.
Enter your API key:
Here comes the function in which we construct the searches based on our entries:
    function construct_searches()
        for (index,q) ∈ enumerate(queries[:,:q])
            page = queries[index,:page_range]
            occursin("-", string(page)) ? range_page_number(q, page) : call_serpapi(q, page)
        end
    end
`∈` here stands for `in`, as found in many programming languages. This function iterates through the global dataframe variable we declared before, `queries`. For each query it declares a local `page` variable and checks whether it contains the `-` substring.
If it includes `-`, for example `0-11`, the function calls another function called `range_page_number(q, page)`.
If it is only one page, for example `0`, the function calls the `call_serpapi(q, page)` function.
Let's continue with how the function for the range of pages works:
    function range_page_number(q, page)
        page = split(page, "-")
        # The first element is the start of the range, the second is the end.
        from = parse(Int, page[1])
        to = parse(Int, page[2])
        for page ∈ from:to
            call_serpapi(q, page)
        end
    end
`range_page_number` takes in two variables, namely `q` and `page`. Let's assume that `page` is `0-11`. In this function, `0-11` is split at the `-` substring and becomes an array of two strings, `0` and `11`. The local `from` variable is declared from the first element of that array as an integer, and the `to` variable is declared in the same way from the second element. With the `from` and `to` variables, iteration is possible: an individual call for each page is made to SerpApi with `call_serpapi(q, page)`.
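The page-range logic can be condensed into a function that returns the list of `ijn` values to request. `expand_page_range` is a hypothetical name used for illustration; the sketch assumes a range is written as `from-to` with non-negative integers, and it makes no request itself.

```julia
# Hypothetical helper: list every page number (ijn value) covered by a page_range entry.
function expand_page_range(page)
    text = string(page)                  # accepts "0-11", "1", or the integer 0
    if occursin("-", text)
        bounds = split(text, "-")
        from = parse(Int, bounds[1])     # first element: start of the range
        to = parse(Int, bounds[2])       # second element: end of the range (included)
        return collect(from:to)
    end
    return [parse(Int, text)]            # a single page
end

println(length(expand_page_range("0-11")))   # number of requests for the Cat example
println(expand_page_range("1"))
```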
Let's see how `call_serpapi(q, page)` works:
    function call_serpapi(q, page)
        params = [
            "q" => q,
            "tbm" => "isch",
            "ijn" => page,
            "api_key" => api_key,
            "no_cache" => cache_preference,
        ]
        uri = "https://serpapi.com/search.json?"
        println("Querying \"q\":\"$(q)\", \"ijn\":\"$(page)\" with \"no_cache\":\"$(cache_preference)\"...")
        results = HTTP.get(uri, query = params)
        results = JSON3.read(results.body)
        results = results[:images_results]
        results = [resulting_image[:original] for resulting_image ∈ results]
        println("Checking if folder and csv exists...")
        folder_name = replace(q, " " => "_")
        folder_name = replace(folder_name, "." => "_")
        check_folder_and_csv(folder_name)
        check_new_links(folder_name, results)
    end
This function is used to call SerpApi to get an easy-to-understand JSON body. We will be using 5 parameters in our call to achieve this:
Let's break it down:
`q => q` declares the query string from the local variable `q`.
`tbm => isch` declares that the engine to use is the Google Image Search Scraper API.
`ijn => page` declares the page number from the local variable `page`.
`api_key => api_key` declares the API key gathered from the user.
`no_cache => cache_preference` declares the cache preference.
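To see how these parameters end up in a request, the sketch below assembles a query string by hand with Base Julia. It is only an illustration: in the module, `HTTP.get` builds the query from `params`, and its exact escaping may differ from this simplified encoder. The helper names `percent_encode` and `build_query_string` and the placeholder key are assumptions.

```julia
# Hypothetical helper: percent-encode everything except unreserved URL characters.
function percent_encode(s::AbstractString)
    io = IOBuffer()
    for byte in codeunits(s)
        c = Char(byte)
        if isascii(c) && (isletter(c) || isdigit(c) || c in ('-', '_', '.', '~'))
            print(io, c)
        else
            print(io, '%', uppercase(string(byte, base = 16, pad = 2)))
        end
    end
    return String(take!(io))
end

# Hypothetical helper: join key/value pairs into "key=value&key=value".
function build_query_string(params)
    return join(["$(percent_encode(string(k)))=$(percent_encode(string(v)))" for (k, v) in params], "&")
end

params = [
    "q" => "cat food",
    "tbm" => "isch",
    "ijn" => 0,
    "api_key" => "YOUR_API_KEY",   # placeholder, not a real key
    "no_cache" => true,
]
println("https://serpapi.com/search.json?" * build_query_string(params))
```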
The next step is to make a call to SerpApi with the HTTP package:
Here's a resulting example of a URI:
Results will be in JSON format. The link of each individual image should be retrieved from this JSON body. In order to parse the JSON body, these three lines of code are required:
But first, let's see what an actual result looks like to dig deeper into the concept. To demonstrate it, we will use another nice feature of SerpApi, the playground. The playground is where you can see the resulting HTML alongside the JSON response, and play around with parameters to come up with different results.
Let's take a look at what can be found in a JSON body:
The `images_results` key contains an array of images.
The `original` key within an individual result gives the original link for the image. The idea is to make an array out of these individual links to be downloaded. In the end, the local `results` variable will be an array of strings containing links to images.
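The extraction of links can be tried without making a request. In the sketch below, a NamedTuple stands in for the parsed JSON body; the `images_results` and `original` field names follow the keys described above, while the `position` and `thumbnail` fields and all values are made up for illustration. `original_links` is a hypothetical helper that, unlike the comprehension in `call_serpapi`, skips entries without an `original` key.

```julia
# Mock of a parsed response; only the key names come from the text above.
response = (
    images_results = [
        (position = 1, original = "https://image.com/cat.jpg"),
        (position = 2, original = "https://image.com/cat.png"),
        (position = 3, thumbnail = "https://image.com/thumb.jpg"),  # no `original` key
    ],
)

# Hypothetical helper: collect the `original` link of every result that has one.
function original_links(response)
    haskey(response, :images_results) || return String[]
    return [String(image[:original]) for image in response[:images_results] if haskey(image, :original)]
end

println(original_links(response))
```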
The lines below make a folder name out of the query:
In the end, images retrieved from the `cat` query would end up in the `cat` folder, whereas images of `cat food` would be placed inside the `cat_food` folder.
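The two `replace` calls from `call_serpapi` can be isolated to check the naming on a few queries. `folder_name_for` is a hypothetical wrapper name; the replacements themselves are the ones used in the function above.

```julia
# Replace spaces and dots with underscores, as call_serpapi does.
folder_name_for(q::AbstractString) = replace(replace(q, " " => "_"), "." => "_")

for q in ("cat", "cat food", "st. bernard")
    println(q, " => ", folder_name_for(q))
end
```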
Each image has to be inside its own folder so that a coherent body of images can be loaded into the model later. But first, the script has to check whether there is a folder under that name and whether it contains a CSV file.
Let's dig into how the folder name is checked:
If there is no folder with the name `cat` inside `Datasets` in our example, `make_folder_and_csv(folder_name)` gets into action.
This function creates a `cat` folder and also creates an empty `links.csv` file within it. This file will be used to avoid downloading duplicate image links in the future.
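A minimal sketch of the check-and-create step, assuming the `Datasets/<folder_name>/links.csv` layout used in this post. `ensure_folder_and_csv` is a hypothetical stand-in that merges the check and the creation into one function; the module's own `check_folder_and_csv` and `make_folder_and_csv` may be structured differently. A temporary directory is used so the example leaves no files behind in the project.

```julia
# Hypothetical helper: make sure <root>/<folder_name>/links.csv exists and return its path.
function ensure_folder_and_csv(root::AbstractString, folder_name::AbstractString)
    folder = joinpath(root, folder_name)
    isdir(folder) || mkpath(folder)          # create the folder (and parents) if missing
    csv_path = joinpath(folder, "links.csv")
    isfile(csv_path) || touch(csv_path)      # create an empty file, never overwrite one
    return csv_path
end

datasets_root = joinpath(mktempdir(), "Datasets")
csv_path = ensure_folder_and_csv(datasets_root, "cat")
println(isfile(csv_path))
```

Calling the helper a second time leaves an existing `links.csv` untouched, which matters because that file is what prevents duplicate downloads.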
The last line in the `call_serpapi()` function is responsible for checking that the links are unique and haven't been downloaded before.
Here's the full function:
These four lines create a global dataframe variable named `links` out of the `links.csv` file within the specified folder. The path to it will be `Datasets/cat/links.csv` in our example:
This line will iterate through each result:
Let's declare a specific set of file types to be picked from the results so that they won't cause a problem when downloaded and loaded into the model:
If `links` is not empty and `result` meets the specified conditions, it is appended together with its folder name (Ex. `["https://image.com/cat.jpg","cat"]`) by a function.
Let's dig into how the `append_links()` function works:
Making a dataframe row out of the `result` and adding it to the `links` dataframe keeps track of the uniqueness of the URIs to be requested.
`images_to_be_requested` is a global array declared within the main `make` function, which will be revealed below. In the end, we will end up with `[["https://image.com/cat.png","cat"]...]` as our `images_to_be_requested`.
If the `links` array is empty, `result` and `folder_name` are directly appended in the same structure:
When the links are ready, we write them to our `links.csv` file:
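The filtering and de-duplication steps described above can be sketched with Base Julia only. This is not the module's `check_new_links` or `append_links`: `select_new_links` is a hypothetical helper, a `Set` replaces the `links` dataframe, and the list of accepted extensions is an assumption made for the example.

```julia
# Hypothetical helper: keep links that have an accepted extension and were not seen before.
# The `accepted` extensions are an assumption, not the module's actual list.
function select_new_links(results, known_links, folder_name; accepted = (".jpg", ".jpeg", ".png"))
    seen = Set{String}(known_links)              # links already recorded in links.csv
    selected = Vector{Vector{String}}()
    for result in results
        has_accepted_extension = any(ext -> endswith(lowercase(result), ext), accepted)
        if has_accepted_extension && !(result in seen)
            push!(seen, result)                   # also guards against duplicates within `results`
            push!(selected, [result, folder_name])
        end
    end
    return selected
end

results = [
    "https://image.com/cat.jpg",
    "https://image.com/cat.jpg",     # duplicate within the same response
    "https://image.com/cat.gif",     # extension not accepted in this sketch
    "https://image.com/cat2.png",
    "https://image.com/old.png",     # already present in links.csv
]
known_links = ["https://image.com/old.png"]
println(select_new_links(results, known_links, "cat"))
```

The returned pairs have the same `[link, folder_name]` shape as the entries of `images_to_be_requested`.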
Up to this point, an array full of links and their corresponding folders has been fed into a global array, namely `images_to_be_requested`. All the links in this array have been appended to the `links.csv` inside each corresponding folder without creating duplicates. Now comes the easy part: downloading the images into their corresponding folders:
These lines iterate through `images_to_be_requested` and download the individual images to their corresponding folders:
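A minimal sketch of such a download loop, under stated assumptions rather than the module's actual code. `download_all` and its `destination_for` argument are hypothetical names. The `fetch` keyword defaults to `Downloads.download` from Julia's standard library (available since Julia 1.6), and the demonstration swaps in a stub so that it runs without network access. A failed download is reported and skipped so that one broken link does not interrupt the run.

```julia
using Downloads

# Hypothetical helper: download every [uri, folder_name] pair, skipping failures.
function download_all(images, destination_for; fetch = Downloads.download)
    saved = String[]
    for (uri, folder_name) in images
        path = destination_for(uri, folder_name)   # where this image should be stored
        try
            fetch(uri, path)
            push!(saved, path)
        catch err
            println("Skipping ", uri, ": ", err)
        end
    end
    return saved
end

# Demonstration with stand-ins: no real request is made here.
demo_root = mktempdir()
stub_fetch(uri, path) = occursin("broken", uri) ? error("simulated failure") : write(path, "image bytes")
stub_destination(uri, folder_name) = joinpath(demo_root, folder_name * "_" * basename(uri))

images = [["https://image.com/cat.jpg", "cat"], ["https://image.com/broken.png", "cat"]]
saved = download_all(images, stub_destination; fetch = stub_fetch)
println(length(saved), " of ", length(images), " images saved")
```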
Let's break down `define_filename(uri,folder_name)` to understand the process:
This function uses `folder_name` to check whether any images were previously downloaded, in order to give the image a number. Numbering starts from `1`. For example, if the maximum number a file has is `5`, the next file will be numbered `6`. The `uri` variable is used for giving the proper extension to the file. The resulting `filename` is the full path the image is downloaded to. It is then fed back into the download step.
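The numbering scheme can be sketched as follows. `next_filename` is a hypothetical stand-in for `define_filename`, written from the description above rather than from the module's source. It assumes that image files are named with a plain integer plus an extension, and that the extension can be read from the end of the URI once any query string is removed.

```julia
# Hypothetical helper: pick "<max number + 1><extension>" inside `folder`.
function next_filename(uri::AbstractString, folder::AbstractString)
    numbers = Int[]
    for name in readdir(folder)
        stem, _ = splitext(name)
        number = tryparse(Int, stem)               # links.csv and other files yield `nothing`
        number === nothing || push!(numbers, number)
    end
    next_number = isempty(numbers) ? 1 : maximum(numbers) + 1
    # Take the extension from the URI, ignoring anything after a "?".
    _, extension = splitext(String(first(split(uri, "?"))))
    return joinpath(folder, string(next_number, extension))
end

folder = mktempdir()
foreach(name -> touch(joinpath(folder, name)), ["1.jpg", "5.png", "links.csv"])
println(basename(next_filename("https://image.com/cat.png", folder)))
```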
Let's call the module from another file for further improvements in the future.
This concludes the tasks needed to build a custom image database maker using SerpApi's Google Images Scraper API.
Here's a showcase of how it works:
Conclusion: SerpApi is a powerful tool for a variety of tasks, including creating databases with customized datasets. This is only one use case of the tool, which could be expanded with other features of SerpApi and used for machine learning projects.