AI Training at Scale

This is part of a series of blog posts on Artificial Intelligence Implementation. If you are interested in the background of the story or how it goes:

Last week, we explored how to use the chips parameter, which narrows down results, and how to query images of a specific size with SerpApi's Google Images Scraper API to train machine learning models. This week we will explore updating the image dataset automatically with multiple queries of the same kind, and see the results in larger-scale deep learning.

What is AI at scale?

The term refers to the ability to expand the image dataset used in the machine learning training process, and to expand or retrain the machine learning model at scale with minimal effort. In simple terms, if you have a model that differentiates between a cat and a dog, you should be able to expand AI training easily by automatically collecting monkey images and then retraining or expanding the existing classifier, possibly with different frameworks. AI solutions with large models need effective workflows for model development.

How to reduce training time?

By using clean data, you can reduce noise and build effective artificial intelligence models that can outperform other AI models. In computer vision this matters even more, since the workloads differ significantly from those of algorithms for NLP (natural language processing).

Easily Scraping Clean Data

Last week, I showed how to get the chips parameter manually from a Google search. SerpApi can also build a list of the available chips values and serve them under the suggested_searches key.

One thing to notice is that the chips value differs from query to query. However, you can acquire the chips value for a particular query once and reuse it across different pages of that same query.

For example, the chips value for the query American Foxhound imagesize:500x500 with the specification dog is:
q:american+foxhound,g_1:dog:FSAvwTymSjE%3D

The chips value for the same specification with the query American Hairless Terrier imagesize:500x500 is:
q:american+hairless+terrier,g_1:dog:AI24zk7-hcI%3D

As you can observe, the encoded token at the end of the chips value differs between queries. This is why you need to make at least one call to SerpApi's Google Images Scraper API per query to get the desired chips value.
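To make the difference easier to see, here is a minimal sketch that splits the two chips values above into their parts. The layout `q:<query>,g_1:<label>:<token>` is an assumption inferred from these two examples only, and `split_chips` is a hypothetical helper, not part of SerpApi.

```python
from urllib.parse import unquote, unquote_plus


def split_chips(chips: str) -> dict:
    """Split a chips string into query, label, and token parts."""
    # Assumed layout (inferred from the two examples): "q:<query>,g_1:<label>:<token>"
    query_part, group_part = chips.split(",", 1)
    _, label, token = group_part.split(":", 2)
    return {
        "query": unquote_plus(query_part.split(":", 1)[1]),  # "+" stands for a space
        "label": label,
        "token": unquote(token),  # "%3D" decodes to "="
    }


foxhound = split_chips("q:american+foxhound,g_1:dog:FSAvwTymSjE%3D")
terrier = split_chips("q:american+hairless+terrier,g_1:dog:AI24zk7-hcI%3D")

print(foxhound)
print(terrier)
# Same label, different token: the value cannot be reused across queries.
print(foxhound["label"] == terrier["label"], foxhound["token"] != terrier["token"])
```

The label is shared, but the token is tied to the query, so it has to be looked up once for each breed.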

I have added a new key called desired_chips_name within our Query class to check for the results with the desired name in suggested_searches:

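from typing import Optional

from pydantic import BaseModel

class Query(BaseModel):
	google_domain: str = "google.com"
	num: str = "100"
	ijn: str = "0"
	q: str
	chips: Optional[str] = None  # filled in by chips_serpapi_search
	desired_chips_name: Optional[str] = None  # name to look for in suggested_searches
	api_key: str ## You may replace this with `api_key: str = "Your API Key"`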

The Download class also needed another function to extract the chips value and use it for the search:

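	def chips_serpapi_search(self):
		# Lookup call: fetch suggested_searches so the chips value can be resolved
		params = {
			"engine": "google",
			"ijn": self.query.ijn,
			"q": self.query.q,
			"google_domain": self.query.google_domain,
			"tbm": "isch",
			"num": self.query.num,
			"chips": self.query.chips,
			"api_key": self.query.api_key
		}

		search = GoogleSearch(params)
		results = search.get_dict()
		# suggested_searches may be missing from the response, so default to an empty list
		suggested_results = results.get('suggested_searches', [])
		chips = [x['chips'] for x in suggested_results if x.get('name') == self.query.desired_chips_name]
		if chips:
			# Store the value on the query so that later searches reuse it
			self.query.chips = chips[0]
			return chips[0]
		# No suggestion matched desired_chips_name
		return None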

This function calls SerpApi with a query, returns the desired chips parameter, and updates it on the Query object.
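The selection step can also be tried without spending an API call. The sketch below is a hypothetical standalone helper (`pick_chips` is not part of the project) that applies the same name matching to a hand-written stand-in for a response. The `puppy` entry and its `PLACEHOLDER` token are made up for illustration; only the `name` and `chips` keys are taken from the code above.

```python
from typing import Optional


def pick_chips(results: dict, desired_name: str) -> Optional[str]:
    """Return the chips value of the first suggestion with a matching name."""
    for suggestion in results.get("suggested_searches", []):
        if suggestion.get("name") == desired_name:
            return suggestion.get("chips")
    return None  # no suggestion matched, or the key was missing


# Hand-written stand-in for an API response; real responses carry more keys.
sample_results = {
    "suggested_searches": [
        {"name": "puppy", "chips": "q:american+foxhound,g_1:puppy:PLACEHOLDER"},
        {"name": "dog", "chips": "q:american+foxhound,g_1:dog:FSAvwTymSjE%3D"},
    ]
}

print(pick_chips(sample_results, "dog"))
print(pick_chips(sample_results, "cat"))  # no match
print(pick_chips({}, "dog"))              # no suggested_searches key at all
```

Returning None instead of raising lets the caller decide whether to search without a chips filter or to skip the query.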

Now that everything is in place to automatically make a call with a desired specification and add the results to our AI model dataset for machine learning training, let's declare a class for running multiple queries:

from typing import List

class MultipleQueries(BaseModel):
	queries: List[str] = ["american foxhound"]
	desired_chips_name: str = "dog"
	height: int = 500
	width: int = 500
	number_of_pages: int = 2
	num: int = 100
	google_domain: str = "google.com"
	api_key: str ## You may replace this with `api_key: str = "Your API Key"`

queries is the list of queries you want to classify between, for example a list of dog breeds.
desired_chips_name is their common specification. In this case, it is dog.
height and width are the desired dimensions of the images to be collected for deep learning training.
num is the number of results per page.
number_of_pages is the number of pages you'd like to get. With num set to 100, 1 page gives around 100 images, 2 pages give around 200 images, and so on.
google_domain is the Google domain you'd like to search on.
api_key is your unique API key for SerpApi.
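As a rough illustration of how these settings combine, the sketch below lists every query string and page index they expand to. `plan_searches` is a hypothetical helper, and it assumes the size suffix is written as `imagesize:<width>x<height>`; for square sizes such as 500x500 the order makes no difference.

```python
from typing import List, Optional, Tuple


def plan_searches(
    queries: List[str],
    width: Optional[int],
    height: Optional[int],
    number_of_pages: int,
) -> List[Tuple[str, int]]:
    """List every (query string, page index) pair the settings expand to."""
    plan = []
    for query in queries:
        if width is not None and height is not None:
            # Assumed operator order: width first, then height
            query = f"{query} imagesize:{width}x{height}"
        for page in range(number_of_pages):
            plan.append((query, page))  # page index maps to the ijn parameter
    return plan


example_plan = plan_searches(["american foxhound", "boston terrier"], 500, 500, 2)
for q, ijn in example_plan:
    print(f"q={q!r} ijn={ijn}")
```

Each query yields number_of_pages searches, with page indices starting at 0.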

Let's define a class that automatically runs the queries in order and uploads the images to the dataset storage with their classification:

class QueryCreator:
	def __init__(self, multiplequery: MultipleQueries):
		self.mq = multiplequery

	def add_to_db(self):
		for query_string in self.mq.queries:
			if self.mq.height is not None and self.mq.width is not None:
				# Google's imagesize operator is written WIDTHxHEIGHT
				query_string = "{} imagesize:{}x{}".format(query_string, self.mq.width, self.mq.height)

			# Query declares num and ijn as strings, so cast explicitly
			query = Query(google_domain=self.mq.google_domain, num=str(self.mq.num), ijn="0", q=query_string, desired_chips_name=self.mq.desired_chips_name, api_key=self.mq.api_key)
			db = ImagesDataBase()
			serpapi = Download(query, db)
			# One lookup call to resolve the chips value, then the first page
			chips = serpapi.chips_serpapi_search()
			serpapi.serpapi_search()
			serpapi.move_all_images_to_db()

			# Remaining pages reuse the same chips value (no-op when number_of_pages is 1)
			for i in range(1, self.mq.number_of_pages):
				query.ijn = str(i)
				query.chips = chips
				db = ImagesDataBase()
				serpapi = Download(query, db)
				serpapi.serpapi_search()
				serpapi.move_all_images_to_db()

We query SerpApi once with chips_serpapi_search and then use the returned chips value to run the actual searches. Then we move all the images to the image dataset storage to train the machine learning models we created in earlier weeks.

Finally, let's declare an endpoint for it in main.py:

@app.post("/multiple_query/")
def create_query(multiplequery: MultipleQueries):
  serpapi = QueryCreator(multiplequery)
  serpapi.add_to_db()
  return {"status": "Complete"}
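For reference, a client could call this endpoint along the following lines. This is a sketch using only the standard library; the base URL `http://localhost:8000` is an assumption about where the app is served, and `build_multiple_query_request` is a hypothetical helper. The request is only built here, not sent.

```python
import json
from urllib.request import Request


def build_multiple_query_request(base_url: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST request for /multiple_query/."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + "/multiple_query/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "queries": ["american foxhound"],
    "desired_chips_name": "dog",
    "height": 500,
    "width": 500,
    "number_of_pages": 2,
    "num": 100,
    "google_domain": "google.com",
    "api_key": "<YOUR API KEY>",
}

# Assumed local address; adjust to wherever the app is running.
request = build_multiple_query_request("http://localhost:8000", payload)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Since the endpoint runs every search before it responds, the call may take a while for long query lists.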

Training at Scale

For showcasing purposes, we will be using 9 American dog breeds. In fact, we could have integrated SerpApi's Google Organic Results Scraper API to automatically fetch famous American dog breeds and pass them to our queries key. That is outside the scope of today's blog post, but it is a good indicator of the multi-purpose use cases of SerpApi.

If you head to the playground with the following link, you will be greeted with the desired results:

Now we'll take the list from organic_results and place it in our MultipleQueries object at the /multiple_query endpoint:

{
  "queries": [
     "American Hairless Terrier",
     "Alaskan Malamute",
     "American Eskimo Dog",
     "Australian Shepherd",
     "Boston Terrier",
     "Boykin Spaniel",
     "Chesapeake Bay Retriever",
     "Catahoula Leopard Dog",
     "Toy Fox Terrier"
  ],
  "desired_chips_name": "dog",
  "height": 500,
  "width": 500,
  "number_of_pages": 2,
  "num": 100,
  "google_domain": "google.com",
  "api_key": "<YOUR API KEY>"
}

For each dog breed we entered, this dictionary will fetch 2 pages (around 200 images) of 500x500 images from Google US, using the chips value for dog so that the results are narrowed down to dog images only.
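A quick back-of-the-envelope estimate for this run, based on how add_to_db issues its calls (one chips lookup plus one search per page for every query). `estimate_run` is a hypothetical helper, and the image count is an upper bound, since a page may return fewer than num results.

```python
def estimate_run(n_queries: int, number_of_pages: int, num: int) -> dict:
    """Estimate request count and the upper bound on collected images."""
    # One chips lookup plus one search per page, mirroring add_to_db
    calls_per_query = 1 + number_of_pages
    return {
        "api_calls": n_queries * calls_per_query,
        "max_images": n_queries * number_of_pages * num,
    }


estimate = estimate_run(n_queries=9, number_of_pages=2, num=100)
print(estimate)  # {'api_calls': 27, 'max_images': 1800}
```

So the nine breeds above translate to 27 requests and at most about 1,800 images.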

You can observe that the images are uploaded to the N1QL Storage Server:

If you’d like to create your own optimizations for your machine learning project, you may claim a free trial at SerpApi.

Now that we have everything we need, let's train the machine learning model that distinguishes between American Dog Breeds at the /train endpoint, with the following dictionary:

{
  "model_name": "american_dog_species",
  "criterion": {
    "name": "CrossEntropyLoss"
  },
  "optimizer": {
    "name": "SGD",
    "lr": 0.01,
    "momentum": 0.9
  },
  "batch_size": 16,
  "n_epoch": 5,
  "n_labels": 0,
  "image_ops": [
    {
      "resize": {
        "size": [
          500,
          500
        ],
        "resample": "Image.ANTIALIAS"
      }
    },
    {
      "convert": {
        "mode": "'RGB'"
      }
    }
  ],
  "transform": {
    "ToTensor": true,
    "Normalize": {
      "mean": [
        0.5,
        0.5,
        0.5
      ],
      "std": [
        0.5,
        0.5,
        0.5
      ]
    }
  },
  "target_transform": {
    "ToTensor": true
  },
  "label_names": [
     "American Hairless Terrier imagesize:500x500",
     "Alaskan Malamute imagesize:500x500",
     "American Eskimo Dog imagesize:500x500",
     "Australian Shepherd imagesize:500x500",
     "Boston Terrier imagesize:500x500",
     "Boykin Spaniel imagesize:500x500",
     "Chesapeake Bay Retriever imagesize:500x500",
     "Catahoula Leopard Dog imagesize:500x500",
     "Toy Fox Terrier imagesize:500x500"
  ]
}
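The label_names above appear to mirror the query strings built in add_to_db, including the imagesize suffix. Rather than typing them by hand, they could be generated from the same queries list. This is a sketch with a hypothetical helper, again assuming the suffix is written as width x height.

```python
from typing import List, Optional


def build_label_names(
    queries: List[str], width: Optional[int], height: Optional[int]
) -> List[str]:
    """Derive label names that match the stored query strings."""
    if width is None or height is None:
        return list(queries)  # no size filter, labels equal the raw queries
    return [f"{query} imagesize:{width}x{height}" for query in queries]


breeds = [
    "American Hairless Terrier",
    "Alaskan Malamute",
    "American Eskimo Dog",
    "Australian Shepherd",
    "Boston Terrier",
    "Boykin Spaniel",
    "Chesapeake Bay Retriever",
    "Catahoula Leopard Dog",
    "Toy Fox Terrier",
]

label_names = build_label_names(breeds, 500, 500)
print(len(label_names))
print(label_names[0])
```

Generating both payloads from one list keeps the labels in sync with what was actually scraped.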

The artificial intelligence model will be trained automatically at scale using the images we uploaded to the dataset, in this case on my computer's GPU (it is also possible to train with a CPU):

Conclusion

I am grateful to the reader for their attention and to the Brilliant People of SerpApi for making this blog post possible. This week, we focused not on creating effective models with good metrics but on automating the creation process. In the following weeks we will work on fixing its deficiencies and missing parts. These include distributed training for ML models, support for all optimizers and criterions, customizable model classes, training machine learning models on different file types (text, video), using pipelines effectively, utilizing different data science tools, and supporting other machine learning libraries such as TensorFlow alongside PyTorch. I don’t aim to make a state-of-the-art project, but I do aim to create a solution for easily creating AI systems. I also aim to create a visual comparison of the different deep learning models created within the ecosystem. We will also discuss visualizing the training process to arrive at effective models. The aim is to have an open source library where you can scale your models at will.