This week we'll showcase the database creation process for implementing machine learning models for general testing purposes. We will be using SerpApi's Google Organic Results Scraper API for data collection. You can also take a more detailed look at the data we will use in the playground.
Creating a Linearized CSV out of a Hash
Let's initialize the class variable @@pattern_data and define @@vocab inside the Database class. The default vocabulary we will use to vectorize words and sentences is { "<unk>" => 0, " " => 1 }.
class Database
  def initialize json_data, vocab = { "<unk>" => 0, " " => 1 }
    super()
    @@pattern_data = []
    @@vocab = vocab
  end
Next, we need a recursive way to translate hashes into lines of understandable bits, each consisting of a value and its key pattern. For instance, we need to translate this:
{
"position": 1,
"title": "Coffee - Wikipedia",
"link": "https://en.wikipedia.org/wiki/Coffee",
"displayed_link": "https://en.wikipedia.org › wiki › Coffee",
"thumbnail": "https://serpapi.com/searches/62436d12e7d08a5a74994e0f/images/ed8bda76b255c4dc4634911fb134de5319e08af7e374d3ea998b50f738d9f3d2.jpeg",
"snippet": "Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain flowering plants in the Coffea genus. From the coffee fruit, ...",
...
}
to this:
Result | Key Pattern
---|---
Coffee - Wikipedia | title
https://en.wikipedia.org/wiki/Coffee | link
https://en.wikipedia.org › wiki › Coffee | displayed_link
... | ...
https://en.wikipedia.org/wiki/History_of_coffee | sitelinks__inline__n__link
... | ...
Notice that position doesn't matter for some elements, so we mark them with n, and inner keys are separated by __.
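As a minimal sketch of this convention, consider a hypothetical, trimmed-down hash:

## Hypothetical trimmed hash illustrating the flattening convention:
{ "sitelinks" => { "inline" => [ { "link" => "https://en.wikipedia.org/wiki/History_of_coffee" } ] } }

Inner keys are joined with __, the array index collapses into n, and the value ends up stored as ["https://en.wikipedia.org/wiki/History_of_coffee", "sitelinks__inline__n__link"].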
Now let's define the main function that'll fill @@pattern_data (our master array) and write it to a CSV file for future use.
def self.add_new_data_to_database json_data, csv_path = nil
  json_data.each do |result|
    recursive_hash_pattern result, ""
  end

  ## Drop rows containing nil values, then deduplicate
  @@pattern_data = @@pattern_data.reject { |pattern| pattern.include? nil }.uniq.compact
  path = "#{csv_path}master_database.csv"
  File.write(path, @@pattern_data.map(&:to_csv).join)
end
Let's break down recursive_hash_pattern and its relevant functions:
## For keys that directly contain String, Integer, Float etc.
def self.element_pattern result, pattern
  @@pattern_data.append([result, pattern].flatten)
end

## For Arrays that contain String, Integer, Float etc.
def self.element_array_pattern result, pattern
  result.each do |element|
    element_pattern element, pattern
  end
end
## Main Process
def self.assign hash, key, pattern
  ## If the key contains a hash, it has to be recursed until all
  ## child components are collected.
  if hash[key].is_a?(Hash)
    pattern = pattern.present? ? "#{pattern}__#{key}" : "#{key}"
    recursive_hash_pattern hash[key], pattern
  ## If the key contains an array of hashes,
  ## all the hashes should be recursed to their components.
  elsif hash[key].present? && hash[key].is_a?(Array) && hash[key].first.is_a?(Hash)
    pattern = pattern.present? ? "#{pattern}__#{key}__n" : "#{key}"
    hash[key].each do |hash_inside_array|
      recursive_hash_pattern hash_inside_array, pattern
    end
  ## If the key contains an array consisting of base elements,
  ## each element should be added with the right key pattern.
  elsif hash[key].present? && hash[key].is_a?(Array)
    pattern = pattern.present? ? "#{pattern}__n" : "#{key}"
    element_array_pattern hash[key], pattern
  ## If the element contains String, Float, etc.
  else
    pattern = pattern.present? ? "#{pattern}__#{key}" : "#{key}"
    element_pattern hash[key], pattern
  end
end

def self.recursive_hash_pattern hash, pattern
  hash.keys.each do |key|
    assign hash, key, pattern
  end
end
Notice that each recursive action carries its pattern into the next iteration to keep the key classification distinct.
Now, if we run these commands, a CSV file called master_database.csv will be created inside the organic_results folder.
json_data represents the organic_results array containing every organic_result hash we have.
Database.new json_data
Database.add_new_data_to_database json_data, csv_path = "organic_results/"
The end result is as desired:
Result | Key Pattern
---|---
Coffee - Wikipedia | title
https://en.wikipedia.org/wiki/Coffee | link
https://en.wikipedia.org › wiki › Coffee | displayed_link
... | ...
https://en.wikipedia.org/wiki/History_of_coffee | sitelinks__inline__n__link
... | ...
Tokenizing and Vocabulary Creation
Before we dive into creating hash-specific tables to be tokenized with the ngram iterator, let's define the functions responsible for tokenizing:
def self.default_dictionary_hash
  {
    /\"/ => "",
    /\'/ => " \' ",
    /\./ => " . ",
    /,/ => ", ",
    /\!/ => " ! ",
    /\?/ => " ? ",
    /\;/ => " ",
    /\:/ => " ",
    /\(/ => " ( ",
    /\)/ => " ) ",
    /\// => " / ",
    /\s+/ => " ",
    /<br \/>/ => " , ",
    /http/ => "http",
    /https/ => " https ",
  }
end
This function is responsible for creating the default dictionary hash that defines the split points for words fed into the tokenizer, and what they will be replaced with. We will be able to create understandable bits of vectors from these split points. Notice that I have included http and https since they are widely used in organic_results.
def self.tokenizer word, dictionary_hash = default_dictionary_hash
  word = word.downcase

  ## gsub! replaces every occurrence of each split point
  ## (sub! would only handle the first one)
  dictionary_hash.keys.each do |key|
    word.gsub!(key, dictionary_hash[key])
  end

  word.split
end
This is our main tokenizer. To give an example, if we run the following command:
Database.tokenizer "SerpApi, to. the: Moon"
["serpapi,", "to", ".", "the", "moon"]
def self.iterate_ngrams token_list, ngrams = 1
  1.upto(ngrams) do |n|
    ## Collect every contiguous n-token window in the list
    permutations = (token_list.size - n + 1).times.map { |i| token_list[i...(i + n)] }

    permutations.each do |perm|
      key = perm.join(" ")
      @@vocab[key] = @@vocab.size unless @@vocab.key?(key)
    end
  end
end
This is our ngram iterator. token_list here is the output of the tokenizer function. With this function, we create permutations out of different cut points; ngrams defines how wide the permutations should be.
To give an example, if we run the following command:
Database.iterate_ngrams ["serpapi,", "to", ".", "the", "moon"], ngrams=3
Our vocabulary (@@vocab) will be updated as follows:
{
"<unk>"=>0,
" "=>1,
"serpapi,"=>2,
"to"=>3,
"."=>4,
"the"=>5,
"moon"=>6,
"serpapi, to"=>7,
"to ."=>8,
". the"=>9,
"the moon"=>10,
"serpapi, to ."=>11,
"to . the"=>12,
". the moon"=>13
}
We will cover how to use these vectors for classification in next week's blog post. But to give an idea of what it will look like, the function responsible is:
def self.word_to_tensor word
  token_list = tokenizer word

  ## Out-of-vocabulary tokens fall back to the "<unk>" index
  token_list.map { |token| @@vocab[token] || @@vocab["<unk>"] }
end
So, if we feed our sentence:
Database.word_to_tensor "SerpApi, to. the: Moon"
[2, 3, 4, 5, 6]
This way we can express strings in a mathematical way.
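With the "<unk>" fallback above, out-of-vocabulary tokens resolve to index 0. Assuming the vocabulary we built in the ngram example, a hypothetical input would behave like this:

Database.word_to_tensor "SerpApi, to Mars"
[2, 3, 0]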
Key Specific CSV Creation
Let's define the functions that'll help in the creation of key-specific databases, and in their later use.
First, we need to define a function to save the final state of our vocabulary, which is expanded by every word fed into the key-specific databases:
def self.save_vocab vocab_path = ""
  path = "#{vocab_path}vocab.json"
  vocab = JSON.parse(@@vocab.to_json)
  File.write(path, JSON.pretty_generate(vocab))
end
The end result of this will be:
{
"<unk>": 0,
" ": 1,
"1": 2,
"coffee": 3,
"-": 4,
"wikipedia": 5,
"coffee -": 6,
...
}
We also need a way to check if a string consists only of numeric values. Since it is called as word.numeric? in the conditions below, we define it on String:
class String
  def numeric?
    return true if self =~ /\A\d+\Z/
    true if Float(self) rescue false
  end
end
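As a quick sanity check, these illustrative calls behave as follows:

"5".numeric? ## => true
"3.14".numeric? ## => true
"coffee".numeric? ## => false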
To create example key-value pairs for each key type:
def self.create_keys_and_examples
  keys = @@pattern_data.map { |pattern| pattern.second }.uniq

  examples = {}
  keys.each do |key|
    ## Store one example value (as a string) for each key
    example = @@pattern_data.find { |pattern| pattern.second == key }
    examples[key] = example.first.to_s
  end

  [keys, examples]
end
The end result will be a collection of unique keys, and a hash containing one example value per key to eliminate errors in the conditions below.
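To illustrate with the coffee data from earlier (return values hypothetical and trimmed):

keys ## => ["position", "title", "link", "displayed_link", ...]
examples ## => { "position" => "1", "title" => "Coffee - Wikipedia", ... }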
def self.create_key_specific_databases result_type = "organic_results", csv_path = nil, dictionary = nil, ngrams = nil, vocab_path = nil
  keys, examples = create_keys_and_examples

  keys.each do |key|
    specific_pattern_data = []
    @@pattern_data.each do |pattern|
      word = pattern.first.to_s
      next if word.blank?

      ## Feed every word into the vocabulary
      token_list = dictionary.present? ? tokenizer(word, dictionary) : tokenizer(word)
      ngrams.present? ? iterate_ngrams(token_list, ngrams) : iterate_ngrams(token_list)

      if key == pattern.second
        specific_pattern_data << [ 1, word ]
      elsif examples[key].to_i.to_s == examples[key] && word.to_i.to_s == word
        ## Integer words are omitted for keys with integer examples
        next
      elsif examples[key].to_i.to_s == examples[key] && word.numeric?
        specific_pattern_data << [ 0, word ]
      elsif examples[key].numeric? && word.numeric?
        next
      elsif key.split("__").last == pattern.second.to_s.split("__").last
        specific_pattern_data << [ 1, word ]
      else
        specific_pattern_data << [ 0, word ]
      end
    end

    path = "#{csv_path}#{result_type}__#{key}.csv"
    File.write(path, specific_pattern_data.map(&:to_csv).join)
  end

  vocab_path.present? ? save_vocab(vocab_path) : save_vocab
end
This is the main function responsible for creating a database for each key. Notice that integer words are omitted from the CSV of any key whose example value is an integer. This way we can eliminate confusion in cases such as rating: "5", which could be confused with reviews: "5". We also generalize the cases where the last inner keys are the same, to avoid confusion between the same kind of elements. An example would be position in the main hash and in one of its inner hashes; they represent the same key, so marking both with 1 could come in handy. We also add each word to our vocabulary to expand it, later to be saved as vocab.json.
The end result for one of the created CSV files (organic_results__about_page_link): 1 represents that this is the kind of result we want for such a key, and 0 represents the opposite.
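A hypothetical excerpt from such a file, following the [label, value] row format of the code above (values illustrative):

1,https://www.example.com/about
0,Coffee - Wikipedia
0,https://en.wikipedia.org/wiki/Coffee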
Full Code
Here's the full code for the class below:
require "csv"
require "json"
require "active_support/all" ## provides present?, blank?, and Array#second

## Used below as word.numeric? to check if a string is a number
class String
  def numeric?
    return true if self =~ /\A\d+\Z/
    true if Float(self) rescue false
  end
end

class Database
  def initialize json_data, vocab = { "<unk>" => 0, " " => 1 }
    super()
    @@pattern_data = []
    @@vocab = vocab
  end

  ## Related to creating the main database
  def self.add_new_data_to_database json_data, csv_path = nil
    json_data.each do |result|
      recursive_hash_pattern result, ""
    end

    @@pattern_data = @@pattern_data.reject { |pattern| pattern.include? nil }.uniq.compact
    path = "#{csv_path}master_database.csv"
    File.write(path, @@pattern_data.map(&:to_csv).join)
  end

  def self.element_pattern result, pattern
    @@pattern_data.append([result, pattern].flatten)
  end

  def self.element_array_pattern result, pattern
    result.each do |element|
      element_pattern element, pattern
    end
  end

  def self.assign hash, key, pattern
    if hash[key].is_a?(Hash)
      pattern = pattern.present? ? "#{pattern}__#{key}" : "#{key}"
      recursive_hash_pattern hash[key], pattern
    elsif hash[key].present? && hash[key].is_a?(Array) && hash[key].first.is_a?(Hash)
      pattern = pattern.present? ? "#{pattern}__#{key}__n" : "#{key}"
      hash[key].each do |hash_inside_array|
        recursive_hash_pattern hash_inside_array, pattern
      end
    elsif hash[key].present? && hash[key].is_a?(Array)
      pattern = pattern.present? ? "#{pattern}__n" : "#{key}"
      element_array_pattern hash[key], pattern
    else
      pattern = pattern.present? ? "#{pattern}__#{key}" : "#{key}"
      element_pattern hash[key], pattern
    end
  end

  def self.recursive_hash_pattern hash, pattern
    hash.keys.each do |key|
      assign hash, key, pattern
    end
  end

  ## Related to tokenizing
  def self.default_dictionary_hash
    {
      /\"/ => "",
      /\'/ => " \' ",
      /\./ => " . ",
      /,/ => ", ",
      /\!/ => " ! ",
      /\?/ => " ? ",
      /\;/ => " ",
      /\:/ => " ",
      /\(/ => " ( ",
      /\)/ => " ) ",
      /\// => " / ",
      /\s+/ => " ",
      /<br \/>/ => " , ",
      /http/ => "http",
      /https/ => " https ",
    }
  end

  def self.tokenizer word, dictionary_hash = default_dictionary_hash
    word = word.downcase

    dictionary_hash.keys.each do |key|
      word.gsub!(key, dictionary_hash[key])
    end

    word.split
  end

  def self.iterate_ngrams token_list, ngrams = 1
    1.upto(ngrams) do |n|
      permutations = (token_list.size - n + 1).times.map { |i| token_list[i...(i + n)] }

      permutations.each do |perm|
        key = perm.join(" ")
        @@vocab[key] = @@vocab.size unless @@vocab.key?(key)
      end
    end
  end

  def self.word_to_tensor word
    token_list = tokenizer word
    token_list.map { |token| @@vocab[token] || @@vocab["<unk>"] }
  end

  ## Related to creating key-specific databases
  def self.create_key_specific_databases result_type = "organic_results", csv_path = nil, dictionary = nil, ngrams = nil, vocab_path = nil
    keys, examples = create_keys_and_examples

    keys.each do |key|
      specific_pattern_data = []
      @@pattern_data.each do |pattern|
        word = pattern.first.to_s
        next if word.blank?

        token_list = dictionary.present? ? tokenizer(word, dictionary) : tokenizer(word)
        ngrams.present? ? iterate_ngrams(token_list, ngrams) : iterate_ngrams(token_list)

        if key == pattern.second
          specific_pattern_data << [ 1, word ]
        elsif examples[key].to_i.to_s == examples[key] && word.to_i.to_s == word
          next
        elsif examples[key].to_i.to_s == examples[key] && word.numeric?
          specific_pattern_data << [ 0, word ]
        elsif examples[key].numeric? && word.numeric?
          next
        elsif key.split("__").last == pattern.second.to_s.split("__").last
          specific_pattern_data << [ 1, word ]
        else
          specific_pattern_data << [ 0, word ]
        end
      end

      path = "#{csv_path}#{result_type}__#{key}.csv"
      File.write(path, specific_pattern_data.map(&:to_csv).join)
    end

    vocab_path.present? ? save_vocab(vocab_path) : save_vocab
  end

  def self.create_keys_and_examples
    keys = @@pattern_data.map { |pattern| pattern.second }.uniq

    examples = {}
    keys.each do |key|
      example = @@pattern_data.find { |pattern| pattern.second == key }
      examples[key] = example.first.to_s
    end

    [keys, examples]
  end

  def self.save_vocab vocab_path = ""
    path = "#{vocab_path}vocab.json"
    vocab = JSON.parse(@@vocab.to_json)
    File.write(path, JSON.pretty_generate(vocab))
  end
end
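To tie everything together, here's a hypothetical end-to-end run. The JSON file path is illustrative and assumes the organic_results array was saved from a SerpApi search beforehand; the ngrams value of 2 and the organic_results/ output folder are example choices:

json_data = JSON.parse(File.read("organic_results/example.json"))

Database.new json_data
Database.add_new_data_to_database json_data, "organic_results/"
Database.create_key_specific_databases "organic_results", "organic_results/", nil, 2, "organic_results/"
Database.word_to_tensor "Coffee - Wikipedia" ## e.g. => [3, 4, 5], given the vocabulary excerpt above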
Conclusion
Next week we'll utilize these CSV files, vectorizing their contents with the tokenizer, and create key-specific models for each key.
The end aim of this project is to create an open-source gem that anyone working with a JSON data structure in their code can use.
I'd like to thank the reader for their attention, and the brilliant people of SerpApi, who create wonders even in times of hardship, for all their support.