Intro

Choosing between common database solutions can be tough. In this post, I will share how we chose a database for a specific real-world case, give introductory information on how different databases store data, open-source a benchmarking script, discuss the factors that matter in the decision, and present the benchmark results in charts.

The Problem

SerpApi is an API for scraping Google and other search engines with fast, easy, and complete solutions. Our team is tackling a challenge within our operational database, where we house a Locations collection with predefined data structures. We aim to serve these locations to clients reliably and let them use the locations in their searches with SerpApi's Google Search API. You may register to claim free credits and try out our products.

The Present Solution and Its Shortcomings

We're operating a robust MongoDB server, a document-oriented NoSQL database, which stores the Locations collection. It lets us store data efficiently, manage complex data structures, and scale the Location model. Its adeptness at completion searches, fuzzy matches, and geospatial evaluations between points stands out. The Location collection is one of the primary components of the search object and is crucial for a large portion of our searches.

However, this approach has a scalability issue concerning portability. Tight coupling with a centralized database, alongside other collections with parent-child relations, limits our agility in customizing the schema and leaves little room for diverse operational settings. Altering even a small collection by adding or deleting entries in such a hierarchical database carries inherent risks due to its monolithic nature.

Additionally, constants loaded during server initialization, which could instead reside in a cloud database with superior indexing, consume significant amounts of memory. For this analysis, we'll focus on the Location model within this DBMS.

Proposed Solutions

I have devised three prototypes to overcome the portability hurdle. Each stores and serves the data in a different way. I graph the benchmark results to provide several performance views for each prototype.

Using a Static Sqlite3 File

Deploying a static Sqlite3 file, akin to a personal database on each server, offers remarkable portability, allows for server-level data management, and provides the advantages of a relational database management system (RDBMS). This reflects a distributed-database paradigm in which local updates can be methodically managed and validated through unit testing. Integration with our MongoDB documents is seamless, thanks to the BSON serialization of Ruby hashes, and SQLite is a popular choice for reaping the advantages of the relational model.
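
As a rough illustration, the prototype can open the bundled file through the Sequel gem (which the proximity query later in this post also uses). The file path and schema here are assumptions, not the production layout:

require 'sequel'

# A minimal sketch, assuming a `locations` table inside a SQLite file
# shipped with each deployment; path and column names are illustrative.
DB = Sequel.sqlite('db/locations.sqlite3')

class Location < Sequel::Model(DB[:locations])
end

# Example lookup against the local file.
Location.where(name: 'Austin').first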

Streaming from a Static JSON File

Streaming from a static JSON file, a form of semi-structured data, is another solution that keeps control at the server level for safe, localized development. Updating its key-value structures is relatively easy, and the approach represents a move toward a microservices-style architecture.
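
To make this concrete, here is a minimal sketch of the idea, assuming one JSON object per line (JSON Lines); the actual file layout and field names may differ:

require 'json'

# Scans the file line by line instead of loading it all into memory.
def stream_search(path, query)
  File.foreach(path) do |line|
    location = JSON.parse(line)
    return location if location['name'] == query
  end
  nil
end

stream_search('locations.jsonl', 'Austin')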

External Standalone MongoDB Server

A standalone MongoDB server is a non-relational alternative for storing child records of the search object, specifically the Location collection. This option demands a deeper understanding of database technology, but it offers real-time adjustments through a secured public IP, maintains data security, and harnesses MongoDB's 2dsphere index for spherical proximity queries. It also enjoys wide community support.
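
A hedged sketch of what the 2dsphere setup might look like with Mongoid; the field names and model layout are assumptions, not our production schema:

class Location
  include Mongoid::Document

  field :name, type: String
  field :gps,  type: Array # [longitude, latitude], GeoJSON order

  index({ gps: '2dsphere' }, background: true)

  # Returns the closest document using spherical geometry.
  def self.find_nearest_location(latitude, longitude)
    where(gps: {
      '$nearSphere' => {
        '$geometry' => { 'type' => 'Point', 'coordinates' => [longitude, latitude] }
      }
    }).first
  end
end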

Performance Benchmarks

The benchmarks run various queries and GPS coordinates and report average real-time performance over multiple iterations. These parameters act as a common data set for comparing performance metrics across the prototypes.

I applied the benchmark to the JSON Streaming prototype only up to a concurrency of 5. It was not efficient enough to benchmark all the way through in my local environment and hit severe bottlenecks.

There are important details to consider in this benchmark. The current solution, a monolithic MongoDB database, and the standalone MongoDB server both execute commands on a remote server, so their numbers depend on connection latency. The Sqlite3 and JSON Streaming prototypes, by contrast, are contained within an individual server and are affected only by system load and hardware throughput. The benchmark also records several kinds of timing; to keep things simple, we will use the average real time taken as our indicator.

You may find the complete benchmark below:

require 'benchmark'
require 'csv'

# Five sample queries paired (by index) with GPS coordinates:
# USA, an intentionally invalid input, Canada, Austin, and Ankara.
SAMPLE_QUERIES = ['USA', 'FalseInput', 'Canada', 'Austin', 'Ankara']
SAMPLE_COORDINATES = [[37.0902, -95.7129], [0, 0], [56.1304, -106.3468], [30.2672, -97.7431], [39.9334, 32.8597]]
TIMES = 5                         # iterations averaged per measurement
CONCURRENCY_LEVELS = (0..50).to_a # number of threads fired per run

# Dispatches a single lookup: :find_nearest_location takes coordinates,
# the text-based methods take a query string.
def perform_searches(model, method, query, latitude = nil, longitude = nil)
  if method == :find_nearest_location
    model.send(method, latitude, longitude)
  else
    model.send(method, query)
  end
end

# Fires `concurrency_level` threads running the same lookup and waits for all.
def concurrent_benchmark(model, method, concurrency_level, query, latitude = nil, longitude = nil)
  threads = []
  concurrency_level.times do
    threads << Thread.new do
      perform_searches(model, method, query, latitude, longitude)
    end
  end
  threads.each(&:join)
end

# Averages the Benchmark::Tms fields across TIMES runs.
def average_times(times)
  {
    real: times.sum(&:real) / TIMES,
    cpu: times.sum(&:total) / TIMES,
    system: times.sum(&:stime) / TIMES,
    user: times.sum(&:utime) / TIMES
  }
end

def benchmark_methods(model)
  results = {}

  CONCURRENCY_LEVELS.each do |concurrency|
    results[concurrency] = {}
    SAMPLE_QUERIES.each do |query|
      results[concurrency][query] = {}

      # The two text-based methods share the same call shape.
      [:search, :search_for_completion].each do |method|
        times = Array.new(TIMES) do
          Benchmark.measure do
            concurrent_benchmark(model, method, concurrency, query)
          end
        end
        results[concurrency][query][method] = average_times(times)
      end

      # The proximity method uses the coordinates paired with the query.
      latitude, longitude = SAMPLE_COORDINATES[SAMPLE_QUERIES.index(query)]
      times = Array.new(TIMES) do
        Benchmark.measure do
          concurrent_benchmark(model, :find_nearest_location, concurrency, query, latitude, longitude)
        end
      end
      results[concurrency][query][:find_nearest_location] = average_times(times)
    end
  end

  results
end

# Pretty-prints the averaged timings per concurrency level and query.
def print_results(results)
  results.each do |concurrency, concurrency_results|
    puts "Concurrency Level: #{concurrency}"
    concurrency_results.each do |query, methods|
      puts "Query: #{query}"
      methods.each do |method, times|
        puts "  #{method}:"
        puts "    Average real time over #{TIMES} runs: #{times[:real].round(5)}s"
        puts "    Average CPU time over #{TIMES} runs: #{times[:cpu].round(5)}s"
        puts "    Average system time over #{TIMES} runs: #{times[:system].round(5)}s"
        puts "    Average user time over #{TIMES} runs: #{times[:user].round(5)}s"
      end
    end
    puts "-" * 30
  end
end

results = benchmark_methods(Location)
print_results(results)

# Flattens the nested results hash into a CSV report.
def export_to_csv(results, filename = "benchmark_results.csv")
  CSV.open(filename, "wb") do |csv|
    csv << ["Concurrency Level", "Query", "Method", "Average Real Time", "Average CPU Time", "Average System Time", "Average User Time"]
    results.each do |concurrency, concurrency_results|
      concurrency_results.each do |query, methods|
        methods.each do |method, times|
          csv << [concurrency, query, method, times[:real], times[:cpu], times[:system], times[:user]]
        end
      end
    end
  end
end

export_to_csv(results, "benchmark_results.csv")

Overall Benchmark Results

The overarching benchmarks, visualized in the graph above, reveal that the JSON Streaming prototype is non-viable for our high-performance standards, primarily due to the heavy optimization it would require to handle large data sets. Its data is excluded from the remaining charts, which also makes the finer details of the other data points easier to read. It is simply not a fit prototype for production use.

Benchmark Results for Proximity Estimation

As depicted above, the prototypes and the current solution demonstrate diverse performance levels in geospatial proximity estimation. The MongoDB model showcases the strength of its geospatial indexing capabilities.

While MongoDB has 2dsphere, for Sqlite3 I have used the simplification below:

# Orders by the sum of absolute coordinate differences (Manhattan distance
# on raw lat/lng) rather than true spherical distance.
where(Sequel.~(gps_lat: nil))
  .where(Sequel.~(gps_lng: nil))
  .order(Sequel.lit("ABS(gps_lat - ?) + ABS(gps_lng - ?)", latitude, longitude))
  .first

The Haversine formula is much more effective at giving accurate approximations on dense maps, since it accounts for the spherical distance between points. For the purposes of this task, though, using the absolute coordinate difference as an indicator is enough. Notably, MongoDB excels at speed even though it is computing the Haversine formula.
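
For reference, here is a minimal Ruby sketch of the Haversine computation; the function name and radius constant are illustrative, not part of either prototype:

EARTH_RADIUS_KM = 6371.0

# Great-circle distance between two points via the Haversine formula.
def haversine_km(lat1, lng1, lat2, lng2)
  to_rad = ->(deg) { deg * Math::PI / 180 }
  dlat = to_rad.call(lat2 - lat1)
  dlng = to_rad.call(lng2 - lng1)
  a = Math.sin(dlat / 2)**2 +
      Math.cos(to_rad.call(lat1)) * Math.cos(to_rad.call(lat2)) * Math.sin(dlng / 2)**2
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a))
end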

Also, the current monolithic solution and the standalone prototype give roughly the same numbers across different concurrency levels. It is worth pointing out that we can easily upgrade the standalone server's hardware.

Benchmark Results for Fuzzy Search Capability

The above chart shows the results for fuzzy search, a feature critical in web applications, which finds results sharing common words with the user's input. Although the difference is not large, the Sqlite3 prototype is clearly ahead in speed, and the margin between the standalone prototype and the current monolithic server is widening. One point needs attention, though: the Sqlite3 prototype uses a SQL LIKE clause for its fuzzy search. An equivalent construct for MongoDB would likely yield faster search completions for the standalone MongoDB prototype as well.
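
To make the comparison concrete, here is a hedged sketch of the two approaches; the exact method internals in our prototypes are assumptions here:

# Sqlite3 prototype: fuzzy match via a SQL LIKE clause (Sequel).
Location.where(Sequel.like(:name, "%#{query}%")).all

# A rough MongoDB counterpart: a case-insensitive regular expression.
# Whether it can use an index depends on the pattern, so this is only
# an approximation of LIKE, not a drop-in equivalent.
Location.where(name: /#{Regexp.escape(query)}/i).to_a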

Direct Search Method Benchmark Results

This is the method responsible for the overwhelming majority of the searches SerpApi processes, so any gain here is crucial to the overall comparison. Although the Sqlite3 prototype beats the standalone MongoDB prototype at lower concurrency levels, the latter stays consistent at high concurrency and wins out, which is pivotal for large data volumes and complex data-management scenarios. The difference between the current monolithic MongoDB solution and the standalone MongoDB prototype amounts to only milliseconds, and the standalone database can be given more resources to compete.

Conclusion

Having discarded the JSON Streaming prototype for its concurrency shortcomings with large data sets, we are left to decide between the Sqlite3 and standalone MongoDB prototypes, weighing several factors across the two database types and the architecture each would live in.

Any lack of efficiency in the Sqlite3 prototype would have to be addressed by upgrading the resources of each individual server in the cluster, which is a bad sign for a high-load task and a threat to atomicity and data integrity. The need to dig deep into SQL is also a downside, although it is not as tough as learning a new programming language. On top of these, the inefficiency of the search method alone is reason enough not to go forward with this model. However, bringing the efficiency of the Sqlite3 prototype's search-for-completion method to a standalone MongoDB server would be beneficial: anything that works like SQL's LIKE clause in MongoDB could benefit SerpApi's Playground for the end user. The standalone MongoDB server should also be given additional resources to get as close to the current solution's performance as possible.

In conclusion, a standalone MongoDB database is the right structure for storing secondary constants and collections in our use case. Thank you for your attention. I hope this post sheds some light on how to benchmark different prototype solutions to a single database problem.