Nokolexbor - a performance-focused HTML parser for Ruby

There aren't many choices of HTML parsers in the Ruby ecosystem. The most obvious one is Nokogiri, which we used heavily at SerpApi. As time passed, we gradually became dissatisfied with Nokogiri's performance. Although it relies mainly on libxml2, an XML processor written in C, it is not optimized for HTML-specific tasks. We once contributed to Nokogiri to improve its performance considerably. But as Illia (the author) said:

800 ms to extract data from HTML is still too much.

He also created an experimental library, NokogiriRust, which tried to use the Rust crate scraper while staying API-compatible with Nokogiri. Its benchmark showed a 60x speedup on at_css.text. This showed that it is possible to replace libxml2 with a high-performance, production-ready HTML parser. Sadly, the project didn't continue.

Now, we are back to the task. Our goal is to:

  • Create a new Ruby HTML parser with a superfast underlying parser engine.
  • Make it API-compatible with Nokogiri.
  • Make it capable of searching nodes with both CSS selectors and XPath.

Development of Nokolexbor

We investigated recent HTML parsers in the C and Rust world, and picked Lexbor as the core of our new library. Its APIs are very similar to Nokogiri's, and its performance is much higher than libxml2's. Also, a C library is easier than a Rust one to compile when users install the gem. Lexbor already had a number of bindings, such as Python, Erlang and Crystal, which made us more confident that we could do the same for Ruby. As a result, Nokolexbor was born.

During development, we soon encountered two problems that we had to address early on.

  1. CSS selectors don't support selecting text nodes, but we select text nodes extensively with Nokogiri. We had to patch Lexbor to support it somehow.
  2. Lexbor doesn't support XPath, which we used with Nokogiri. XPath can be converted to CSS selectors, but not in all cases. To be maximally compatible with Nokogiri, it was better to implement the XPath algorithm on top of Lexbor's data structures.

Solving the first problem turned out to be easier than we thought. Thanks to Lexbor's great code generators, we soon patched Lexbor and added a ::text pseudo-element to represent text nodes. Selecting all text nodes under a div can be as easy as node.css('div ::text').

The second problem was a monster. The only C implementation of XPath we found was the one in libxml2, which has a very large and messy code base. The algorithm file, xpath.c, has over 14k lines of code by itself, plus a number of references to other modules. We had a hard time porting the code. Fortunately, we conquered it, and the algorithm was nicely integrated with Lexbor. We were able to select nodes using XPath the same way as in Nokogiri: node.xpath('.//div//text()').
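To make the two selector styles concrete, here is a minimal sketch that runs the two queries quoted above against a small, made-up HTML snippet. It assumes the nokolexbor gem is installed and that Nokolexbor::HTML, css, xpath and text behave like their Nokogiri counterparts; the require is guarded so the script simply skips the queries when the gem is missing.

```ruby
begin
  require 'nokolexbor'
rescue LoadError
  # Assumption: the gem may not be installed; the queries below are then skipped.
end

# Made-up input for illustration only.
html = '<div><p>Hello</p> <span>world</span></div>'

if defined?(Nokolexbor)
  doc = Nokolexbor::HTML(html)

  # Text nodes under a div, using the patched-in ::text pseudo-element.
  css_texts = doc.css('div ::text').map(&:text)

  # The same kind of query expressed with XPath.
  xpath_texts = doc.xpath('.//div//text()').map(&:text)

  puts css_texts.inspect
  puts xpath_texts.inspect
end
```

Both calls are meant to return the text nodes found under the div, so the two printed lists should contain the same pieces of text.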

The rest of the development went into making Nokolexbor behave the same way as Nokogiri. Some notable updates were:

  • Patch Lexbor's case matching of tag names, class names and ids.
  • Patch Lexbor to be able to select nodes inside <template> tags.
  • Make the resulting node set unique and ordered by document traversal order.
  • Enrich the APIs and make them compatible with Nokogiri's.
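The "unique and in document order" requirement can be sketched in plain Ruby. This is only an illustration on a toy tree, not Nokolexbor's actual C implementation; DocNode, document_positions and normalize_node_set are hypothetical names introduced for this example.

```ruby
# Hypothetical toy node; stands in for a parsed element.
DocNode = Struct.new(:name, :children)

# Map each node (by identity) to its pre-order position in the tree.
def document_positions(root)
  positions = {}.compare_by_identity
  stack = [root]
  until stack.empty?
    node = stack.pop
    positions[node] = positions.size
    # Push children reversed so the leftmost child is visited first.
    stack.concat(node.children.reverse)
  end
  positions
end

# Drop duplicate nodes, then sort the rest into document order.
def normalize_node_set(nodes, positions)
  nodes.uniq(&:object_id).sort_by { |node| positions.fetch(node) }
end

a = DocNode.new('a', [])
b = DocNode.new('b', [])
c = DocNode.new('c', [])
root = DocNode.new('root', [DocNode.new('div', [a, b]), c])

# Raw matches, e.g. merged from several selectors: out of order, with a repeat.
raw_matches = [c, a, b, a]
result = normalize_node_set(raw_matches, document_positions(root))
p result.map(&:name) # prints ["a", "b", "c"]
```

Nodes are compared by identity rather than by value, since two distinct elements can look identical.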

Apart from functionality, we've also ensured that the library has no memory leaks, which is one of the biggest concerns when using a C library in production. We've written a separate blog post on this topic.

Benchmarks

We benchmarked parsing a Google result page (368 KB) and searching nodes using CSS and XPath. The benchmarks were run on a MacBook Pro (2019) with a 2.3 GHz 8-core Intel Core i9.

            Nokolexbor (iters/s)   Nokogiri (iters/s)   Diff
  parsing   487.6                  93.5                 5.22x faster
  at_css    50798.8                50.9                 997.87x faster
  css       7437.6                 52.3                 142.11x faster
  at_xpath  57.077                 53.176               same-ish
  xpath     51.523                 58.438               same-ish

Parsing and searching with CSS selectors were significantly faster thanks to Lexbor. XPath performance is about the same because both libraries use the XPath algorithm from libxml2.

You might be astonished by the 997x improvement on at_css. That is because Nokolexbor returns the matched node as soon as it finds one and stops searching. Nokogiri's at_css implementation, on the other hand, was

  def at_css(*args)
    css(*args).first
  end

which has no optimization at all: it collects every match and then discards all but the first. Searching past the first occurrence is wasted work.
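The difference between the two strategies can be illustrated on a toy tree in plain Ruby. This is a sketch, not the code of either library; TreeNode and both helper methods are hypothetical, and the visited arrays exist only to count how many nodes each strategy touches.

```ruby
# Hypothetical toy node; stands in for a parsed element.
TreeNode = Struct.new(:name, :children) do
  # Depth-first traversal in document order, yielding every descendant.
  def each_descendant(&block)
    children.each do |child|
      block.call(child)
      child.each_descendant(&block)
    end
  end
end

# Full search, then take the first match: mirrors `css(*args).first`.
def first_via_full_search(root, name, visited)
  matches = []
  root.each_descendant do |node|
    visited << node
    matches << node if node.name == name
  end
  matches.first
end

# Early exit: return from the method as soon as one node matches.
def first_via_early_exit(root, name, visited)
  root.each_descendant do |node|
    visited << node
    return node if node.name == name
  end
  nil
end

leaf = ->(name) { TreeNode.new(name, []) }
root = TreeNode.new('html', [
  TreeNode.new('body', [
    TreeNode.new('div', [leaf.call('a'), leaf.call('span')]),
    TreeNode.new('div', [leaf.call('a')]),
    leaf.call('p')
  ])
])

full_visits = []
early_visits = []
full_result = first_via_full_search(root, 'a', full_visits)
early_result = first_via_early_exit(root, 'a', early_visits)

puts full_result.equal?(early_result) # prints true: both find the same node
puts full_visits.size                 # prints 7: every descendant was visited
puts early_visits.size                # prints 3: stopped at the first match
```

On a real result page with thousands of nodes and a match near the top, the gap between the two visit counts grows accordingly.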

What's Next

We are happy to open-source Nokolexbor. Feel free to try it out. And of course, contributions are welcome!

We'll keep developing Nokolexbor to make it as close to 1:1 compatible with Nokogiri as possible. We hope it can be an alternative to Nokogiri when performance is a concern.