Introduction
With modern, overengineered, and over-obfuscated websites, we at SerpApi face increasing challenges with extracting data from them. Besides the usual HTML parsing, sometimes we're literally forced to fall back to good ol' regular expressions, e.g. for extracting embedded JS data. And while regexps do the trick, they might come at a cost.
Onigmo, the default regexp engine in Ruby, while substantially updated in Ruby 3.2, still has weak points that can be really frustrating in terms of scan time, adding latency to our search requests.
Let's find out what alternatives are available in the wild and how they compare to Ruby.
Contenders
re2
It's developed by Google and widely used in various Google products. Under the hood it uses what they call "an on-the-fly deterministic finite-state automaton algorithm based on Ken Thompson's Plan 9 grep". It is stated that re2 was designed with an explicit goal of being able to handle regular expressions from untrusted sources, i.e. to be resistant to ReDoS attacks. There is a well-maintained Ruby bindings gem.
rust/regex
The native regex engine in Rust. According to rebar, it's one of the fastest engines overall, and it uses the same approach as re2 of building a DFA at search time. There are no up-to-date, ready-to-use Ruby bindings, so I've created a simple PoC for this comparison.
pcre2
One of the best-known regex engines due to wide adoption across many commercial and open-source products, as well as languages like PHP and R, where it's the default engine. It supports a separate JIT mode that improves search time significantly in most cases. Unfortunately, the Ruby bindings are outdated and do not work properly. For instance, the JIT mode mentioned above cannot be enabled with the latest binaries, so the engine is not worth including in the comparison.
Benchmarks
The benchmarks presented here are variations of the rebar ones, specifically those that are validated with the count and count-spans models.
The following results were gathered using:
- ruby 3.4.3 (2025-04-14 revision d0b7e5b6a0) +PRISM [x86_64-linux]
- re2 (2.15.0)
- rust_regexp (0.1.2)
- DigitalOcean CPU-optimized Intel Premium instance with 4 vCPUs / 8 GB
- Ubuntu 24.04.1 LTS
Benchmark scripts with haystack data and raw benchmark results, including macOS / M4 Max results, can be found on GitHub.
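The result blocks below report iterations per second for each engine, followed by a comparison. As a rough, standard-library-only approximation of what such a measurement does (this is not the actual benchmark script from the repository), the sketch below counts how many times a block runs within a time budget. The helper name iterations_per_second and the sample haystack are made up for illustration.

```ruby
# Rough stand-in for an iterations-per-second measurement.
# `iterations_per_second` is a hypothetical helper, not part of any gem.
def iterations_per_second(budget_seconds = 1.0)
  iterations = 0
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  elapsed = 0.0

  # Run the block repeatedly until the time budget is used up.
  while elapsed < budget_seconds
    yield
    iterations += 1
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end

  iterations / elapsed
end

# Illustrative haystack; the real benchmarks use the rebar haystacks.
haystack = "Mr. Sherlock Holmes and Dr. Watson. " * 1_000
pattern  = /Sherlock Holmes/

ips = iterations_per_second(0.2) { haystack.scan(pattern).size }
puts format("ruby %.1f i/s", ips)
```

A real harness would also warm up before measuring and report the deviation between runs, which this sketch skips.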
Literal
From rebar:
This group of benchmarks measures regex patterns that are simple literals. It is mainly meant to demonstrate two things. Firstly is whether the regex engine does some of the most basic forms of optimization by recognizing that a pattern is just a literal, and that a full blown regex engine is probably not needed. Indeed, naively using a regex engine for this case is likely to produce measurements much worse than most regex engines. Secondly is how the performance of simple literal searches changes with respect to both case insensitivity and Unicode. Namely, substring search algorithms that work well on ASCII text don't necessarily also work well on UTF-8 that contains many non-ASCII codepoints. This is especially true for case insensitive searches.
With an English haystack with a minimal number of German words, i.e. Unicode characters with umlauts, rust/regex takes the lead, while re2 and ruby are not that far apart.
-- [literal/sherlock-en]
Calculating -------------------------------------
ruby 2.169k (± 0.8%) i/s (460.97 μs/i) - 10.850k in 5.001876s
re2 2.510k (± 1.5%) i/s (398.37 μs/i) - 12.550k in 5.000721s
rust/regex 12.248k (± 0.4%) i/s (81.65 μs/i) - 61.350k in 5.009252s
Comparison:
rust/regex: 12247.5 i/s
re2: 2510.2 i/s - 4.88x slower
ruby: 2169.3 i/s - 5.65x slower
With the same haystack but case-insensitive matching, ruby becomes significantly slower.
-- [literal/sherlock-casei-en]
Calculating -------------------------------------
ruby 158.153 (± 1.9%) i/s (6.32 ms/i) - 795.000 in 5.028382s
re2 596.211 (± 0.2%) i/s (1.68 ms/i) - 3.009k in 5.046886s
rust/regex 5.605k (± 0.4%) i/s (178.40 μs/i) - 28.050k in 5.004126s
Comparison:
rust/regex: 5605.5 i/s
re2: 596.2 i/s - 9.40x slower
ruby: 158.2 i/s - 35.44x slower
With a Unicode-specific haystack, i.e. fully Cyrillic text, re2 suddenly becomes slower than ruby. This re2 tendency of not being "Unicode-friendly" will be encountered many times later.
-- [literal/sherlock-ru]
Calculating -------------------------------------
ruby 1.157k (± 0.8%) i/s (864.47 μs/i) - 5.865k in 5.070396s
re2 288.332 (± 0.3%) i/s (3.47 ms/i) - 1.456k in 5.049813s
rust/regex 6.721k (± 0.5%) i/s (148.79 μs/i) - 34.221k in 5.091769s
Comparison:
rust/regex: 6721.0 i/s
ruby: 1156.8 i/s - 5.81x slower
re2: 288.3 i/s - 23.31x slower
Ultimately, ruby's poor case-insensitivity performance prevails over re2's poor Unicode performance, so with Cyrillic text and the i flag, ruby becomes the slowest one again.
-- [literal/sherlock-casei-ru]
Calculating -------------------------------------
ruby 51.814 (± 0.0%) i/s (19.30 ms/i) - 260.000 in 5.018218s
re2 352.650 (± 0.3%) i/s (2.84 ms/i) - 1.785k in 5.061719s
rust/regex 2.722k (± 0.4%) i/s (367.41 μs/i) - 13.821k in 5.077995s
Comparison:
rust/regex: 2721.8 i/s
re2: 352.6 i/s - 7.72x slower
ruby: 51.8 i/s - 52.53x slower
The last example in this benchmark group is Chinese text, and re2 takes the last place. However, the difference is not that big compared to the Cyrillic one.
-- [literal/sherlock-zh]
Calculating -------------------------------------
ruby 6.360k (± 0.4%) i/s (157.23 μs/i) - 32.334k in 5.083835s
re2 2.233k (± 0.3%) i/s (447.77 μs/i) - 11.373k in 5.092542s
rust/regex 30.128k (± 0.3%) i/s (33.19 μs/i) - 151.200k in 5.018621s
Comparison:
rust/regex: 30128.2 i/s
ruby: 6360.2 i/s - 4.74x slower
re2: 2233.3 i/s - 13.49x slower
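The literal group asks whether an engine notices that a pattern is just a literal. Within plain Ruby, one way to see what "a full blown regex engine is probably not needed" means is to count matches twice: once through a Regexp and once through a plain String pattern, which String#scan also accepts. This is only a sketch with a made-up haystack; the helper names are hypothetical.

```ruby
# Count non-overlapping occurrences of a literal through the regexp engine.
# Regexp.escape keeps every character of the needle literal.
def count_with_regexp(haystack, needle, ignore_case: false)
  options = ignore_case ? Regexp::IGNORECASE : 0
  haystack.scan(Regexp.new(Regexp.escape(needle), options)).size
end

# Count the same occurrences with a plain substring search.
def count_with_string(haystack, needle)
  haystack.scan(needle).size
end

haystack = "Sherlock Holmes met Fräulein Müller; SHERLOCK HOLMES smiled."

puts count_with_regexp(haystack, "Sherlock Holmes")                    # 1
puts count_with_string(haystack, "Sherlock Holmes")                    # 1
puts count_with_regexp(haystack, "sherlock holmes", ignore_case: true) # 2
```

Both case-sensitive counts agree, so for a pure literal the choice between them is about speed only, which is what this benchmark group measures.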
Literal with alternation
From rebar:
This group is like literal, but expands the complexity from a simple literal to a small alternation of simple literals, including case insensitive variants where applicable. This benchmark ups the ante when it comes to literal optimizations. Namely, for a regex engine to optimize this case, it generally needs to be capable of reasoning about literal optimizations that require one or more literals from a set to match. Many regex engines don't deal with this case well, or at all.
This group of benchmarks reuses haystacks from literal but with alternated patterns. Alternation in ruby becomes a weak point in the same way as case-insensitivity. The ranking of engines is the same across all examples: rust/regex, re2, and then ruby.
-- [literal-alt/sherlock-en]
Calculating -------------------------------------
ruby 175.748 (± 0.0%) i/s (5.69 ms/i) - 884.000 in 5.029956s
re2 552.693 (± 0.7%) i/s (1.81 ms/i) - 2.805k in 5.075420s
rust/regex 6.407k (± 0.4%) i/s (156.08 μs/i) - 32.100k in 5.010115s
Comparison:
rust/regex: 6407.2 i/s
re2: 552.7 i/s - 11.59x slower
ruby: 175.7 i/s - 36.46x slower
-- [literal-alt/sherlock-casei-en]
Calculating -------------------------------------
ruby 83.630 (± 0.0%) i/s (11.96 ms/i) - 424.000 in 5.070034s
re2 550.273 (± 0.4%) i/s (1.82 ms/i) - 2.805k in 5.097541s
rust/regex 2.896k (± 0.5%) i/s (345.30 μs/i) - 14.739k in 5.089453s
Comparison:
rust/regex: 2896.1 i/s
re2: 550.3 i/s - 5.26x slower
ruby: 83.6 i/s - 34.63x slower
-- [literal-alt/sherlock-ru]
Calculating -------------------------------------
ruby 29.989 (± 0.0%) i/s (33.35 ms/i) - 150.000 in 5.001989s
re2 324.299 (± 0.6%) i/s (3.08 ms/i) - 1.632k in 5.032606s
rust/regex 2.292k (± 0.5%) i/s (436.34 μs/i) - 11.679k in 5.096204s
Comparison:
rust/regex: 2291.8 i/s
re2: 324.3 i/s - 7.07x slower
ruby: 30.0 i/s - 76.42x slower
-- [literal-alt/sherlock-casei-ru]
Calculating -------------------------------------
ruby 12.274 (± 0.0%) i/s (81.47 ms/i) - 62.000 in 5.051406s
re2 314.334 (± 0.6%) i/s (3.18 ms/i) - 1.581k in 5.029859s
rust/regex 627.731 (± 0.6%) i/s (1.59 ms/i) - 3.162k in 5.037377s
Comparison:
rust/regex: 627.7 i/s
re2: 314.3 i/s - 2.00x slower
ruby: 12.3 i/s - 51.14x slower
-- [literal-alt/sherlock-zh]
Calculating -------------------------------------
ruby 84.924 (± 0.0%) i/s (11.78 ms/i) - 432.000 in 5.087020s
re2 737.563 (± 0.1%) i/s (1.36 ms/i) - 3.723k in 5.047722s
rust/regex 10.016k (± 0.4%) i/s (99.84 μs/i) - 50.745k in 5.066428s
Comparison:
rust/regex: 10016.1 i/s
re2: 737.6 i/s - 13.58x slower
ruby: 84.9 i/s - 117.94x slower
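For reference, an alternation of literals like the ones in this group can be built in plain Ruby with Regexp.union, which escapes each literal and joins them with |. The sketch below is only an illustration: the helper name, the list of names, and the haystack are made up and are not the benchmark inputs.

```ruby
# Build one alternation regexp from a list of literal strings.
# `literal_alternation` is a hypothetical helper for illustration.
def literal_alternation(literals, ignore_case: false)
  union = Regexp.union(literals)
  # Regexp.union has no options argument, so rebuild with the i flag if needed.
  ignore_case ? Regexp.new(union.source, Regexp::IGNORECASE) : union
end

names    = ["Sherlock Holmes", "John Watson", "Irene Adler"]
haystack = "Sherlock Holmes wrote to JOHN WATSON about Irene Adler."

p haystack.scan(literal_alternation(names))
# => ["Sherlock Holmes", "Irene Adler"]

p haystack.scan(literal_alternation(names, ignore_case: true))
# => ["Sherlock Holmes", "JOHN WATSON", "Irene Adler"]
```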
Date
From rebar:
This is a monster regex for extracting dates from unstructured text from the datefinder project written in Python. The regex itself was taken from printing the DATES_PATTERN variable in the datefinder project. I then removed all names from the capture groups, unnecessary escapes and collapsed it to a single line (because not all regex engines support verbose mode). The regex is more akin to a tokenizer, and the datefinder library attempts to combine these tokens into timestamps.
This is another example of automata-oriented engines outperforming ruby, which is a backtracker.
-- [date/ascii]
Calculating -------------------------------------
ruby 0.583 (± 0.0%) i/s (1.71 s/i) - 3.000 in 5.141671s
re2 13.299 (± 0.0%) i/s (75.19 ms/i) - 67.000 in 5.038107s
rust/regex 18.069 (± 0.0%) i/s (55.34 ms/i) - 91.000 in 5.036226s
Comparison:
rust/regex: 18.1 i/s
re2: 13.3 i/s - 1.36x slower
ruby: 0.6 i/s - 30.97x slower
Cloudflare ReDoS
From rebar:
This benchmark uses a regex that helped cause an outage at Cloudflare. This class of vulnerability is typically called a "regular expression denial of service," or "ReDoS" for short. It doesn't always require a malicious actor to trigger. Since it can be difficult to reason about the worst case performance of a regex when using an unbounded backtracking implementation, it might happen entirely accidentally on valid inputs.
The particular regexp that contributed to the outage was:
(?:(?:"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|`|\-|\+)+[)]*;?((?:\s|-|~|!|\{\}|\|\||\+)*.*(?:.*=.*)))
As discussed in Cloudflare's post-mortem, the specific problematic portion of the regexp is:
.*(?:.*=.*)
Or more simply:
.*.*=.*;
Here are the results for the original regexp, along with the simplified variant with short and long haystacks compared.
-- [cloudflare-redos/original]
Calculating -------------------------------------
ruby 38.874k (± 0.3%) i/s (25.72 μs/i) - 197.268k in 5.074535s
re2 269.196k (± 0.3%) i/s (3.71 μs/i) - 1.352M in 5.020755s
rust/regex 307.633k (± 0.3%) i/s (3.25 μs/i) - 1.568M in 5.097012s
Comparison:
rust/regex: 307633.0 i/s
re2: 269195.8 i/s - 1.14x slower
ruby: 38874.4 i/s - 7.91x slower
-- [cloudflare-redos/simplified-short]
Calculating -------------------------------------
ruby 118.655k (± 0.4%) i/s (8.43 μs/i) - 596.200k in 5.024702s
re2 275.180k (± 1.8%) i/s (3.63 μs/i) - 1.378M in 5.010738s
rust/regex 2.300M (± 0.4%) i/s (434.81 ns/i) - 11.682M in 5.079644s
Comparison:
rust/regex: 2299833.5 i/s
re2: 275180.0 i/s - 8.36x slower
ruby: 118655.3 i/s - 19.38x slower
-- [cloudflare-redos/simplified-long]
Calculating -------------------------------------
ruby 1.391k (± 1.0%) i/s (719.03 μs/i) - 7.000k in 5.033739s
re2 5.373k (± 0.7%) i/s (186.13 μs/i) - 27.183k in 5.059767s
rust/regex 39.191k (± 1.3%) i/s (25.52 μs/i) - 197.166k in 5.031788s
Comparison:
rust/regex: 39190.6 i/s
re2: 5372.7 i/s - 7.29x slower
ruby: 1390.8 i/s - 28.18x slower
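Because ruby is a backtracker, one way to hedge against regexps like this one is to give them a time budget. Ruby 3.2 and later accept a timeout: keyword in Regexp.new and raise Regexp::TimeoutError when the budget is exceeded. The sketch below wraps that in a hypothetical helper; the 0.5 second budget and the sample inputs are arbitrary choices for illustration.

```ruby
# Requires Ruby 3.2+ for the per-regexp `timeout:` option.
SIMPLIFIED_PATTERN = Regexp.new('.*.*=.*', timeout: 0.5)

# Hypothetical helper: returns true or false for a completed match attempt,
# or :timeout when the regexp ran out of its time budget.
def guarded_match(regexp, haystack)
  regexp.match?(haystack)
rescue Regexp::TimeoutError
  :timeout
end

p guarded_match(SIMPLIFIED_PATTERN, "x=x")            # => true
p guarded_match(SIMPLIFIED_PATTERN, "no equals sign") # => false
```

A timeout only caps the damage; it does not make the scan itself any faster, which is where the automata-based engines above differ.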
Words
From rebar:
This benchmark measures how long it takes for a regex engine to find words in a haystack. We compare one regex that finds all words, \b\w+\b and another regex that only looks for longer words, \b\w{12,}\b. The split between finding all words and finding only long words tends to highlight the overhead of matching in each regex engine. Regex engines that are quicker to get in and out of its match routine do better at finding all words than regex engines that have higher overhead.
This group of benchmarks uses English and Cyrillic haystacks from the literal group, but there are a few limitations. The \b matcher is not Unicode-aware in re2, producing a slightly different match count with English text (because of the inclusion of umlauts) and completely different results with Cyrillic text, so the latter was excluded from the comparison.
Forcing re2 to match Unicode characters makes it slower than ruby once again.
-- [words/all-english]
Calculating -------------------------------------
ruby 194.375 (± 1.0%) i/s (5.14 ms/i) - 988.000 in 5.083383s
re2 102.214 (± 0.0%) i/s (9.78 ms/i) - 520.000 in 5.087453s
rust/regex 528.470 (± 0.6%) i/s (1.89 ms/i) - 2.652k in 5.018450s
Comparison:
rust/regex: 528.5 i/s
ruby: 194.4 i/s - 2.72x slower
re2: 102.2 i/s - 5.17x slower
Matching only long words filters out any words with Unicode characters beforehand, so re2 does not get penalized. What's more, long bounded repeats seem to be more efficient in re2 than in rust/regex.
-- [words/long-english]
Calculating -------------------------------------
ruby 351.391 (± 0.6%) i/s (2.85 ms/i) - 1.785k in 5.079936s
re2 6.397k (± 0.5%) i/s (156.33 μs/i) - 32.000k in 5.002787s
rust/regex 852.843 (± 0.2%) i/s (1.17 ms/i) - 4.335k in 5.083041s
Comparison:
re2: 6396.6 i/s
rust/regex: 852.8 i/s - 7.50x slower
ruby: 351.4 i/s - 18.20x slower
Bounded repeat
From rebar:
This group of benchmarks measures how well regex engines do with bounded repeats. Bounded repeats are sub-expressions that are permitted to match up to some fixed number of times. For example, a{3,5} matches 3, 4 or 5 consecutive a characters. Unlike unbounded repetition operators, the regex engine needs some way to track when the bound has reached its limit. For this reason, many regex engines will translate a{3,5} to aaaa?a?. Given that the bounds may be much higher than 5 and that the sub-expression may be much more complicated than a single character, bounded repeats can quickly cause the underlying matcher to balloon in size.
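The translation described in the quote can be sketched in a few lines of Ruby. The helper below is hypothetical and only handles an atom that is already a single regexp unit (one character, class, or group); it is meant to show why the expanded form grows with the bound, not how any particular engine implements it.

```ruby
# Expand atom{min,max} into the explicit form: `min` required copies followed
# by `max - min` optional copies. `expand_bounded_repeat` is a hypothetical
# helper; `atom` must already be a single regexp atom.
def expand_bounded_repeat(atom, min, max)
  raise ArgumentError, "max must be >= min" if max < min

  (atom * min) + ("#{atom}?" * (max - min))
end

expanded = expand_bounded_repeat("a", 3, 5)
puts expanded # aaaa?a?

# When anchored, the compact and the expanded forms accept the same inputs.
compact  = /\Aa{3,5}\z/
explicit = Regexp.new("\\A#{expanded}\\z")
(0..7).each do |n|
  input = "a" * n
  raise "mismatch at length #{n}" unless compact.match?(input) == explicit.match?(input)
end

# The expanded source grows linearly with the bound and with the atom size.
puts expand_bounded_repeat("[a-z]", 0, 1000).size # 6000
```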
With comparatively short bounded repeats of English letters, re2 still performs very well compared to ruby.
-- [bounded-repeat/letters-en]
Calculating -------------------------------------
ruby 160.076 (± 1.2%) i/s (6.25 ms/i) - 816.000 in 5.098412s
re2 694.208 (± 0.9%) i/s (1.44 ms/i) - 3.519k in 5.069443s
rust/regex 2.046k (± 0.3%) i/s (488.76 μs/i) - 10.250k in 5.009798s
Comparison:
rust/regex: 2046.0 i/s
re2: 694.2 i/s - 2.95x slower
ruby: 160.1 i/s - 12.78x slower
But it completely loses with Unicode again.
-- [bounded-repeat/letters-ru]
Calculating -------------------------------------
ruby 84.089 (± 0.0%) i/s (11.89 ms/i) - 424.000 in 5.042387s
re2 16.273 (±18.4%) i/s (61.45 ms/i) - 80.000 in 5.038373s
rust/regex 970.406 (± 0.7%) i/s (1.03 ms/i) - 4.947k in 5.098106s
Comparison:
rust/regex: 970.4 i/s
ruby: 84.1 i/s - 11.54x slower
re2: 16.3 i/s - 59.63x slower
The gnarlier context and capitals regexps perform more evenly across the engines, though the (?:.) sub-expression coupled with the bounded repeat in capitals makes rust/regex quite a bit slower.
-- [bounded-repeat/context]
Calculating -------------------------------------
ruby 3.306 (± 0.0%) i/s (302.47 ms/i) - 17.000 in 5.142134s
re2 8.342 (± 0.0%) i/s (119.88 ms/i) - 42.000 in 5.035165s
rust/regex 8.506 (± 0.0%) i/s (117.56 ms/i) - 43.000 in 5.055945s
Comparison:
rust/regex: 8.5 i/s
re2: 8.3 i/s - 1.02x slower
ruby: 3.3 i/s - 2.57x slower
-- [bounded-repeat/capitals]
Calculating -------------------------------------
ruby 14.767 (± 0.0%) i/s (67.72 ms/i) - 74.000 in 5.011244s
re2 91.607 (± 0.0%) i/s (10.92 ms/i) - 459.000 in 5.010646s
rust/regex 77.138 (± 0.0%) i/s (12.96 ms/i) - 392.000 in 5.081858s
Comparison:
re2: 91.6 i/s
rust/regex: 77.1 i/s - 1.19x slower
ruby: 14.8 i/s - 6.20x slower
Noseyparker
From rebar:
This benchmark measures how well regex engines do when asked to look for matches for many different patterns. The patterns come from the Nosey Parker project, which finds secrets and sensitive information in textual data and source repositories. Nosey Parker operates principally by defining a number of rules for detecting secrets (for example, AWS API keys), and then looking for matches of those rules in various corpora. The rules are, as you might have guessed, defined as regular expressions.
This is a particularly brutal benchmark with way too long scan times. To make it reasonable across all engines, the number of regexps was lowered to 50, and the haystack was shortened to ~7 MB.
Regexps were run one by one (and not joined together) to simulate a scenario where a reference to the matched regexp is required.
Alternatively, rust/regex and re2 support set functionality that represents a collection of regular expressions that can be searched for simultaneously. A set scan provides the indexes of the regexps that matched at least once. This smaller selection of regexps can be used as a kind of pre-scan before the full scans. Ideally, such an approach should reduce the overall search time, but only if set is significantly faster than regular regexps.
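The pre-scan idea can be sketched as a two-phase flow. In the plain-Ruby illustration below the set scan is emulated with Regexp#match? on each regexp, so it shows only the control flow and not the performance characteristics; with re2 or rust/regex the first phase would be a single set scan. The helper names, rules, and haystack are hypothetical and not taken from Nosey Parker.

```ruby
# Phase 1 (emulated set scan): indexes of the regexps that match at least once.
def matching_indexes(regexps, haystack)
  regexps.each_index.select { |i| regexps[i].match?(haystack) }
end

# Phase 2: run full scans only for the regexps selected in phase 1,
# keeping the index so each match can be traced back to its rule.
def scan_with_prescan(regexps, haystack)
  matching_indexes(regexps, haystack).to_h do |i|
    [i, haystack.scan(regexps[i])]
  end
end

# Illustrative rules only.
rules = [
  /AKIA[0-9A-Z]{16}/,
  /ghp_[0-9A-Za-z]{36}/,
  /-----BEGIN [A-Z ]+ KEY-----/
]
haystack = "key=AKIAABCDEFGHIJKLMNOP\n-----BEGIN RSA PRIVATE KEY-----\n"

result = scan_with_prescan(rules, haystack)
p result.keys # => [0, 2]
p result[0]   # => ["AKIAABCDEFGHIJKLMNOP"]
```

The second rule never reaches the full scan, which is where the savings would come from when most rules do not match.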
-- [noseyparker/default]
Calculating -------------------------------------
ruby 2.259 (± 0.0%) i/s (442.76 ms/i) - 12.000 in 5.313185s
re2 2.229 (± 0.0%) i/s (448.62 ms/i) - 12.000 in 5.383492s
rust/regex 54.259 (± 0.0%) i/s (18.43 ms/i) - 275.000 in 5.068428s
re2 set 46.733 (± 0.0%) i/s (21.40 ms/i) - 236.000 in 5.050062s
rust/regex set 0.179 (± 0.0%) i/s (5.58 s/i) - 1.000 in 5.576035s
Comparison:
rust/regex: 54.3 i/s
re2 set: 46.7 i/s - 1.16x slower
ruby: 2.3 i/s - 24.02x slower
re2: 2.2 i/s - 24.34x slower
rust/regex set: 0.2 i/s - 302.55x slower
Surprisingly, rust/regex set appeared to be extremely unoptimized compared to running the same regexps one by one. In contrast, re2 set showed significant improvement over plain re2.
Something was wrong, so I started playing with input params to find the weak point. First, I tried disabling Unicode, so the \w, \d, \s, \b matchers in rust/regex became non-Unicode aware. It improved performance to the level of sequential regexps, but I felt like it could do better.
-- [noseyparker/no-unicode]
Calculating -------------------------------------
ruby 2.261 (± 0.0%) i/s (442.30 ms/i) - 12.000 in 5.307593s
re2 2.222 (± 0.0%) i/s (449.97 ms/i) - 12.000 in 5.399727s
rust/regex 58.438 (± 0.0%) i/s (17.11 ms/i) - 295.000 in 5.048170s
re2 set 46.765 (± 0.0%) i/s (21.38 ms/i) - 236.000 in 5.046541s
rust/regex set 63.660 (± 0.0%) i/s (15.71 ms/i) - 324.000 in 5.089633s
Comparison:
rust/regex set: 63.7 i/s
rust/regex: 58.4 i/s - 1.09x slower
re2 set: 46.8 i/s - 1.36x slower
ruby: 2.3 i/s - 28.16x slower
re2: 2.2 i/s - 28.65x slower
Filtering out regexps one by one, I found that wide scopes like [^a-zA-Z0-9_-] have the same effect as \w in Unicode mode, increasing the scan time significantly, especially if included in non-capturing (?:.) groups. Removing such regexps gave another bump for rust/regex set.
-- [noseyparker/no-unicode-no-wide-scopes]
Calculating -------------------------------------
ruby 6.987 (± 0.0%) i/s (143.12 ms/i) - 35.000 in 5.009224s
re2 3.223 (± 0.0%) i/s (310.31 ms/i) - 17.000 in 5.275362s
rust/regex 103.759 (± 1.0%) i/s (9.64 ms/i) - 520.000 in 5.011819s
re2 set 94.412 (± 1.1%) i/s (10.59 ms/i) - 477.000 in 5.053396s
rust/regex set 559.751 (± 0.4%) i/s (1.79 ms/i) - 2.805k in 5.011212s
Comparison:
rust/regex set: 559.8 i/s
rust/regex: 103.8 i/s - 5.39x slower
re2 set: 94.4 i/s - 5.93x slower
ruby: 7.0 i/s - 80.11x slower
re2: 3.2 i/s - 173.70x slower
It's not practical to cherry-pick regexps or disable Unicode, so I would avoid using rust/regex set unless it is 100% tested to perform better than sequential regexps.
Engine limitations
As mentioned above, the \w, \d, \s, \b matchers behave differently with Unicode in different engines. Namely, in re2, they don't match extended Unicode characters.
RE2('(\w+)').scan("- Yes, Fräulein.").to_a.flatten
# => ["Yes", "Fr", "ulein"]
RE2('(\d+)').scan("0123٠١٢٣").to_a.flatten
# => ["0123"]
RE2('(\s)').scan(" \u200A\u2000").to_a.size
# => 1
RE2('(\b[0-9A-Za-z_]+\b)').scan("- Yes, Fräulein.").to_a.flatten
# => ["Yes", "Fr", "ulein"]
\w, \d, \s are not Unicode aware in ruby either, but you have an option to use [[:alpha:]], [[:digit:]], [[:space:]] instead. \b works in Unicode mode by default.
"- Yes, Fräulein.".scan(/\w+/)
# => ["Yes", "Fr", "ulein"]
"- Yes, Fräulein.".scan(/[[:alpha:]]+/)
# => ["Yes", "Fräulein"]
"0123٠١٢٣".scan(/\d+/)
# => ["0123"]
"0123٠١٢٣".scan(/[[:digit:]]+/)
# => ["0123٠١٢٣"]
" \u200A\u2000".scan(/\s/).size
# => 1
" \u200A\u2000".scan(/[[:space:]]/).size
# => 3
"- Yes, Fräulein.".scan(/\b[0-9A-Za-z_]+\b/)
# => ["Yes"]
It's funny that re2 includes [[:alpha:]], [[:digit:]], [[:space:]] scopes too, but they're strictly ASCII-only.
RE2('([[:alpha:]]+)').scan("- Yes, Fräulein.").to_a.flatten
# => ["Yes", "Fr", "ulein"]
RE2('([[:digit:]]+)').scan("0123٠١٢٣").to_a.flatten
# => ["0123"]
RE2('([[:space:]])').scan(" \u200A\u2000").to_a.size
# => 1
In rust/regex all 4 are Unicode aware by default, with the option to fall back to ASCII mode.
RustRegexp.new('\w+').scan("- Yes, Fräulein.")
# => ["Yes", "Fräulein"]
RustRegexp.new('\w+', unicode: false).scan("- Yes, Fräulein.")
# => ["Yes", "Fr", "ulein"]
RustRegexp.new('\d+').scan("0123٠١٢٣")
# => ["0123٠١٢٣"]
RustRegexp.new('\d+', unicode: false).scan("0123٠١٢٣")
# => ["0123"]
RustRegexp.new('\s').scan(" \u200A\u2000").size
# => 3
RustRegexp.new('\s', unicode: false).scan(" \u200A\u2000").size
# => 1
RustRegexp.new('\b[0-9A-Za-z_]+\b').scan("- Yes, Fräulein.")
# => ["Yes"]
RustRegexp.new('\b[0-9A-Za-z_]+\b', unicode: false).scan("- Yes, Fräulein.")
# => ["Yes", "Fr", "ulein"]
Another inconvenience with re2 was found with bounded repeats with a really high max: it does not support values higher than 1000.
RE2('.{0,1000}').ok?
# => true
RE2('.{0,1001}').ok?
# => false
# => invalid repetition size: {0,1001}
In ruby, the max limit is 100000.
/.{0,100000}/
# => ok
/.{0,100001}/
# => too big number for repeat range
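When patterns come from configuration or user input, the ruby limit above can be probed at runtime by building the regexp with Regexp.new and rescuing RegexpError. This is a minimal sketch; compilable? is a hypothetical helper name. Building from a string matters here, since a regexp literal is checked when the file is parsed rather than when the line runs.

```ruby
# Hypothetical helper: true when Ruby can compile the given pattern source.
def compilable?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

p compilable?('.{0,100000}') # => true
p compilable?('.{0,100001}') # => false
```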
In rust/regex, there is a limit of 10485760 bytes (10 MiB) per compiled regexp. The bounded repeat max depends on the character class and encoding mode (Unicode vs ASCII).
RustRegexp.new('.{0,10082}')
# => ok
RustRegexp.new('.{0,10083}')
# => ArgumentError: Compiled regex exceeds size limit of 10485760 bytes.
RustRegexp.new('.{0,87379}', unicode: false)
# => ok
RustRegexp.new('.{0,87380}', unicode: false)
# => ArgumentError: Compiled regex exceeds size limit of 10485760 bytes.
RustRegexp.new('\w{0,209}')
# => ok
RustRegexp.new('\w{0,210}')
# => ArgumentError: Compiled regex exceeds size limit of 10485760 bytes.
RustRegexp.new('\w{0,77099}', unicode: false)
# => ok
RustRegexp.new('\w{0,77100}', unicode: false)
# => ArgumentError: Compiled regex exceeds size limit of 10485760 bytes.
Another nuance was found in ruby, which cannot scan a haystack with invalid UTF-8 byte sequences.
haystack = "\xfc\xa1\xa1\xa1\xa1\xa1abc"
haystack.scan(/.+/)
# => ArgumentError: invalid byte sequence in UTF-8 (ArgumentError)
RE2('(.+)').scan(haystack)
# => ["abc"]
RustRegexp.new('.+').scan(haystack)
# => ["abc"]
rust_regexp is built on top of the regex::bytes API, which makes parsing of invalid UTF-8 possible. The default regex API would fail similarly to ruby.
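If staying with ruby is a requirement, the haystack has to be repaired or reinterpreted before scanning. The sketch below shows two standard-library options: dropping the invalid bytes with String#scrub, or treating the haystack as binary with String#b. Both change what is being matched (removed bytes in the first case, bytes instead of characters in the second), so neither is equivalent to what the other two engines do. The helper names are hypothetical.

```ruby
# Option 1: drop invalid byte sequences, then scan as regular UTF-8.
def scan_scrubbed(haystack, regexp)
  haystack.scrub("").scan(regexp)
end

# Option 2: reinterpret the haystack as binary and scan bytes.
# Only suitable for patterns that are meaningful at the byte level.
def scan_binary(haystack, regexp)
  haystack.b.scan(regexp)
end

haystack = "\xfc\xa1\xa1\xa1\xa1\xa1abc"

p haystack.valid_encoding?           # => false
p scan_scrubbed(haystack, /.+/)      # => ["abc"]
p scan_binary(haystack, /[a-z]+/)    # => ["abc"]
```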
Conclusions
- re2 provides a substantial performance improvement over ruby in all cases except those involving Unicode text.
- re2 has limitations with Unicode awareness of specific matchers.
- re2 set is always faster than re2 with sequential regexps.
- rust_regexp is the fastest alternative to ruby overall, with no Unicode concerns on a per-regexp basis.
- rust_regexp with sequential regexps is faster than re2 set.
- rust_regexp set is very picky about regexps and should be used with careful consideration. Otherwise, abysmal performance can be encountered to the point of being ~300x slower than rust_regexp with sequential regexps.
- both re2 and rust_regexp can be used for parsing strings with invalid UTF-8 byte sequences; ruby can't do that.