# re2 [![Build Status](https://github.com/mudge/re2/actions/workflows/tests.yml/badge.svg?branch=main)](https://github.com/mudge/re2/actions)

Ruby bindings to [RE2][], a "fast, safe, thread-friendly alternative to
backtracking regular expression engines like those used in PCRE, Perl, and
Python".

**Current version:** 2.5.0  
**Bundled RE2 version:** libre2.11 (2023-11-01)

```ruby
RE2('h.*o').full_match?("hello")    #=> true
RE2('e').full_match?("hello")       #=> false
RE2('h.*o').partial_match?("hello") #=> true
RE2('e').partial_match?("hello")    #=> true
RE2('(\w+):(\d+)').full_match("ruby:1234")
#=> #<RE2::MatchData "ruby:1234" 1:"ruby" 2:"1234">
```

## Table of Contents

* [Why RE2?](#why-re2)
* [Usage](#usage)
    * [Compiling regular expressions](#compiling-regular-expressions)
    * [Matching interface](#matching-interface)
    * [Submatch extraction](#submatch-extraction)
    * [Scanning text incrementally](#scanning-text-incrementally)
    * [Searching simultaneously](#searching-simultaneously)
    * [Encoding](#encoding)
* [Requirements](#requirements)
    * [Native gems](#native-gems)
    * [Installing the `ruby` platform gem](#installing-the-ruby-platform-gem)
    * [Using system libraries](#using-system-libraries)
* [Thanks](#thanks)
* [Contact](#contact)
* [License](#license)
    * [Dependencies](#dependencies)

## Why RE2?

> RE2 was designed and implemented with an explicit goal of being able to
> handle regular expressions from untrusted users without risk. One of its
> primary guarantees is that the match time is linear in the length of the
> input string. It was also written with production concerns in mind: the
> parser, the compiler and the execution engines limit their memory usage by
> working within a configurable budget – failing gracefully when exhausted –
> and they avoid stack overflow by eschewing recursion.
>
> — [Why RE2?](https://github.com/google/re2/wiki/WhyRE2)

## Usage

Install re2 as a dependency:

```ruby
# In your Gemfile
gem "re2"

# Or without Bundler
gem install re2
```

Include in your code:

```ruby
require "re2"
```

Full API documentation automatically generated from the latest version is
available at https://mudge.name/re2/.

While re2 uses the same naming scheme as Ruby's built-in regular expression
library (with [`Regexp`](https://mudge.name/re2/RE2/Regexp.html) and
[`MatchData`](https://mudge.name/re2/RE2/MatchData.html)), its API is slightly
different:

### Compiling regular expressions

> [!WARNING]
> RE2's regular expression syntax differs from PCRE and Ruby's built-in
> [`Regexp`](https://docs.ruby-lang.org/en/3.2/Regexp.html) library, see the
> [official syntax page](https://github.com/google/re2/wiki/Syntax) for more
> details.

The core class is [`RE2::Regexp`](https://mudge.name/re2/RE2/Regexp.html) which
takes a regular expression as a string and compiles it internally into an `RE2`
object. A global function `RE2` is available to concisely compile a new
`RE2::Regexp`:

```ruby
re = RE2('(\w+):(\d+)')
#=> #<RE2::Regexp /(\w+):(\d+)/>
re.ok? #=> true

re = RE2('abc)def')
re.ok?   #=> false
re.error #=> "missing ): abc)def"
```

> [!TIP]
> Note the use of *single quotes* when passing the regular expression as
> a string to `RE2` so that the backslashes aren't interpreted as escapes.

When compiling a regular expression, an optional second argument can be used to
change RE2's default options, e.g. stop logging syntax and execution errors to
stderr with `log_errors`:

```ruby
RE2('abc)def', log_errors: false)
```

See the API documentation for
[`RE2::Regexp#initialize`](https://mudge.name/re2/RE2/Regexp.html#initialize-instance_method)
for all the available options.

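
The tip above about single quotes can be checked in plain Ruby without loading re2. The sketch below is only an illustration (the variable names are made up for this example): it compares the same pattern text written with single and double quotes. In a double-quoted literal, Ruby drops the backslash from escapes it does not recognise (such as `\w` and `\d`) and replaces the ones it does recognise (such as `\b`) with a control character.

```ruby
# Plain Ruby, no re2 required: compare how the two quoting styles
# store the same pattern text.
single = '(\w+):(\d+)'
double = "(\w+):(\d+)" # Ruby drops the backslashes from \w and \d here

puts single        # (\w+):(\d+)
puts double        # (w+):(d+)
puts single.length # 11
puts double.length # 9

# Escapes that Ruby does recognise are replaced instead: "\b" is a
# backspace character (byte 8), not the two characters \ and b.
puts '\bfoo\b'.bytes.first # 92 (a literal backslash)
puts "\bfoo\b".bytes.first # 8 (backspace)

# Doubling each backslash keeps them in a double-quoted literal.
escaped = "(\\w+):(\\d+)"
puts escaped == single # true
```

If a pattern has to be built with interpolation, and therefore with double quotes, doubling the backslashes as in `escaped` gives the same string as the single-quoted form.
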
### Matching interface

There are two main methods for matching:
[`RE2::Regexp#full_match?`](https://mudge.name/re2/RE2/Regexp.html#full_match%3F-instance_method)
requires the regular expression to match the entire input text, and
[`RE2::Regexp#partial_match?`](https://mudge.name/re2/RE2/Regexp.html#match%3F-instance_method)
looks for a match for a substring of the input text, returning a boolean to
indicate whether a match was successful or not.

```ruby
RE2('h.*o').full_match?("hello")    #=> true
RE2('e').full_match?("hello")       #=> false

RE2('h.*o').partial_match?("hello") #=> true
RE2('e').partial_match?("hello")    #=> true
```

### Submatch extraction

> [!TIP]
> Only extract the number of submatches you need as performance is improved
> with fewer submatches (with the best performance when avoiding submatch
> extraction altogether).

Both matching methods have a second form that can extract submatches as
[`RE2::MatchData`](https://mudge.name/re2/RE2/MatchData.html) objects:
[`RE2::Regexp#full_match`](https://mudge.name/re2/RE2/Regexp.html#full_match-instance_method)
and
[`RE2::Regexp#partial_match`](https://mudge.name/re2/RE2/Regexp.html#partial_match-instance_method).

```ruby
m = RE2('(\w+):(\d+)').full_match("ruby:1234")
#=> #<RE2::MatchData "ruby:1234" 1:"ruby" 2:"1234">

m[0] #=> "ruby:1234"
m[1] #=> "ruby"
m[2] #=> "1234"

m = RE2('(\w+):(\d+)').full_match("r")
#=> nil
```

`RE2::MatchData` supports retrieving submatches by numeric index or by name if
present in the regular expression:

```ruby
m = RE2('(?P<word>\w+):(?P<number>\d+)').full_match("ruby:1234")
#=> #<RE2::MatchData "ruby:1234" 1:"ruby" 2:"1234">

m["word"]   #=> "ruby"
m["number"] #=> "1234"
```

`RE2::MatchData` objects can also be used with Ruby's
[pattern matching](https://docs.ruby-lang.org/en/3.2/syntax/pattern_matching_rdoc.html):

```ruby
case RE2('(\w+):(\d+)').full_match("ruby:1234")
in [word, number]
  puts "Word: #{word}, Number: #{number}"
else
  puts "No match"
end
# Word: ruby, Number: 1234

case RE2('(?P<word>\w+):(?P<number>\d+)').full_match("ruby:1234")
in word:, number:
  puts "Word: #{word}, Number: #{number}"
else
  puts "No match"
end
# Word: ruby, Number: 1234
```

By default, both `full_match` and `partial_match` will extract all submatches
into the `RE2::MatchData` based on the number of capturing groups in the
regular expression. This can be changed by passing an optional second argument
when matching:

```ruby
m = RE2('(\w+):(\d+)').full_match("ruby:1234", submatches: 1)
#=> #<RE2::MatchData "ruby:1234" 1:"ruby">
```

> [!WARNING]
> If the regular expression has no capturing groups or you pass `submatches:
> 0`, the matching method will behave like its `full_match?` or
> `partial_match?` form and only return `true` or `false` rather than
> `RE2::MatchData`.

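
Because the return type depends on the pattern and on `submatches:`, code that accepts arbitrary patterns may want to normalise the result before indexing into it. The following is a minimal sketch in plain Ruby. `captured_groups` is a hypothetical helper, not part of re2, and it assumes that `RE2::MatchData#to_a` returns the whole match followed by each submatch. Plain values stand in for the three kinds of return value so that the sketch runs without the gem.

```ruby
# Hypothetical helper (not part of re2): turn the result of full_match or
# partial_match into an array of submatches, or nil when nothing matched.
def captured_groups(result)
  case result
  when true       then []   # matched, but no submatches were extracted
  when false, nil then nil  # no match
  else result.to_a.drop(1)  # assumed MatchData-like: drop the whole match at index 0
  end
end

# Stand-ins for the possible return values, so this runs without re2.
p captured_groups(true)                          # []
p captured_groups(nil)                           # nil
p captured_groups(["ruby:1234", "ruby", "1234"]) # ["ruby", "1234"]
```

With the gem loaded, the argument would instead be the value returned by a call such as `RE2('(\w+):(\d+)').full_match("ruby:1234")`.
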
### Scanning text incrementally

If you want to repeatedly match regular expressions from the start of some
input text, you can use
[`RE2::Regexp#scan`](https://mudge.name/re2/RE2/Regexp.html#scan-instance_method)
to return an `Enumerable`
[`RE2::Scanner`](https://mudge.name/re2/RE2/Scanner.html) object which will
lazily consume matches as you iterate over it:

```ruby
scanner = RE2('(\w+)').scan(" one two three 4")
scanner.each do |match|
  puts match.inspect
end
# ["one"]
# ["two"]
# ["three"]
# ["4"]
```

### Searching simultaneously

[`RE2::Set`](https://mudge.name/re2/RE2/Set.html) represents a collection of
regular expressions that can be searched for simultaneously. Calling
[`RE2::Set#add`](https://mudge.name/re2/RE2/Set.html#add-instance_method) with
a regular expression will return the integer index at which it is stored within
the set. After all patterns have been added, the set can be compiled using
[`RE2::Set#compile`](https://mudge.name/re2/RE2/Set.html#compile-instance_method),
and then
[`RE2::Set#match`](https://mudge.name/re2/RE2/Set.html#match-instance_method)
will return an array containing the indices of all the patterns that matched.

```ruby
set = RE2::Set.new
set.add("abc")         #=> 0
set.add("def")         #=> 1
set.add("ghi")         #=> 2
set.compile            #=> true
set.match("abcdefghi") #=> [0, 1, 2]
set.match("ghidefabc") #=> [2, 1, 0]
```

### Encoding

> [!WARNING]
> Note RE2 only supports UTF-8 and ISO-8859-1 encoding so strings will be
> returned in UTF-8 by default or ISO-8859-1 if the `:utf8` option for the
> `RE2::Regexp` is set to `false` (any other encoding's behaviour is undefined).

For backward compatibility, re2 won't automatically convert string inputs to
the right encoding, so this is the responsibility of the caller, e.g.

```ruby
# By default, RE2 will process patterns and text as UTF-8
RE2(non_utf8_pattern.encode("UTF-8")).match(non_utf8_text.encode("UTF-8"))

# If the :utf8 option is false, RE2 will process patterns and text as ISO-8859-1
RE2(non_latin1_pattern.encode("ISO-8859-1"), utf8: false).match(non_latin1_text.encode("ISO-8859-1"))
```

## Requirements

This gem requires the following to run:

* [Ruby](https://www.ruby-lang.org/en/) 2.6 to 3.3

It supports the following RE2 ABI versions:

* libre2.0 (prior to release 2020-03-02) to libre2.11 (2023-07-01 to 2023-11-01)

### Native gems

Where possible, a pre-compiled native gem will be provided for the following platforms:

* Linux `aarch64-linux` and `arm-linux` (requires [glibc](https://www.gnu.org/software/libc/) 2.29+)
* Linux `x86-linux` and `x86_64-linux` (requires [glibc](https://www.gnu.org/software/libc/) 2.17+) including [musl](https://musl.libc.org/)-based systems such as [Alpine](https://alpinelinux.org)
* macOS `x86_64-darwin` and `arm64-darwin`
* Windows `x64-mingw32` and `x64-mingw-ucrt`

### Installing the `ruby` platform gem

> [!WARNING]
> We strongly recommend using the native gems where possible to avoid the need
> for compiling the C++ extension and its dependencies which will take longer
> and be less reliable.

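
To see which platform RubyGems detects for the current Ruby, and so which of the native gems listed under "Native gems" might apply, `Gem::Platform.local` from RubyGems itself can be printed. This is only a quick diagnostic sketch. It does not check glibc or musl versions, and it is not something re2 provides.

```ruby
require "rubygems" # normally loaded already; explicit so the snippet stands alone

# The platform RubyGems uses when choosing between pre-compiled gems
# and the source-only `ruby` platform gem.
local = Gem::Platform.local

puts "Platform: #{local}" # for example x86_64-linux or arm64-darwin
puts "CPU:      #{local.cpu}"
puts "OS:       #{local.os}"
```
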
If you wish to compile the gem, you will need to explicitly install the `ruby` platform gem:

```ruby
# In your Gemfile with Bundler 2.3.18+
gem "re2", force_ruby_platform: true

# With Bundler 2.1+
bundle config set force_ruby_platform true

# With older versions of Bundler
bundle config force_ruby_platform true

# Without Bundler
gem install re2 --platform=ruby
```

You will need a full compiler toolchain for compiling Ruby C extensions (see
[Nokogiri's "The Compiler Toolchain"](https://nokogiri.org/tutorials/installing_nokogiri.html#appendix-a-the-compiler-toolchain))
plus the toolchain required for compiling the vendored version of RE2 and its
dependency [Abseil][] which includes [CMake](https://cmake.org) and a compiler
with C++14 support such as [clang](http://clang.llvm.org/) 3.4 or
[gcc](https://gcc.gnu.org/) 5. On Windows, you'll also need pkgconf 2.1.0+ to
avoid [`undefined reference`
errors](https://github.com/pkgconf/pkgconf/issues/322) when attempting to
compile Abseil.

### Using system libraries

If you already have RE2 installed, you can instruct the gem not to use its own vendored version:

```ruby
gem install re2 --platform=ruby -- --enable-system-libraries

# If RE2 is not installed in /usr/local, /usr, or /opt/homebrew:
gem install re2 --platform=ruby -- --enable-system-libraries --with-re2-dir=/path/to/re2/prefix
```

Alternatively, you can set the `RE2_USE_SYSTEM_LIBRARIES` environment variable instead of passing `--enable-system-libraries` to the `gem` command.

## Thanks

* Thanks to [Jason Woods](https://github.com/driskell) who contributed the
  original implementations of `RE2::MatchData#begin` and
  `RE2::MatchData#end`.
* Thanks to [Stefano Rivera](https://github.com/stefanor) who first
  contributed C++11 support.
* Thanks to [Stan Hu](https://github.com/stanhu) for reporting a bug with
  empty patterns and `RE2::Regexp#scan`, contributing support for libre2.11
  (2023-07-01) and for vendoring RE2 and Abseil and compiling native gems in
  2.0.
* Thanks to [Sebastian Reitenbach](https://github.com/buzzdeee) for reporting
  the deprecation and removal of the `utf8` encoding option in RE2.
* Thanks to [Sergio Medina](https://github.com/serch) for reporting a bug when
  using `RE2::Scanner#scan` with an invalid regular expression.
* Thanks to [Pritam Baral](https://github.com/pritambaral) for contributing the
  initial support for `RE2::Set`.
* Thanks to [Mike Dalessio](https://github.com/flavorjones) for reviewing the
  precompilation of native gems in 2.0.
* Thanks to [Peter Zhu](https://github.com/peterzhu2118) for
  [ruby_memcheck](https://github.com/Shopify/ruby_memcheck) and helping find
  the memory leaks fixed in 2.1.3.
* Thanks to [Jean Boussier](https://github.com/byroot) for contributing the
  switch to Ruby's `TypedData` API and the resulting garbage collection
  improvements in 2.4.0.

## Contact

All issues and suggestions should go to [GitHub Issues](https://github.com/mudge/re2/issues).

## License

This library is licensed under the BSD 3-Clause License, see `LICENSE.txt`.

Copyright © 2010, Paul Mucur.

### Dependencies

The source code of [RE2][] is distributed in the `ruby` platform gem. This code is licensed under the BSD 3-Clause License, see `LICENSE-DEPENDENCIES.txt`.

The source code of [Abseil][] is distributed in the `ruby` platform gem. This code is licensed under the Apache License 2.0, see `LICENSE-DEPENDENCIES.txt`.

  [RE2]: https://github.com/google/re2
  [Abseil]: https://abseil.io