/*! Provides [`Automaton`] trait for abstracting over Aho-Corasick automata. The `Automaton` trait provides a way to write generic code over any Aho-Corasick automaton. It also provides access to lower level APIs that permit walking the state transitions of an Aho-Corasick automaton manually. */ use alloc::{string::String, vec::Vec}; use crate::util::{ error::MatchError, primitives::PatternID, search::{Anchored, Input, Match, MatchKind, Span}, }; pub use crate::util::{ prefilter::{Candidate, Prefilter}, primitives::{StateID, StateIDError}, }; /// We seal the `Automaton` trait for now. It's a big trait, and it's /// conceivable that I might want to add new required methods, and sealing the /// trait permits doing that in a backwards compatible fashion. On other the /// hand, if you have a solid use case for implementing the trait yourself, /// please file an issue and we can discuss it. This was *mostly* done as a /// conservative step. pub(crate) mod private { pub trait Sealed {} } impl private::Sealed for crate::nfa::noncontiguous::NFA {} impl private::Sealed for crate::nfa::contiguous::NFA {} impl private::Sealed for crate::dfa::DFA {} impl<'a, T: private::Sealed + ?Sized> private::Sealed for &'a T {} /// A trait that abstracts over Aho-Corasick automata. /// /// This trait primarily exists for niche use cases such as: /// /// * Using an NFA or DFA directly, bypassing the top-level /// [`AhoCorasick`](crate::AhoCorasick) searcher. Currently, these include /// [`noncontiguous::NFA`](crate::nfa::noncontiguous::NFA), /// [`contiguous::NFA`](crate::nfa::contiguous::NFA) and /// [`dfa::DFA`](crate::dfa::DFA). /// * Implementing your own custom search routine by walking the automaton /// yourself. This might be useful for implementing search on non-contiguous /// strings or streams. /// /// For most use cases, it is not expected that users will need /// to use or even know about this trait. Indeed, the top level /// [`AhoCorasick`](crate::AhoCorasick) searcher does not expose any details /// about this trait, nor does it implement it itself. /// /// Note that this trait defines a number of default methods, such as /// [`Automaton::try_find`] and [`Automaton::try_find_iter`], which implement /// higher level search routines in terms of the lower level automata API. /// /// # Sealed /// /// Currently, this trait is sealed. That means users of this crate can write /// generic routines over this trait but cannot implement it themselves. This /// restriction may be lifted in the future, but sealing the trait permits /// adding new required methods in a backwards compatible fashion. /// /// # Special states /// /// This trait encodes a notion of "special" states in an automaton. Namely, /// a state is treated as special if it is a dead, match or start state: /// /// * A dead state is a state that cannot be left once entered. All transitions /// on a dead state lead back to itself. The dead state is meant to be treated /// as a sentinel indicating that the search should stop and return a match if /// one has been found, and nothing otherwise. /// * A match state is a state that indicates one or more patterns have /// matched. Depending on the [`MatchKind`] of the automaton, a search may /// stop once a match is seen, or it may continue looking for matches until /// it enters a dead state or sees the end of the haystack. /// * A start state is a state that a search begins in. It is useful to know /// when a search enters a start state because it may mean that a prefilter can /// be used to skip ahead and quickly look for candidate matches. Unlike dead /// and match states, it is never necessary to explicitly handle start states /// for correctness. Indeed, in this crate, implementations of `Automaton` /// will only treat start states as "special" when a prefilter is enabled and /// active. Otherwise, treating it as special has no purpose and winds up /// slowing down the overall search because it results in ping-ponging between /// the main state transition and the "special" state logic. /// /// Since checking whether a state is special by doing three different /// checks would be too expensive inside a fast search loop, the /// [`Automaton::is_special`] method is provided for quickly checking whether /// the state is special. The `Automaton::is_dead`, `Automaton::is_match` and /// `Automaton::is_start` predicates can then be used to determine which kind /// of special state it is. /// /// # Panics /// /// Most of the APIs on this trait should panic or give incorrect results /// if invalid inputs are given to it. For example, `Automaton::next_state` /// has unspecified behavior if the state ID given to it is not a valid /// state ID for the underlying automaton. Valid state IDs can only be /// retrieved in one of two ways: calling `Automaton::start_state` or calling /// `Automaton::next_state` with a valid state ID. /// /// # Safety /// /// This trait is not safe to implement so that code may rely on the /// correctness of implementations of this trait to avoid undefined behavior. /// The primary correctness guarantees are: /// /// * `Automaton::start_state` always returns a valid state ID or an error or /// panics. /// * `Automaton::next_state`, when given a valid state ID, always returns /// a valid state ID for all values of `anchored` and `byte`, or otherwise /// panics. /// /// In general, the rest of the methods on `Automaton` need to uphold their /// contracts as well. For example, `Automaton::is_dead` should only returns /// true if the given state ID is actually a dead state. /// /// Note that currently this crate does not rely on the safety property defined /// here to avoid undefined behavior. Instead, this was done to make it /// _possible_ to do in the future. /// /// # Example /// /// This example shows how one might implement a basic but correct search /// routine. We keep things simple by not using prefilters or worrying about /// anchored searches, but do make sure our search is correct for all possible /// [`MatchKind`] semantics. (The comments in the code below note the parts /// that are needed to support certain `MatchKind` semantics.) /// /// ``` /// use aho_corasick::{ /// automaton::Automaton, /// nfa::noncontiguous::NFA, /// Anchored, Match, MatchError, MatchKind, /// }; /// /// // Run an unanchored search for 'aut' in 'haystack'. Return the first match /// // seen according to the automaton's match semantics. This returns an error /// // if the given automaton does not support unanchored searches. /// fn find( /// aut: A, /// haystack: &[u8], /// ) -> Result, MatchError> { /// let mut sid = aut.start_state(Anchored::No)?; /// let mut at = 0; /// let mut mat = None; /// let get_match = |sid, at| { /// let pid = aut.match_pattern(sid, 0); /// let len = aut.pattern_len(pid); /// Match::new(pid, (at - len)..at) /// }; /// // Start states can be match states! /// if aut.is_match(sid) { /// mat = Some(get_match(sid, at)); /// // Standard semantics require matches to be reported as soon as /// // they're seen. Otherwise, we continue until we see a dead state /// // or the end of the haystack. /// if matches!(aut.match_kind(), MatchKind::Standard) { /// return Ok(mat); /// } /// } /// while at < haystack.len() { /// sid = aut.next_state(Anchored::No, sid, haystack[at]); /// if aut.is_special(sid) { /// if aut.is_dead(sid) { /// return Ok(mat); /// } else if aut.is_match(sid) { /// mat = Some(get_match(sid, at + 1)); /// // As above, standard semantics require that we return /// // immediately once a match is found. /// if matches!(aut.match_kind(), MatchKind::Standard) { /// return Ok(mat); /// } /// } /// } /// at += 1; /// } /// Ok(mat) /// } /// /// // Show that it works for standard searches. /// let nfa = NFA::new(&["samwise", "sam"]).unwrap(); /// assert_eq!(Some(Match::must(1, 0..3)), find(&nfa, b"samwise")?); /// /// // But also works when using leftmost-first. Notice how the match result /// // has changed! /// let nfa = NFA::builder() /// .match_kind(MatchKind::LeftmostFirst) /// .build(&["samwise", "sam"]) /// .unwrap(); /// assert_eq!(Some(Match::must(0, 0..7)), find(&nfa, b"samwise")?); /// /// # Ok::<(), Box>(()) /// ``` pub unsafe trait Automaton: private::Sealed { /// Returns the starting state for the given anchor mode. /// /// Upon success, the state ID returned is guaranteed to be valid for /// this automaton. /// /// # Errors /// /// This returns an error when the given search configuration is not /// supported by the underlying automaton. For example, if the underlying /// automaton only supports unanchored searches but the given configuration /// was set to an anchored search, then this must return an error. fn start_state(&self, anchored: Anchored) -> Result; /// Performs a state transition from `sid` for `byte` and returns the next /// state. /// /// `anchored` should be [`Anchored::Yes`] when executing an anchored /// search and [`Anchored::No`] otherwise. For some implementations of /// `Automaton`, it is required to know whether the search is anchored /// or not in order to avoid following failure transitions. Other /// implementations may ignore `anchored` altogether and depend on /// `Automaton::start_state` returning a state that walks a different path /// through the automaton depending on whether the search is anchored or /// not. /// /// # Panics /// /// This routine may panic or return incorrect results when the given state /// ID is invalid. A state ID is valid if and only if: /// /// 1. It came from a call to `Automaton::start_state`, or /// 2. It came from a previous call to `Automaton::next_state` with a /// valid state ID. /// /// Implementations must treat all possible values of `byte` as valid. /// /// Implementations may panic on unsupported values of `anchored`, but are /// not required to do so. fn next_state( &self, anchored: Anchored, sid: StateID, byte: u8, ) -> StateID; /// Returns true if the given ID represents a "special" state. A special /// state is a dead, match or start state. /// /// Note that implementations may choose to return false when the given ID /// corresponds to a start state. Namely, it always correct to treat start /// states as non-special. Implementations must return true for states that /// are dead or contain matches. /// /// This has unspecified behavior when given an invalid state ID. fn is_special(&self, sid: StateID) -> bool; /// Returns true if the given ID represents a dead state. /// /// A dead state is a type of "sink" in a finite state machine. It /// corresponds to a state whose transitions all loop back to itself. That /// is, once entered, it can never be left. In practice, it serves as a /// sentinel indicating that the search should terminate. /// /// This has unspecified behavior when given an invalid state ID. fn is_dead(&self, sid: StateID) -> bool; /// Returns true if the given ID represents a match state. /// /// A match state is always associated with one or more pattern IDs that /// matched at the position in the haystack when the match state was /// entered. When a match state is entered, the match semantics dictate /// whether it should be returned immediately (for `MatchKind::Standard`) /// or if the search should continue (for `MatchKind::LeftmostFirst` and /// `MatchKind::LeftmostLongest`) until a dead state is seen or the end of /// the haystack has been reached. /// /// This has unspecified behavior when given an invalid state ID. fn is_match(&self, sid: StateID) -> bool; /// Returns true if the given ID represents a start state. /// /// While it is never incorrect to ignore start states during a search /// (except for the start of the search of course), knowing whether one has /// entered a start state can be useful for certain classes of performance /// optimizations. For example, if one is in a start state, it may be legal /// to try to skip ahead and look for match candidates more quickly than /// would otherwise be accomplished by walking the automaton. /// /// Implementations of `Automaton` in this crate "unspecialize" start /// states when a prefilter is not active or enabled. In this case, it /// is possible for `Automaton::is_special(sid)` to return false while /// `Automaton::is_start(sid)` returns true. /// /// This has unspecified behavior when given an invalid state ID. fn is_start(&self, sid: StateID) -> bool; /// Returns the match semantics that this automaton was built with. fn match_kind(&self) -> MatchKind; /// Returns the total number of matches for the given state ID. /// /// This has unspecified behavior if the given ID does not refer to a match /// state. fn match_len(&self, sid: StateID) -> usize; /// Returns the pattern ID for the match state given by `sid` at the /// `index` given. /// /// Typically, `index` is only ever greater than `0` when implementing an /// overlapping search. Otherwise, it's likely that your search only cares /// about reporting the first pattern ID in a match state. /// /// This has unspecified behavior if the given ID does not refer to a match /// state, or if the index is greater than or equal to the total number of /// matches in this match state. fn match_pattern(&self, sid: StateID, index: usize) -> PatternID; /// Returns the total number of patterns compiled into this automaton. fn patterns_len(&self) -> usize; /// Returns the length of the pattern for the given ID. /// /// This has unspecified behavior when given an invalid pattern /// ID. A pattern ID is valid if and only if it is less than /// `Automaton::patterns_len`. fn pattern_len(&self, pid: PatternID) -> usize; /// Returns the length, in bytes, of the shortest pattern in this /// automaton. fn min_pattern_len(&self) -> usize; /// Returns the length, in bytes, of the longest pattern in this automaton. fn max_pattern_len(&self) -> usize; /// Returns the heap memory usage, in bytes, used by this automaton. fn memory_usage(&self) -> usize; /// Returns a prefilter, if available, that can be used to accelerate /// searches for this automaton. /// /// The typical way this is used is when the start state is entered during /// a search. When that happens, one can use a prefilter to skip ahead and /// look for candidate matches without having to walk the automaton on the /// bytes between candidates. /// /// Typically a prefilter is only available when there are a small (<100) /// number of patterns built into the automaton. fn prefilter(&self) -> Option<&Prefilter>; /// Executes a non-overlapping search with this automaton using the given /// configuration. /// /// See /// [`AhoCorasick::try_find`](crate::AhoCorasick::try_find) /// for more documentation and examples. fn try_find( &self, input: &Input<'_>, ) -> Result, MatchError> { try_find_fwd(&self, input) } /// Executes a overlapping search with this automaton using the given /// configuration. /// /// See /// [`AhoCorasick::try_find_overlapping`](crate::AhoCorasick::try_find_overlapping) /// for more documentation and examples. fn try_find_overlapping( &self, input: &Input<'_>, state: &mut OverlappingState, ) -> Result<(), MatchError> { try_find_overlapping_fwd(&self, input, state) } /// Returns an iterator of non-overlapping matches with this automaton /// using the given configuration. /// /// See /// [`AhoCorasick::try_find_iter`](crate::AhoCorasick::try_find_iter) /// for more documentation and examples. fn try_find_iter<'a, 'h>( &'a self, input: Input<'h>, ) -> Result, MatchError> where Self: Sized, { FindIter::new(self, input) } /// Returns an iterator of overlapping matches with this automaton /// using the given configuration. /// /// See /// [`AhoCorasick::try_find_overlapping_iter`](crate::AhoCorasick::try_find_overlapping_iter) /// for more documentation and examples. fn try_find_overlapping_iter<'a, 'h>( &'a self, input: Input<'h>, ) -> Result, MatchError> where Self: Sized, { if !self.match_kind().is_standard() { return Err(MatchError::unsupported_overlapping( self.match_kind(), )); } // We might consider lifting this restriction. The reason why I added // it was to ban the combination of "anchored search" and "overlapping // iteration." The match semantics aren't totally clear in that case. // Should we allow *any* matches that are adjacent to *any* previous // match? Or only following the most recent one? Or only matches // that start at the beginning of the search? We might also elect to // just keep this restriction in place, as callers should be able to // implement it themselves if they want to. if input.get_anchored().is_anchored() { return Err(MatchError::invalid_input_anchored()); } let _ = self.start_state(input.get_anchored())?; let state = OverlappingState::start(); Ok(FindOverlappingIter { aut: self, input, state }) } /// Replaces all non-overlapping matches in `haystack` with /// strings from `replace_with` depending on the pattern that /// matched. The `replace_with` slice must have length equal to /// `Automaton::patterns_len`. /// /// See /// [`AhoCorasick::try_replace_all`](crate::AhoCorasick::try_replace_all) /// for more documentation and examples. fn try_replace_all( &self, haystack: &str, replace_with: &[B], ) -> Result where Self: Sized, B: AsRef, { assert_eq!( replace_with.len(), self.patterns_len(), "replace_all requires a replacement for every pattern \ in the automaton" ); let mut dst = String::with_capacity(haystack.len()); self.try_replace_all_with(haystack, &mut dst, |mat, _, dst| { dst.push_str(replace_with[mat.pattern()].as_ref()); true })?; Ok(dst) } /// Replaces all non-overlapping matches in `haystack` with /// strings from `replace_with` depending on the pattern that /// matched. The `replace_with` slice must have length equal to /// `Automaton::patterns_len`. /// /// See /// [`AhoCorasick::try_replace_all_bytes`](crate::AhoCorasick::try_replace_all_bytes) /// for more documentation and examples. fn try_replace_all_bytes( &self, haystack: &[u8], replace_with: &[B], ) -> Result, MatchError> where Self: Sized, B: AsRef<[u8]>, { assert_eq!( replace_with.len(), self.patterns_len(), "replace_all requires a replacement for every pattern \ in the automaton" ); let mut dst = Vec::with_capacity(haystack.len()); self.try_replace_all_with_bytes(haystack, &mut dst, |mat, _, dst| { dst.extend(replace_with[mat.pattern()].as_ref()); true })?; Ok(dst) } /// Replaces all non-overlapping matches in `haystack` by calling the /// `replace_with` closure given. /// /// See /// [`AhoCorasick::try_replace_all_with`](crate::AhoCorasick::try_replace_all_with) /// for more documentation and examples. fn try_replace_all_with( &self, haystack: &str, dst: &mut String, mut replace_with: F, ) -> Result<(), MatchError> where Self: Sized, F: FnMut(&Match, &str, &mut String) -> bool, { let mut last_match = 0; for m in self.try_find_iter(Input::new(haystack))? { // Since there are no restrictions on what kinds of patterns are // in an Aho-Corasick automaton, we might get matches that split // a codepoint, or even matches of a partial codepoint. When that // happens, we just skip the match. if !haystack.is_char_boundary(m.start()) || !haystack.is_char_boundary(m.end()) { continue; } dst.push_str(&haystack[last_match..m.start()]); last_match = m.end(); if !replace_with(&m, &haystack[m.start()..m.end()], dst) { break; }; } dst.push_str(&haystack[last_match..]); Ok(()) } /// Replaces all non-overlapping matches in `haystack` by calling the /// `replace_with` closure given. /// /// See /// [`AhoCorasick::try_replace_all_with_bytes`](crate::AhoCorasick::try_replace_all_with_bytes) /// for more documentation and examples. fn try_replace_all_with_bytes( &self, haystack: &[u8], dst: &mut Vec, mut replace_with: F, ) -> Result<(), MatchError> where Self: Sized, F: FnMut(&Match, &[u8], &mut Vec) -> bool, { let mut last_match = 0; for m in self.try_find_iter(Input::new(haystack))? { dst.extend(&haystack[last_match..m.start()]); last_match = m.end(); if !replace_with(&m, &haystack[m.start()..m.end()], dst) { break; }; } dst.extend(&haystack[last_match..]); Ok(()) } /// Returns an iterator of non-overlapping matches with this automaton /// from the stream given. /// /// See /// [`AhoCorasick::try_stream_find_iter`](crate::AhoCorasick::try_stream_find_iter) /// for more documentation and examples. #[cfg(feature = "std")] fn try_stream_find_iter<'a, R: std::io::Read>( &'a self, rdr: R, ) -> Result, MatchError> where Self: Sized, { Ok(StreamFindIter { it: StreamChunkIter::new(self, rdr)? }) } /// Replaces all non-overlapping matches in `rdr` with strings from /// `replace_with` depending on the pattern that matched, and writes the /// result to `wtr`. The `replace_with` slice must have length equal to /// `Automaton::patterns_len`. /// /// See /// [`AhoCorasick::try_stream_replace_all`](crate::AhoCorasick::try_stream_replace_all) /// for more documentation and examples. #[cfg(feature = "std")] fn try_stream_replace_all( &self, rdr: R, wtr: W, replace_with: &[B], ) -> std::io::Result<()> where Self: Sized, R: std::io::Read, W: std::io::Write, B: AsRef<[u8]>, { assert_eq!( replace_with.len(), self.patterns_len(), "streaming replace_all requires a replacement for every pattern \ in the automaton", ); self.try_stream_replace_all_with(rdr, wtr, |mat, _, wtr| { wtr.write_all(replace_with[mat.pattern()].as_ref()) }) } /// Replaces all non-overlapping matches in `rdr` by calling the /// `replace_with` closure given and writing the result to `wtr`. /// /// See /// [`AhoCorasick::try_stream_replace_all_with`](crate::AhoCorasick::try_stream_replace_all_with) /// for more documentation and examples. #[cfg(feature = "std")] fn try_stream_replace_all_with( &self, rdr: R, mut wtr: W, mut replace_with: F, ) -> std::io::Result<()> where Self: Sized, R: std::io::Read, W: std::io::Write, F: FnMut(&Match, &[u8], &mut W) -> std::io::Result<()>, { let mut it = StreamChunkIter::new(self, rdr).map_err(|e| { let kind = std::io::ErrorKind::Other; std::io::Error::new(kind, e) })?; while let Some(result) = it.next() { let chunk = result?; match chunk { StreamChunk::NonMatch { bytes, .. } => { wtr.write_all(bytes)?; } StreamChunk::Match { bytes, mat } => { replace_with(&mat, bytes, &mut wtr)?; } } } Ok(()) } } // SAFETY: This just defers to the underlying 'AcAutomaton' and thus inherits // its safety properties. unsafe impl<'a, A: Automaton + ?Sized> Automaton for &'a A { #[inline(always)] fn start_state(&self, anchored: Anchored) -> Result { (**self).start_state(anchored) } #[inline(always)] fn next_state( &self, anchored: Anchored, sid: StateID, byte: u8, ) -> StateID { (**self).next_state(anchored, sid, byte) } #[inline(always)] fn is_special(&self, sid: StateID) -> bool { (**self).is_special(sid) } #[inline(always)] fn is_dead(&self, sid: StateID) -> bool { (**self).is_dead(sid) } #[inline(always)] fn is_match(&self, sid: StateID) -> bool { (**self).is_match(sid) } #[inline(always)] fn is_start(&self, sid: StateID) -> bool { (**self).is_start(sid) } #[inline(always)] fn match_kind(&self) -> MatchKind { (**self).match_kind() } #[inline(always)] fn match_len(&self, sid: StateID) -> usize { (**self).match_len(sid) } #[inline(always)] fn match_pattern(&self, sid: StateID, index: usize) -> PatternID { (**self).match_pattern(sid, index) } #[inline(always)] fn patterns_len(&self) -> usize { (**self).patterns_len() } #[inline(always)] fn pattern_len(&self, pid: PatternID) -> usize { (**self).pattern_len(pid) } #[inline(always)] fn min_pattern_len(&self) -> usize { (**self).min_pattern_len() } #[inline(always)] fn max_pattern_len(&self) -> usize { (**self).max_pattern_len() } #[inline(always)] fn memory_usage(&self) -> usize { (**self).memory_usage() } #[inline(always)] fn prefilter(&self) -> Option<&Prefilter> { (**self).prefilter() } } /// Represents the current state of an overlapping search. /// /// This is used for overlapping searches since they need to know something /// about the previous search. For example, when multiple patterns match at the /// same position, this state tracks the last reported pattern so that the next /// search knows whether to report another matching pattern or continue with /// the search at the next position. Additionally, it also tracks which state /// the last search call terminated in and the current offset of the search /// in the haystack. /// /// This type provides limited introspection capabilities. The only thing a /// caller can do is construct it and pass it around to permit search routines /// to use it to track state, and to ask whether a match has been found. /// /// Callers should always provide a fresh state constructed via /// [`OverlappingState::start`] when starting a new search. That same state /// should be reused for subsequent searches on the same `Input`. The state /// given will advance through the haystack itself. Callers can detect the end /// of a search when neither an error nor a match is returned. /// /// # Example /// /// This example shows how to manually iterate over all overlapping matches. If /// you need this, you might consider using /// [`AhoCorasick::find_overlapping_iter`](crate::AhoCorasick::find_overlapping_iter) /// instead, but this shows how to correctly use an `OverlappingState`. /// /// ``` /// use aho_corasick::{ /// automaton::OverlappingState, /// AhoCorasick, Input, Match, /// }; /// /// let patterns = &["append", "appendage", "app"]; /// let haystack = "append the app to the appendage"; /// /// let ac = AhoCorasick::new(patterns).unwrap(); /// let mut state = OverlappingState::start(); /// let mut matches = vec![]; /// /// loop { /// ac.find_overlapping(haystack, &mut state); /// let mat = match state.get_match() { /// None => break, /// Some(mat) => mat, /// }; /// matches.push(mat); /// } /// let expected = vec![ /// Match::must(2, 0..3), /// Match::must(0, 0..6), /// Match::must(2, 11..14), /// Match::must(2, 22..25), /// Match::must(0, 22..28), /// Match::must(1, 22..31), /// ]; /// assert_eq!(expected, matches); /// ``` #[derive(Clone, Debug)] pub struct OverlappingState { /// The match reported by the most recent overlapping search to use this /// state. /// /// If a search does not find any matches, then it is expected to clear /// this value. mat: Option, /// The state ID of the state at which the search was in when the call /// terminated. When this is a match state, `last_match` must be set to a /// non-None value. /// /// A `None` value indicates the start state of the corresponding /// automaton. We cannot use the actual ID, since any one automaton may /// have many start states, and which one is in use depends on search-time /// factors (such as whether the search is anchored or not). id: Option, /// The position of the search. /// /// When `id` is None (i.e., we are starting a search), this is set to /// the beginning of the search as given by the caller regardless of its /// current value. Subsequent calls to an overlapping search pick up at /// this offset. at: usize, /// The index into the matching patterns of the next match to report if the /// current state is a match state. Note that this may be 1 greater than /// the total number of matches to report for the current match state. (In /// which case, no more matches should be reported at the current position /// and the search should advance to the next position.) next_match_index: Option, } impl OverlappingState { /// Create a new overlapping state that begins at the start state. pub fn start() -> OverlappingState { OverlappingState { mat: None, id: None, at: 0, next_match_index: None } } /// Return the match result of the most recent search to execute with this /// state. /// /// Every search will clear this result automatically, such that if no /// match is found, this will always correctly report `None`. pub fn get_match(&self) -> Option { self.mat } } /// An iterator of non-overlapping matches in a particular haystack. /// /// This iterator yields matches according to the [`MatchKind`] used by this /// automaton. /// /// This iterator is constructed via the [`Automaton::try_find_iter`] method. /// /// The type variable `A` refers to the implementation of the [`Automaton`] /// trait used to execute the search. /// /// The lifetime `'a` refers to the lifetime of the [`Automaton`] /// implementation. /// /// The lifetime `'h` refers to the lifetime of the haystack being searched. #[derive(Debug)] pub struct FindIter<'a, 'h, A> { /// The automaton used to drive the search. aut: &'a A, /// The input parameters to give to each search call. /// /// The start position of the search is mutated during iteration. input: Input<'h>, /// Records the end offset of the most recent match. This is necessary to /// handle a corner case for preventing empty matches from overlapping with /// the ending bounds of a prior match. last_match_end: Option, } impl<'a, 'h, A: Automaton> FindIter<'a, 'h, A> { /// Creates a new non-overlapping iterator. If the given automaton would /// return an error on a search with the given input configuration, then /// that error is returned here. fn new( aut: &'a A, input: Input<'h>, ) -> Result, MatchError> { // The only way this search can fail is if we cannot retrieve the start // state. e.g., Asking for an anchored search when only unanchored // searches are supported. let _ = aut.start_state(input.get_anchored())?; Ok(FindIter { aut, input, last_match_end: None }) } /// Executes a search and returns a match if one is found. /// /// This does not advance the input forward. It just executes a search /// based on the current configuration/offsets. fn search(&self) -> Option { // The unwrap is OK here because we check at iterator construction time // that no subsequent search call (using the same configuration) will // ever return an error. self.aut .try_find(&self.input) .expect("already checked that no match error can occur") } /// Handles the special case of an empty match by ensuring that 1) the /// iterator always advances and 2) empty matches never overlap with other /// matches. /// /// (1) is necessary because we principally make progress by setting the /// starting location of the next search to the ending location of the last /// match. But if a match is empty, then this results in a search that does /// not advance and thus does not terminate. /// /// (2) is not strictly necessary, but makes intuitive sense and matches /// the presiding behavior of most general purpose regex engines. /// (Obviously this crate isn't a regex engine, but we choose to match /// their semantics.) The "intuitive sense" here is that we want to report /// NON-overlapping matches. So for example, given the patterns 'a' and /// '' (an empty string) against the haystack 'a', without the special /// handling, you'd get the matches [0, 1) and [1, 1), where the latter /// overlaps with the end bounds of the former. /// /// Note that we mark this cold and forcefully prevent inlining because /// handling empty matches like this is extremely rare and does require /// quite a bit of code, comparatively. Keeping this code out of the main /// iterator function keeps it smaller and more amenable to inlining /// itself. #[cold] #[inline(never)] fn handle_overlapping_empty_match( &mut self, mut m: Match, ) -> Option { assert!(m.is_empty()); if Some(m.end()) == self.last_match_end { self.input.set_start(self.input.start().checked_add(1).unwrap()); m = self.search()?; } Some(m) } } impl<'a, 'h, A: Automaton> Iterator for FindIter<'a, 'h, A> { type Item = Match; #[inline(always)] fn next(&mut self) -> Option { let mut m = self.search()?; if m.is_empty() { m = self.handle_overlapping_empty_match(m)?; } self.input.set_start(m.end()); self.last_match_end = Some(m.end()); Some(m) } } /// An iterator of overlapping matches in a particular haystack. /// /// This iterator will report all possible matches in a particular haystack, /// even when the matches overlap. /// /// This iterator is constructed via the /// [`Automaton::try_find_overlapping_iter`] method. /// /// The type variable `A` refers to the implementation of the [`Automaton`] /// trait used to execute the search. /// /// The lifetime `'a` refers to the lifetime of the [`Automaton`] /// implementation. /// /// The lifetime `'h` refers to the lifetime of the haystack being searched. #[derive(Debug)] pub struct FindOverlappingIter<'a, 'h, A> { aut: &'a A, input: Input<'h>, state: OverlappingState, } impl<'a, 'h, A: Automaton> Iterator for FindOverlappingIter<'a, 'h, A> { type Item = Match; #[inline(always)] fn next(&mut self) -> Option { self.aut .try_find_overlapping(&self.input, &mut self.state) .expect("already checked that no match error can occur here"); self.state.get_match() } } /// An iterator that reports matches in a stream. /// /// This iterator yields elements of type `io::Result`, where an error /// is reported if there was a problem reading from the underlying stream. /// The iterator terminates only when the underlying stream reaches `EOF`. /// /// This iterator is constructed via the [`Automaton::try_stream_find_iter`] /// method. /// /// The type variable `A` refers to the implementation of the [`Automaton`] /// trait used to execute the search. /// /// The type variable `R` refers to the `io::Read` stream that is being read /// from. /// /// The lifetime `'a` refers to the lifetime of the [`Automaton`] /// implementation. #[cfg(feature = "std")] #[derive(Debug)] pub struct StreamFindIter<'a, A, R> { it: StreamChunkIter<'a, A, R>, } #[cfg(feature = "std")] impl<'a, A: Automaton, R: std::io::Read> Iterator for StreamFindIter<'a, A, R> { type Item = std::io::Result; fn next(&mut self) -> Option> { loop { match self.it.next() { None => return None, Some(Err(err)) => return Some(Err(err)), Some(Ok(StreamChunk::NonMatch { .. })) => {} Some(Ok(StreamChunk::Match { mat, .. })) => { return Some(Ok(mat)); } } } } } /// An iterator that reports matches in a stream. /// /// (This doesn't actually implement the `Iterator` trait because it returns /// something with a lifetime attached to a buffer it owns, but that's OK. It /// still has a `next` method and is iterator-like enough to be fine.) /// /// This iterator yields elements of type `io::Result`, where /// an error is reported if there was a problem reading from the underlying /// stream. The iterator terminates only when the underlying stream reaches /// `EOF`. /// /// The idea here is that each chunk represents either a match or a non-match, /// and if you concatenated all of the chunks together, you'd reproduce the /// entire contents of the stream, byte-for-byte. /// /// This chunk machinery is a bit complicated and it isn't strictly required /// for a stream searcher that just reports matches. But we do need something /// like this to deal with the "replacement" API, which needs to know which /// chunks it can copy and which it needs to replace. #[cfg(feature = "std")] #[derive(Debug)] struct StreamChunkIter<'a, A, R> { /// The underlying automaton to do the search. aut: &'a A, /// The source of bytes we read from. rdr: R, /// A roll buffer for managing bytes from `rdr`. Basically, this is used /// to handle the case of a match that is split by two different /// calls to `rdr.read()`. This isn't strictly needed if all we needed to /// do was report matches, but here we are reporting chunks of non-matches /// and matches and in order to do that, we really just cannot treat our /// stream as non-overlapping blocks of bytes. We need to permit some /// overlap while we retain bytes from a previous `read` call in memory. buf: crate::util::buffer::Buffer, /// The unanchored starting state of this automaton. start: StateID, /// The state of the automaton. sid: StateID, /// The absolute position over the entire stream. absolute_pos: usize, /// The position we're currently at within `buf`. buffer_pos: usize, /// The buffer position of the end of the bytes that we last returned /// to the caller. Basically, whenever we find a match, we look to see if /// there is a difference between where the match started and the position /// of the last byte we returned to the caller. If there's a difference, /// then we need to return a 'NonMatch' chunk. buffer_reported_pos: usize, } #[cfg(feature = "std")] impl<'a, A: Automaton, R: std::io::Read> StreamChunkIter<'a, A, R> { fn new( aut: &'a A, rdr: R, ) -> Result, MatchError> { // This restriction is a carry-over from older versions of this crate. // I didn't have the bandwidth to think through how to handle, say, // leftmost-first or leftmost-longest matching, but... it should be // possible? The main problem is that once you see a match state in // leftmost-first semantics, you can't just stop at that point and // report a match. You have to keep going until you either hit a dead // state or EOF. So how do you know when you'll hit a dead state? Well, // you don't. With Aho-Corasick, I believe you can put a bound on it // and say, "once a match has been seen, you'll need to scan forward at // most N bytes" where N=aut.max_pattern_len(). // // Which is fine, but it does mean that state about whether we're still // looking for a dead state or not needs to persist across buffer // refills. Which this code doesn't really handle. It does preserve // *some* state across buffer refills, basically ensuring that a match // span is always in memory. if !aut.match_kind().is_standard() { return Err(MatchError::unsupported_stream(aut.match_kind())); } // This is kind of a cop-out, but empty matches are SUPER annoying. // If we know they can't happen (which is what we enforce here), then // it makes a lot of logic much simpler. With that said, I'm open to // supporting this case, but we need to define proper semantics for it // first. It wasn't totally clear to me what it should do at the time // of writing, so I decided to just be conservative. // // It also seems like a very weird case to support anyway. Why search a // stream if you're just going to get a match at every position? // // ¯\_(ツ)_/¯ if aut.min_pattern_len() == 0 { return Err(MatchError::unsupported_empty()); } let start = aut.start_state(Anchored::No)?; Ok(StreamChunkIter { aut, rdr, buf: crate::util::buffer::Buffer::new(aut.max_pattern_len()), start, sid: start, absolute_pos: 0, buffer_pos: 0, buffer_reported_pos: 0, }) } fn next(&mut self) -> Option> { // This code is pretty gnarly. It IS simpler than the equivalent code // in the previous aho-corasick release, in part because we inline // automaton traversal here and also in part because we have abdicated // support for automatons that contain an empty pattern. // // I suspect this code could be made a bit simpler by designing a // better buffer abstraction. // // But in general, this code is basically write-only. So you'll need // to go through it step-by-step to grok it. One of the key bits of // complexity is tracking a few different offsets. 'buffer_pos' is // where we are in the buffer for search. 'buffer_reported_pos' is the // position immediately following the last byte in the buffer that // we've returned to the caller. And 'absolute_pos' is the overall // current absolute position of the search in the entire stream, and // this is what match spans are reported in terms of. loop { if self.aut.is_match(self.sid) { let mat = self.get_match(); if let Some(r) = self.get_non_match_chunk(mat) { self.buffer_reported_pos += r.len(); let bytes = &self.buf.buffer()[r]; return Some(Ok(StreamChunk::NonMatch { bytes })); } self.sid = self.start; let r = self.get_match_chunk(mat); self.buffer_reported_pos += r.len(); let bytes = &self.buf.buffer()[r]; return Some(Ok(StreamChunk::Match { bytes, mat })); } if self.buffer_pos >= self.buf.buffer().len() { if let Some(r) = self.get_pre_roll_non_match_chunk() { self.buffer_reported_pos += r.len(); let bytes = &self.buf.buffer()[r]; return Some(Ok(StreamChunk::NonMatch { bytes })); } if self.buf.buffer().len() >= self.buf.min_buffer_len() { self.buffer_pos = self.buf.min_buffer_len(); self.buffer_reported_pos -= self.buf.buffer().len() - self.buf.min_buffer_len(); self.buf.roll(); } match self.buf.fill(&mut self.rdr) { Err(err) => return Some(Err(err)), Ok(true) => {} Ok(false) => { // We've hit EOF, but if there are still some // unreported bytes remaining, return them now. if let Some(r) = self.get_eof_non_match_chunk() { self.buffer_reported_pos += r.len(); let bytes = &self.buf.buffer()[r]; return Some(Ok(StreamChunk::NonMatch { bytes })); } // We've reported everything! return None; } } } let start = self.absolute_pos; for &byte in self.buf.buffer()[self.buffer_pos..].iter() { self.sid = self.aut.next_state(Anchored::No, self.sid, byte); self.absolute_pos += 1; if self.aut.is_match(self.sid) { break; } } self.buffer_pos += self.absolute_pos - start; } } /// Return a match chunk for the given match. It is assumed that the match /// ends at the current `buffer_pos`. fn get_match_chunk(&self, mat: Match) -> core::ops::Range { let start = self.buffer_pos - mat.len(); let end = self.buffer_pos; start..end } /// Return a non-match chunk, if necessary, just before reporting a match. /// This returns `None` if there is nothing to report. Otherwise, this /// assumes that the given match ends at the current `buffer_pos`. fn get_non_match_chunk( &self, mat: Match, ) -> Option> { let buffer_mat_start = self.buffer_pos - mat.len(); if buffer_mat_start > self.buffer_reported_pos { let start = self.buffer_reported_pos; let end = buffer_mat_start; return Some(start..end); } None } /// Look for any bytes that should be reported as a non-match just before /// rolling the buffer. /// /// Note that this only reports bytes up to `buffer.len() - /// min_buffer_len`, as it's not possible to know whether the bytes /// following that will participate in a match or not. fn get_pre_roll_non_match_chunk(&self) -> Option> { let end = self.buf.buffer().len().saturating_sub(self.buf.min_buffer_len()); if self.buffer_reported_pos < end { return Some(self.buffer_reported_pos..end); } None } /// Return any unreported bytes as a non-match up to the end of the buffer. /// /// This should only be called when the entire contents of the buffer have /// been searched and EOF has been hit when trying to fill the buffer. fn get_eof_non_match_chunk(&self) -> Option> { if self.buffer_reported_pos < self.buf.buffer().len() { return Some(self.buffer_reported_pos..self.buf.buffer().len()); } None } /// Return the match at the current position for the current state. /// /// This panics if `self.aut.is_match(self.sid)` isn't true. fn get_match(&self) -> Match { get_match(self.aut, self.sid, 0, self.absolute_pos) } } /// A single chunk yielded by the stream chunk iterator. /// /// The `'r` lifetime refers to the lifetime of the stream chunk iterator. #[cfg(feature = "std")] #[derive(Debug)] enum StreamChunk<'r> { /// A chunk that does not contain any matches. NonMatch { bytes: &'r [u8] }, /// A chunk that precisely contains a match. Match { bytes: &'r [u8], mat: Match }, } #[inline(never)] pub(crate) fn try_find_fwd( aut: &A, input: &Input<'_>, ) -> Result, MatchError> { if input.is_done() { return Ok(None); } let earliest = aut.match_kind().is_standard() || input.get_earliest(); if input.get_anchored().is_anchored() { try_find_fwd_imp(aut, input, None, Anchored::Yes, earliest) } else if let Some(pre) = aut.prefilter() { if earliest { try_find_fwd_imp(aut, input, Some(pre), Anchored::No, true) } else { try_find_fwd_imp(aut, input, Some(pre), Anchored::No, false) } } else { if earliest { try_find_fwd_imp(aut, input, None, Anchored::No, true) } else { try_find_fwd_imp(aut, input, None, Anchored::No, false) } } } #[inline(always)] fn try_find_fwd_imp( aut: &A, input: &Input<'_>, pre: Option<&Prefilter>, anchored: Anchored, earliest: bool, ) -> Result, MatchError> { let mut sid = aut.start_state(input.get_anchored())?; let mut at = input.start(); let mut mat = None; if aut.is_match(sid) { mat = Some(get_match(aut, sid, 0, at)); if earliest { return Ok(mat); } } if let Some(pre) = pre { match pre.find_in(input.haystack(), input.get_span()) { Candidate::None => return Ok(None), Candidate::Match(m) => return Ok(Some(m)), Candidate::PossibleStartOfMatch(i) => { at = i; } } } while at < input.end() { // I've tried unrolling this loop and eliding bounds checks, but no // matter what I did, I could not observe a consistent improvement on // any benchmark I could devise. (If someone wants to re-litigate this, // the way to do it is to add an 'next_state_unchecked' method to the // 'Automaton' trait with a default impl that uses 'next_state'. Then // use 'aut.next_state_unchecked' here and implement it on DFA using // unchecked slice index acces.) sid = aut.next_state(anchored, sid, input.haystack()[at]); if aut.is_special(sid) { if aut.is_dead(sid) { return Ok(mat); } else if aut.is_match(sid) { // We use 'at + 1' here because the match state is entered // at the last byte of the pattern. Since we use half-open // intervals, the end of the range of the match is one past the // last byte. let m = get_match(aut, sid, 0, at + 1); // For the automata in this crate, we make a size trade off // where we reuse the same automaton for both anchored and // unanchored searches. We achieve this, principally, by simply // not following failure transitions while computing the next // state. Instead, if we fail to find the next state, we return // a dead state, which instructs the search to stop. (This // is why 'next_state' needs to know whether the search is // anchored or not.) In addition, we have different start // states for anchored and unanchored searches. The latter has // a self-loop where as the former does not. // // In this way, we can use the same trie to execute both // anchored and unanchored searches. There is a catch though. // When building an Aho-Corasick automaton for unanchored // searches, we copy matches from match states to other states // (which would otherwise not be match states) if they are // reachable via a failure transition. In the case of an // anchored search, we *specifically* do not want to report // these matches because they represent matches that start past // the beginning of the search. // // Now we could tweak the automaton somehow to differentiate // anchored from unanchored match states, but this would make // 'aut.is_match' and potentially 'aut.is_special' slower. And // also make the automaton itself more complex. // // Instead, we insert a special hack: if the search is // anchored, we simply ignore matches that don't begin at // the start of the search. This is not quite ideal, but we // do specialize this function in such a way that unanchored // searches don't pay for this additional branch. While this // might cause a search to continue on for more than it // otherwise optimally would, it will be no more than the // longest pattern in the automaton. The reason for this is // that we ensure we don't follow failure transitions during // an anchored search. Combined with using a different anchored // starting state with no self-loop, we guarantee that we'll // at worst move through a number of transitions equal to the // longest pattern. // // Now for DFAs, the whole point of them is to eliminate // failure transitions entirely. So there is no way to say "if // it's an anchored search don't follow failure transitions." // Instead, we actually have to build two entirely separate // automatons into the transition table. One with failure // transitions built into it and another that is effectively // just an encoding of the base trie into a transition table. // DFAs still need this check though, because the match states // still carry matches only reachable via a failure transition. // Why? Because removing them seems difficult, although I // haven't given it a lot of thought. if !(anchored.is_anchored() && m.start() > input.start()) { mat = Some(m); if earliest { return Ok(mat); } } } else if let Some(pre) = pre { // If we're here, we know it's a special state that is not a // dead or a match state AND that a prefilter is active. Thus, // it must be a start state. debug_assert!(aut.is_start(sid)); // We don't care about 'Candidate::Match' here because if such // a match were possible, it would have been returned above // when we run the prefilter before walking the automaton. let span = Span::from(at..input.end()); match pre.find_in(input.haystack(), span).into_option() { None => return Ok(None), Some(i) => { if i > at { at = i; continue; } } } } else { // When pre.is_none(), then starting states should not be // treated as special. That is, without a prefilter, is_special // should only return true when the state is a dead or a match // state. // // It is possible to execute a search without a prefilter even // when the underlying searcher has one: an anchored search. // But in this case, the automaton makes it impossible to move // back to the start state by construction, and thus, we should // never reach this branch. debug_assert!(false, "unreachable"); } } at += 1; } Ok(mat) } #[inline(never)] fn try_find_overlapping_fwd( aut: &A, input: &Input<'_>, state: &mut OverlappingState, ) -> Result<(), MatchError> { state.mat = None; if input.is_done() { return Ok(()); } // Searching with a pattern ID is always anchored, so we should only ever // use a prefilter when no pattern ID is given. if aut.prefilter().is_some() && !input.get_anchored().is_anchored() { let pre = aut.prefilter().unwrap(); try_find_overlapping_fwd_imp(aut, input, Some(pre), state) } else { try_find_overlapping_fwd_imp(aut, input, None, state) } } #[inline(always)] fn try_find_overlapping_fwd_imp( aut: &A, input: &Input<'_>, pre: Option<&Prefilter>, state: &mut OverlappingState, ) -> Result<(), MatchError> { let mut sid = match state.id { None => { let sid = aut.start_state(input.get_anchored())?; // Handle the case where the start state is a match state. That is, // the empty string is in our automaton. We report every match we // can here before moving on and updating 'state.at' and 'state.id' // to find more matches in other parts of the haystack. if aut.is_match(sid) { let i = state.next_match_index.unwrap_or(0); let len = aut.match_len(sid); if i < len { state.next_match_index = Some(i + 1); state.mat = Some(get_match(aut, sid, i, input.start())); return Ok(()); } } state.at = input.start(); state.id = Some(sid); state.next_match_index = None; state.mat = None; sid } Some(sid) => { // If we still have matches left to report in this state then // report them until we've exhausted them. Only after that do we // advance to the next offset in the haystack. if let Some(i) = state.next_match_index { let len = aut.match_len(sid); if i < len { state.next_match_index = Some(i + 1); state.mat = Some(get_match(aut, sid, i, state.at + 1)); return Ok(()); } // Once we've reported all matches at a given position, we need // to advance the search to the next position. state.at += 1; state.next_match_index = None; state.mat = None; } sid } }; while state.at < input.end() { sid = aut.next_state( input.get_anchored(), sid, input.haystack()[state.at], ); if aut.is_special(sid) { state.id = Some(sid); if aut.is_dead(sid) { return Ok(()); } else if aut.is_match(sid) { state.next_match_index = Some(1); state.mat = Some(get_match(aut, sid, 0, state.at + 1)); return Ok(()); } else if let Some(pre) = pre { // If we're here, we know it's a special state that is not a // dead or a match state AND that a prefilter is active. Thus, // it must be a start state. debug_assert!(aut.is_start(sid)); let span = Span::from(state.at..input.end()); match pre.find_in(input.haystack(), span).into_option() { None => return Ok(()), Some(i) => { if i > state.at { state.at = i; continue; } } } } else { // When pre.is_none(), then starting states should not be // treated as special. That is, without a prefilter, is_special // should only return true when the state is a dead or a match // state. // // ... except for one special case: in stream searching, we // currently call overlapping search with a 'None' prefilter, // regardless of whether one exists or not, because stream // searching can't currently deal with prefilters correctly in // all cases. } } state.at += 1; } state.id = Some(sid); Ok(()) } #[inline(always)] fn get_match( aut: &A, sid: StateID, index: usize, at: usize, ) -> Match { let pid = aut.match_pattern(sid, index); let len = aut.pattern_len(pid); Match::new(pid, (at - len)..at) } /// Write a prefix "state" indicator for fmt::Debug impls. It always writes /// exactly two printable bytes to the given formatter. /// /// Specifically, this tries to succinctly distinguish the different types of /// states: dead states, start states and match states. It even accounts for /// the possible overlappings of different state types. (The only possible /// overlapping is that of match and start states.) pub(crate) fn fmt_state_indicator( f: &mut core::fmt::Formatter<'_>, aut: A, id: StateID, ) -> core::fmt::Result { if aut.is_dead(id) { write!(f, "D ")?; } else if aut.is_match(id) { if aut.is_start(id) { write!(f, "*>")?; } else { write!(f, "* ")?; } } else if aut.is_start(id) { write!(f, " >")?; } else { write!(f, " ")?; } Ok(()) } /// Return an iterator of transitions in a sparse format given an iterator /// of all explicitly defined transitions. The iterator yields ranges of /// transitions, such that any adjacent transitions mapped to the same /// state are combined into a single range. pub(crate) fn sparse_transitions<'a>( mut it: impl Iterator + 'a, ) -> impl Iterator + 'a { let mut cur: Option<(u8, u8, StateID)> = None; core::iter::from_fn(move || { while let Some((class, next)) = it.next() { let (prev_start, prev_end, prev_next) = match cur { Some(x) => x, None => { cur = Some((class, class, next)); continue; } }; if prev_next == next { cur = Some((prev_start, class, prev_next)); } else { cur = Some((class, class, next)); return Some((prev_start, prev_end, prev_next)); } } if let Some((start, end, next)) = cur.take() { return Some((start, end, next)); } None }) }