docs/standard/base-types/regular-expressions-in-depth.md (5 additions & 5 deletions)
@@ -1076,14 +1076,14 @@ As noted earlier when talking about `IgnoreCase`, vectorization is the idea that
One of the most important places for vectorization in a regex engine is when finding the next location a pattern could possibly match. For longer input text being searched, the time to find matches is frequently dominated by this aspect. As such, as of .NET 6, `Regex` had various tricks in place to get to those locations as quickly as possible:
-
- **Anchors**. For patterns that began with an anchor, it could either avoid doing any searching if there was only one place the pattern could possibly begin (e.g. a "beginning" anchor, like `^` or `\A`), and it could skip past text it knew couldn't match (e.g. `IndexOf('\n')` for a "beginning-of-line" anchor if not currently at the beginning of a line).
-
- **Boyer-Moore**. For patterns beginning with a sequence of at least two characters (case-sensitive or case-insensitive), it could use a [Boyer-Moore](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm) search to find the next occurrence of that sequence in the input text.
+
- **Anchors**: For patterns that began with an anchor, it could either avoid doing any searching if there was only one place the pattern could possibly begin (e.g. a "beginning" anchor, like `^` or `\A`), and it could skip past text it knew couldn't match (e.g. `IndexOf('\n')` for a "beginning-of-line" anchor if not currently at the beginning of a line).
+
- **Boyer-Moore**: For patterns beginning with a sequence of at least two characters (case-sensitive or case-insensitive), it could use a [Boyer-Moore](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm) search to find the next occurrence of that sequence in the input text.
-
- **Goodbye, Boyer-Moore**. `Regex` has used the Boyer-Moore algorithm since `Regex`'s earliest days; the `RegexCompiler` even emitted a customized implementation in order to maximize throughput. However, Boyer-Moore was created at a time when vector instruction sets weren't yet a reality. Most modern hardware can examine 8 or 16 16-bit `char`s in just a few instructions, whereas with Boyer-Moore, it's rare to be able to skip that many at a time (the most it can possibly skip at a time is the length of the substring for which it's searching). In the aforementioned corpus of ~19,000 regular expressions, ~50% of those expressions that begin with a case-sensitive prefix of at least two characters have a prefix less than or equal to four characters, and ~75% are less than or equal to eight characters. Moreover, the Boyer-Moore algorithm works by choosing a single character to examine in order to perform each jump, but a well-vectorized algorithm can simultaneously compare multiple characters, such as the first and last in the prefix (as described in [SIMD-friendly algorithms for substring searching](http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd)), enabling it to stay in the inner vectorized loop longer. In .NET 7, `IndexOf` performing an ordinal search for a string has been significantly improved with such tricks, and now in .NET 7, `Regex` uses `IndexOf` rather than Boyer-Moore, the implementation of which has been deleted (this was inspired by Rust's regex crate making a similar change [last year](https://github.com/rust-lang/regex/pull/767)). You can see the impact of this on a micro-benchmark like the following, which is finding every word in a document, creating a `Regex` for that word, and then using each `Regex` to find all occurrences of each word in the document (this would be an ideal use for the new `Count` method, but I'm not using it here as it doesn't exist in the previous releases being compared):
+
- **Goodbye, Boyer-Moore**: `Regex` has used the Boyer-Moore algorithm since `Regex`'s earliest days; the `RegexCompiler` even emitted a customized implementation in order to maximize throughput. However, Boyer-Moore was created at a time when vector instruction sets weren't yet a reality. Most modern hardware can examine 8 or 16 16-bit `char`s in just a few instructions, whereas with Boyer-Moore, it's rare to be able to skip that many at a time (the most it can possibly skip at a time is the length of the substring for which it's searching). In the aforementioned corpus of ~19,000 regular expressions, ~50% of those expressions that begin with a case-sensitive prefix of at least two characters have a prefix less than or equal to four characters, and ~75% are less than or equal to eight characters. Moreover, the Boyer-Moore algorithm works by choosing a single character to examine in order to perform each jump, but a well-vectorized algorithm can simultaneously compare multiple characters, such as the first and last in the prefix (as described in [SIMD-friendly algorithms for substring searching](http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd)), enabling it to stay in the inner vectorized loop longer. In .NET 7, `IndexOf` performing an ordinal search for a string has been significantly improved with such tricks, and now in .NET 7, `Regex` uses `IndexOf` rather than Boyer-Moore, the implementation of which has been deleted (this was inspired by Rust's regex crate making a similar change [last year](https://github.com/rust-lang/regex/pull/767)). You can see the impact of this on a micro-benchmark like the following, which is finding every word in a document, creating a `Regex` for that word, and then using each `Regex` to find all occurrences of each word in the document (this would be an ideal use for the new `Count` method, but I'm not using it here as it doesn't exist in the previous releases being compared):
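The benchmark that sentence introduces falls outside this hunk. As a separate illustration, here is a minimal scalar sketch of the "compare the first and last character of the prefix" filter mentioned above. It is written in Python for brevity and is not the .NET `IndexOf` implementation; `find_first_last` is a hypothetical name. A vectorized version would apply the same two comparisons to many candidate positions per instruction, whereas this sketch checks one position at a time.

```python
def find_first_last(haystack: str, needle: str) -> int:
    """Return the index of the first occurrence of needle in haystack, or -1.

    Scalar sketch only: a SIMD version would test the first and last
    characters for a whole vector of candidate positions at once.
    """
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    first, last = needle[0], needle[-1]
    for i in range(n - m + 1):
        # Cheap filter: both end characters must match at this position
        # before paying for the full substring comparison.
        if haystack[i] == first and haystack[i + m - 1] == last:
            if haystack[i:i + m] == needle:
                return i
    return -1
```

Checking two characters that sit far apart rejects most false candidates early, which is why the approach described above can stay in its fast inner loop longer than a single-character probe.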