Commit 9d072ce

Minor updates and tweaks
1 parent 0b0fca3 commit 9d072ce

1 file changed: +5 -5 lines changed

docs/standard/base-types/regular-expressions-in-depth.md

Lines changed: 5 additions & 5 deletions
@@ -1076,14 +1076,14 @@ As noted earlier when talking about `IgnoreCase`, vectorization is the idea that
 
 One of the most important places for vectorization in a regex engine is when finding the next location a pattern could possibly match. For longer input text being searched, the time to find matches is frequently dominated by this aspect. As such, as of .NET 6, `Regex` had various tricks in place to get to those locations as quickly as possible:
 
-- **Anchors**. For patterns that began with an anchor, it could either avoid doing any searching if there was only one place the pattern could possibly begin (e.g. a "beginning" anchor, like `^` or `\A`), and it could skip past text it knew couldn't match (e.g. `IndexOf('\n')` for a "beginning-of-line" anchor if not currently at the beginning of a line).
-- **Boyer-Moore**. For patterns beginning with a sequence of at least two characters (case-sensitive or case-insensitive), it could use a [Boyer-Moore](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm) search to find the next occurrence of that sequence in the input text.
-- **IndexOf(char)**. For patterns beginning with a single case-sensitive character, it could use `IndexOf(char)` to find the next possible match location.
-- **IndexOfAny(char, char, ...)**. For patterns beginning with one of only a few case-sensitive characters, it could use `IndexOfAny(...)` with those characters to find the next possible match location.
+- **Anchors**: For patterns that began with an anchor, it could either avoid doing any searching if there was only one place the pattern could possibly begin (e.g. a "beginning" anchor, like `^` or `\A`), and it could skip past text it knew couldn't match (e.g. `IndexOf('\n')` for a "beginning-of-line" anchor if not currently at the beginning of a line).
+- **Boyer-Moore**: For patterns beginning with a sequence of at least two characters (case-sensitive or case-insensitive), it could use a [Boyer-Moore](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm) search to find the next occurrence of that sequence in the input text.
+- **IndexOf(char)**: For patterns beginning with a single case-sensitive character, it could use `IndexOf(char)` to find the next possible match location.
+- **IndexOfAny(char, char, ...)**: For patterns beginning with one of only a few case-sensitive characters, it could use `IndexOfAny(...)` with those characters to find the next possible match location.
 
 These optimizations are all really useful, but there are many additional possible solutions that .NET 7 now takes advantage of:
 
-- **Goodbye, Boyer-Moore**. `Regex` has used the Boyer-Moore algorithm since `Regex`'s earliest days; the `RegexCompiler` even emitted a customized implementation in order to maximize throughput. However, Boyer-Moore was created at a time when vector instruction sets weren't yet a reality. Most modern hardware can examine 8 or 16 16-bit `char`s in just a few instructions, whereas with Boyer-Moore, it's rare to be able to skip that many at a time (the most it can possibly skip at a time is the length of the substring for which it's searching). In the aforementioned corpus of ~19,000 regular expressions, ~50% of those expressions that begin with a case-sensitive prefix of at least two characters have a prefix less than or equal to four characters, and ~75% are less than or equal to eight characters. Moreover, the Boyer-Moore algorithm works by choosing a single character to examine in order to perform each jump, but a well-vectorized algorithm can simultaneously compare multiple characters, such as the first and last in the prefix (as described in [SIMD-friendly algorithms for substring searching](http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd)), enabling it to stay in the inner vectorized loop longer. In .NET 7, `IndexOf` performing an ordinal search for a string has been significantly improved with such tricks, and now in .NET 7, `Regex` uses `IndexOf` rather than Boyer-Moore, the implementation of which has been deleted (this was inspired by Rust's regex crate making a similar change [last year](https://github.com/rust-lang/regex/pull/767)). You can see the impact of this on a micro-benchmark like the following, which is finding every word in a document, creating a `Regex` for that word, and then using each `Regex` to find all occurrences of each word in the document (this would be an ideal use for the new `Count` method, but I'm not using it here as it doesn't exist in the previous releases being compared):
+- **Goodbye, Boyer-Moore**: `Regex` has used the Boyer-Moore algorithm since `Regex`'s earliest days; the `RegexCompiler` even emitted a customized implementation in order to maximize throughput. However, Boyer-Moore was created at a time when vector instruction sets weren't yet a reality. Most modern hardware can examine 8 or 16 16-bit `char`s in just a few instructions, whereas with Boyer-Moore, it's rare to be able to skip that many at a time (the most it can possibly skip at a time is the length of the substring for which it's searching). In the aforementioned corpus of ~19,000 regular expressions, ~50% of those expressions that begin with a case-sensitive prefix of at least two characters have a prefix less than or equal to four characters, and ~75% are less than or equal to eight characters. Moreover, the Boyer-Moore algorithm works by choosing a single character to examine in order to perform each jump, but a well-vectorized algorithm can simultaneously compare multiple characters, such as the first and last in the prefix (as described in [SIMD-friendly algorithms for substring searching](http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd)), enabling it to stay in the inner vectorized loop longer. In .NET 7, `IndexOf` performing an ordinal search for a string has been significantly improved with such tricks, and now in .NET 7, `Regex` uses `IndexOf` rather than Boyer-Moore, the implementation of which has been deleted (this was inspired by Rust's regex crate making a similar change [last year](https://github.com/rust-lang/regex/pull/767)). You can see the impact of this on a micro-benchmark like the following, which is finding every word in a document, creating a `Regex` for that word, and then using each `Regex` to find all occurrences of each word in the document (this would be an ideal use for the new `Count` method, but I'm not using it here as it doesn't exist in the previous releases being compared):
 
 ```csharp
 private string _text;
