-
Notifications
You must be signed in to change notification settings - Fork 4.9k
[release/3.1] Use Sse2 instrinsics to optimize NeedsEscaping/FindFirstCharToEncode for all built-in JavaScriptEncoders #42030
Conversation
…N strings (dotnet#41845) * Use Sse2 instrinsics to make NeedsEscaping check faster for large strings. * Update the utf-8 bytes needsescaping and add tests. * Remove unnecessary bitwise OR and add more tests * Add more tests around surrogates, invalid strings, and characters > short.MaxValue.
…xed using Sse2 intrinsics. (dotnet#41933) * Optimize FindFirstCharToEncode for JavaScriptEncoder.Default and Relaxed using Sse2 intrinsics. * Create an Sse2Helper and improve perf of TextEncoder and AllowedCharactersBitmap * Loop unroll FindFirstCharacterToEncode * Improve code coverage. * Add more tests for surrogate pairs and fix call to WillEncode. * Address PR feedback - remove some code duplication. * Move DefaultJavaScriptEncoder to separate file and override EncodeUtf8 with better caching. * Add default replacement character as a test. * Address nits.
…sEscaping (dotnet#42023) * When encoder is null, use JavaScriptEncoder.Default to check for NeedsEscaping. * Remove unnecessary unsafe keyword and add comment to using directive. * Address feedback. * Remove gotos and move the IsEmpty check outside the fixed block.
|
|
||
| namespace System.Text.Encodings.Web | ||
| { | ||
| internal sealed class DefaultJavaScriptEncoder : JavaScriptEncoder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the changes in this file are copy-paste from JavaScriptEncoder.cs into a separate file and don't change the existing logic.
|
|
||
| namespace System.Text.Encodings.Web | ||
| { | ||
| internal static class JavaScriptEncoderHelper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is extracting out common code into helpers to avoid duplication but is unchanged from what's already there.
| #if BUILDING_INBOX_LIBRARY | ||
| if (Sse2.IsSupported) | ||
| { | ||
| short* startingAddress = (short*)text; | ||
| while (textLength - 8 >= idx) | ||
| { | ||
| Debug.Assert(startingAddress >= text && startingAddress <= (text + textLength - 8)); | ||
|
|
||
| // Load the next 8 characters. | ||
| Vector128<short> sourceValue = Sse2.LoadVector128(startingAddress); | ||
|
|
||
| // Check if any of the 8 characters need to be escaped. | ||
| Vector128<short> mask = Sse2Helper.CreateEscapingMask_DefaultJavaScriptEncoderBasicLatin(sourceValue); | ||
|
|
||
| int index = Sse2.MoveMask(mask.AsByte()); | ||
| // If index == 0, that means none of the 8 characters needed to be escaped. | ||
| // TrailingZeroCount is relatively expensive, avoid it if possible. | ||
| if (index != 0) | ||
| { | ||
| // Found at least one character that needs to be escaped, figure out the index of | ||
| // the first one found that needed to be escaped within the 8 characters. | ||
| Debug.Assert(index > 0 && index <= 65_535); | ||
| int tzc = BitOperations.TrailingZeroCount(index); | ||
| Debug.Assert(tzc % 2 == 0 && tzc >= 0 && tzc <= 16); | ||
| idx += tzc >> 1; | ||
| goto Return; | ||
| } | ||
| idx += 8; | ||
| startingAddress += 8; | ||
| } | ||
|
|
||
| // Process the remaining characters. | ||
| Debug.Assert(textLength - idx < 8); | ||
| } | ||
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the crux of the optimization (applied to both utf-16 and utf-8 overloads for the different JavaScriptEncoders).
| // null pointers and gaurd against that. Hence, check up-front and fall down to return -1. | ||
| if (encoder != null && !value.IsEmpty) | ||
| // Some implementations of JavaScriptEncoder.FindFirstCharacterToEncode may not accept | ||
| // null pointers and gaurd against that. Hence, check up-front to return -1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny Typo: gaurd -> guard.
steveharter
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on previous review, the changes look correct. However the S.T.E.W owners should approve this since it changes code in that assembly plus adds a netcoreapp30 config that needs to be vetted.
|
Not for Preview 2. Seems too large for 3.1 but let's talk next week. Great change for master. |
| Debug.Assert(opStatus == OperationStatus.Done); | ||
| idx += utf8BytesConsumedForScalar; | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it make sense to add assert after this line idx == utf8Text.Length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The while loop condition already guards for idx < utf8Text.Length, so adding the assert would be primarily to make sure idx isn't greater than utf8Text.Length. Sure, we can add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed this nit in #42064
|
Closing this. I'll re-evaluate if we want this change in 3.1. |
Port the set of performance optimizations related to escaping checks: #41845, #41933, #42023 and add necessary fixes that are dependent on the release/3.1 branch (such as use of TFM based constants instead of the SDK one)
The PR descriptions highlight the performance improvements. Here's a gist with the analysis:
https://gist.github.com/ahsonkhan/c566f5e7d65c1fde5a83a67be290c4ee
Description
No functional/behavioral change is intended by this change, other than perf optimizations.
Here are the key improvements:
This set of optimizations was applied to all the built-in concrete encoders (
Default,UnsafeRelaxedJsonEscaping, ones created from theCreatefactory method, and theTextEncodervirtual methods), and since it was applied on multiple types, some of this change involves refactoring out the common code into helpers/separate files.Customer Impact:
Callers of the S.T.Json
JsonSerializerandUtf8JsonWriterget faster when using any encoder options, particularly for writing strings that are large (>= 16 characters). This change also targets improving users of ASP.NET since they set theJsonSerializerOptionto use the non-defaultJavaScriptEncoderby default (UnsafeRelaxedJsonEscaping).Here's a summary of the results (regardless of whether folks use the default encoder or pass in the custom built-in encoders via the
JsonSerializerOptionsorJsonWriterOptions).SearchResult).By moving the "NeedsEscaping" logic to its correct location - S.T.E.Web and removing duplicate code from S.T.Json leads to better separation of concerns/single responsibility and helps with maintenance by mitigating potential bugs that could occur when we have duplicate logic.
Regression?
No
Risk
The primary risk with this change is the escaping check breaks for some edge case/length. This is mitigated by code review from multiple folks and exhaustive tests (test coverage of the added code is ~100%). The code was also written defensively using Debug.Asserts aggressively to reason about the invariant. Given the performance gains is substantial, and moving the logic into S.T.E.W follows better encapsulation, I think the risk is justified.
Tests run / added
A big part of this change includes adding tests for various encoders and input strings to validate that the right index is being returned for which character needs to be escaped. This includes surrogate pairs, invalid strings, empty strings, etc.
cc @GrabYourPitchforks, @steveharter, @ericstj, @danmosemsft