-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Delete AVX512 paths from IndexOf(string)
#117865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
MihaZupan
wants to merge
1
commit into
dotnet:main
from
MihaZupan:searchvalues-singleString-dropAvx512
Closed
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd really rather we not delete this. The issue isn't really V512, but the algorithm/loop being suboptimal for all the vector paths here, this is particularly prevalent from the scalar fallback which causes it to pessimize more for larger vector sizes.
Fixing it isn't that much more work and would be a bigger win.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by scalar fallback in this case?
Throughput numbers are just stressing the vectorized inner loop for large inputs with no matches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's what the main loops look like in this case:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
ShortInputpath that is hit is pessimized for larger vector sizes as it must process 2-4x as many elements as non-vectorized. While this doesn't necessarily show up for some bigger inputs, it will show up for small inputs and for inputs that have trailing elements. Additionally, for the whole method the non-idiomatic loops with goto can pessimize various JIT optimizations, control flow analysis, and other things.Longer term, these should all be rewritten to follow a "better" pattern which helps minimize the dispatch branching and which allows the trailing elements to also be vectorized. -- Ideally we're generally using a pattern like
TensorPrimitivesuses, where the core loop/trailing logic is centralized and we're just specifying the inner loops and exit conditions.The problem with the main loop is the
vpcmpeqw, vpmovm2wsequences. This is a really trivially issue related to the fact that the bitwise operands (and/andn/or/xor) are normalized to having a base type ofint/uintsince the underlying instructions only support these sizes due to embedded broadcast/masking support.The check that looks for
and(cvtmasktovec(op1), cvtmasktovec(op2))sequences was looking for all three base types to match, when it actually only needscvtmasktovec(op1)andcvtmasktovec(op2)to match and then the replacementandmask(op1, op2)to track that base type.The following PR resolves that: #117887
Now the codegen still isn't "ideal" because we end up converting the mask to a vector to do the "is there any matches" check (this is the
vpmovm2w, vptestmb, kortestq). That is a little more complicated to fix since it requires moving some of theop_Inequalitytransforms from LIR (lowering) into HIR (morph, valuenum, etc). This is planned work, just not something we've completed yet.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow? This PR doesn't change how short inputs behave here.
The
ShortInputpath is only used for inputs relative toVector128's length. When it's taken does not depend on whether the system has Avx512 or Avx2 support. Trailing elements are also processed with a vectorized step.Thanks! I'll double-check what the numbers look like with your change.
Assuming it's still worse/not better compared to Vector256 paths, does it make sense to keep around?
E.g. We've reverted Avx512 support from
IndexOfAnyAsciiSearcherover much smaller regressions even though there are meaningful throughput benefits on longer inputs there (#93222), whereas it's just worse across the board here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a general comment about how this and several other vectorized code paths in corelib are written in a way, in general, that pessimizes the larger vector sizes and/or small inptus. It wasn't a comment about the changes in this PR, rather just a general "issue" that helps make V512 perform worse than it should. If we were to fix those issues, all the paths should get faster
I believe it's still worth keeping and to continue incrementally tracking the improvements around. The more we revert, the harder it is to test/validate the improvements as they go in. Which applies to AVX512 and SVE alike, both of which have different considerations for mask handling.
The long term goal is to have centralized SIMD looping logic and to utilize things like (the currently internal)
ISimdVector, we're getting closer to that each release and continuing to get large improvements to the handling and codegen across the board.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-- A lot of the places where the perf is found to be suboptimal bad are also fairly trivial fixes, like the one I did. If we file issues for them and go and ensure the pattern recognition is being handled correctly, it is far better for all the vector paths. The same goes for utilizing the helpers like the
Any, All, None, Count, IndexOf, and LastIndexOfhelpers that now exist on the vector types.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With your change the Regex SliceSlice benchmark (just a ton of
span.IndexOf(string)-like searches) shows a 40% regression (as in taking 1.4x as long) with the Avx512 paths compared to Avx2 on Zen hardware.Should we reconsider targeted arch-specific opt outs for such cases if performance diverges this much, and we consider affected code paths as important?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally don't do target specific opt outs.
If you're really that concerned with the regression and it showing up in real world, then I'd just go with the removal for now. But please ensure a tracking issue is filed to ensure it is added back when the direct
kortestlogic is added in .NET 11.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With #118108 now merged, Regex won't hit these paths anymore for ASCII values, so I'm less concerned about the real-world impact.
IndexOf(string)is still impacted, but switching toSearchValues<string>can mitigate that.In general it is unfortunate that we would keep around Vector512 paths if they aren't improving perf, but hopefully potntial future changes you mentioned can help here.