-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Remove early exit logic in GetSortKey-related code paths #1078
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Recommend that reviewers suppress white space changes to make the code easier to review: https://github.com/dotnet/runtime/pull/1078/files?w=1 |
| SortKey sortKeyForEmptyString = compareInfo.GetSortKey(""); | ||
| SortKey sortKeyForZeroWidthJoiner = compareInfo.GetSortKey("\u200c"); | ||
| Assert.NotEqual(0, SortKey.Compare(sortKeyForEmptyString, sortKeyForZeroWidthJoiner)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the expected result here?
Console.WriteLine(SortKey.Compare(
CultureInfo.InvariantCulture.CompareInfo.GetSortKey("\uFFFD"),
CultureInfo.InvariantCulture.CompareInfo.GetSortKey("\u200C")));There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that on NLS (Windows) they're equal and that on ICU (non-Windows) they're unequal, but I'm not current on a box where I can easily check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked this today - my statement above is correct. Under the default / invariant culture, on ICU "\ufffd" and "\u200c" are not equal, but using Windows's CompareStringEx they are equal.
|
I'll send a separate PR to handle #1060. It requires a fix to |
| Debug.Assert(!GlobalizationMode.Invariant); | ||
| Debug.Assert((options & (CompareOptions.Ordinal | CompareOptions.OrdinalIgnoreCase)) == 0); | ||
|
|
||
| if (source.Length == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please check if Windows/ICU returns the same hash value for the empty string with all locales or cultures? I am asking because if this is the case, we can cache this value and keep the fast path for the empty strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, I am not actively following up so either don't block on me or wait till I am back. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or at least maybe we could cache it for invariant culture (if not already)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this but currently I don't have any evidence that anybody is trying to compute the culture-aware hash code of an empty string in a hot path.
And yeah, no rush whatsoever on this. It can wait until the new year. :)
Fixes #1062.
When generating sort keys, certain code points such as
U+200CandU+FFFD(depending on OS) are given zero weight and are ignored during the sort key calculation. This means, e.g., that the strings"ab"and"a\u200cb"can compare as equal using some culture-aware comparers.One consequence of this is that the empty string can compare as equal to some non-empty strings if they contain only zero-weight code points. For instance,
""and"\u200c"should compare as equal under the invariant culture-aware comparer, which means that they should also have the same sort key. However, this conflicts with the optimization we have at the entry point to most of our sort key generation routines, where the early-exit code optimistically assumes that the sort key of the empty string will be distinct from all possible sort keys of non-empty strings.The fix is to remove the early-exit code and to always call into ICU / NLS for sort key generation.