Skip to content

Conversation

@GrabYourPitchforks
Copy link
Member

Fixes #1062.

When generating sort keys, certain code points such as U+200C and U+FFFD (depending on OS) are given zero weight and are ignored during the sort key calculation. This means, e.g., that the strings "ab" and "a\u200cb" can compare as equal using some culture-aware comparers.

One consequence of this is that the empty string can compare as equal to some non-empty strings if they contain only zero-weight code points. For instance, "" and "\u200c" should compare as equal under the invariant culture-aware comparer, which means that they should also have the same sort key. However, this conflicts with the optimization we have at the entry point to most of our sort key generation routines, where the early-exit code optimistically assumes that the sort key of the empty string will be distinct from all possible sort keys of non-empty strings.

The fix is to remove the early-exit code and to always call into ICU / NLS for sort key generation.

@GrabYourPitchforks
Copy link
Member Author

Recommend that reviewers suppress white space changes to make the code easier to review: https://github.com/dotnet/runtime/pull/1078/files?w=1

SortKey sortKeyForEmptyString = compareInfo.GetSortKey("");
SortKey sortKeyForZeroWidthJoiner = compareInfo.GetSortKey("\u200c");
Assert.NotEqual(0, SortKey.Compare(sortKeyForEmptyString, sortKeyForZeroWidthJoiner));
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the expected result here?

        Console.WriteLine(SortKey.Compare(
            CultureInfo.InvariantCulture.CompareInfo.GetSortKey("\uFFFD"),
            CultureInfo.InvariantCulture.CompareInfo.GetSortKey("\u200C")));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that on NLS (Windows) they're equal and that on ICU (non-Windows) they're unequal, but I'm not current on a box where I can easily check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked this today - my statement above is correct. Under the default / invariant culture, on ICU "\ufffd" and "\u200c" are not equal, but using Windows's CompareStringEx they are equal.

@GrabYourPitchforks
Copy link
Member Author

I'll send a separate PR to handle #1060. It requires a fix to string.Replace and related APIs.

Debug.Assert(!GlobalizationMode.Invariant);
Debug.Assert((options & (CompareOptions.Ordinal | CompareOptions.OrdinalIgnoreCase)) == 0);

if (source.Length == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please check if Windows/ICU returns the same hash value for the empty string with all locales or cultures? I am asking because if this is the case, we can cache this value and keep the fast path for the empty strings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I am not actively following up so either don't block on me or wait till I am back. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or at least maybe we could cache it for invariant culture (if not already)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this but currently I don't have any evidence that anybody is trying to compute the culture-aware hash code of an empty string in a hot path.

And yeah, no rush whatsoever on this. It can wait until the new year. :)

@jkotas jkotas merged commit 480acec into dotnet:master Dec 27, 2019
@GrabYourPitchforks GrabYourPitchforks deleted the getsortkey_empty branch January 6, 2020 21:43
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Equals and GetHashCode mismatch with StringComparer.InvariantCulture

5 participants