[SPARK-48281][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringInStr, SubstringIndex)
### What changes were proposed in this pull request?
String searching under UTF8_BINARY_LCASE now works at the character level rather than at the byte level. For example, `instr("İ", "i")` now returns 0, because there exist no `start, len` such that `lowercase(substring("İ", start, len)) == "i"`.
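The matching rule above can be illustrated with a small brute-force sketch. This is not the code from this patch: the helper names `lcase_instr` and `naive_instr` are hypothetical, Python's `str.lower()` stands in for Spark's lowercasing, and lowercasing the pattern as well is an assumption. The naive variant is shown only to illustrate how lowercasing the whole string before searching can report a match that has no corresponding substring in the original.

```python
def lcase_instr(target: str, pattern: str) -> int:
    """Return the 1-based position of the first match, or 0 if there is none.

    A match exists only if some substring of the ORIGINAL target lowercases
    to exactly the lowercased pattern (hypothetical helper, for illustration).
    """
    needle = pattern.lower()  # assumption: the pattern is lowercased too
    n = len(target)
    for start in range(n + 1):
        for length in range(n - start + 1):
            if target[start:start + length].lower() == needle:
                return start + 1
    return 0


def naive_instr(target: str, pattern: str) -> int:
    """Lowercase both strings first, then search the lowercased text.

    Offsets found this way refer to the lowercased string, which can be
    longer than the original when case mapping is one-to-many.
    """
    return target.lower().find(pattern.lower()) + 1


# "İ" (U+0130) lowercases to two code points: "i" followed by U+0307.
print(lcase_instr("İ", "i"))        # 0: no substring of "İ" lowercases to "i"
print(naive_instr("İ", "i"))        # 1: matches inside the expanded lowercase form
print(lcase_instr("İ", "i\u0307"))  # 1: the whole character matches its full lowercase form
```

The quadratic scan is chosen for clarity, since it mirrors the `start, len` formulation directly; a real implementation would walk the string once.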
### Why are the changes needed?
Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_BINARY_LCASE (see example above).
### Does this PR introduce _any_ user-facing change?
Yes, the behaviour of the `instr` and `substring_index` expressions changes for edge cases involving one-to-many case mapping.
### How was this patch tested?
New unit tests in `CollationSupportSuite`.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #46589 from uros-db/alter-lcase-vol2.
Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>