Skip to content

Conversation

@vladanvasi-db
Copy link
Contributor

@vladanvasi-db vladanvasi-db commented Sep 16, 2024

What changes were proposed in this pull request?

In this PR, I propose to disallow CS_AI collated strings in expressions that use StringsSearch in their implementation. These expressions are trim, startswith, endswith, locate, instr, str_to_map, contains, replace, split_part and substring_index.

Currently, these expressions support all possible collations, however, they do not work properly with CS_AI collators. This is because there is no support for CS_AI search in the ICU's StringSearch class which is used to implement these expressions. Therefore, the expressions are not behaving correctly when used with CS_AI collators (e.g. currently startswith('hOtEl' collate unicode_ai, 'Hotel' collate unicode_ai) returns true).

Why are the changes needed?

Proposed changes are necessary in order to achieve correct behavior of the expressions mentioned above.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This patch was tested by adding a test in the CollationSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Sep 16, 2024
@HyukjinKwon
Copy link
Member

... disallow CS_AI collated strings in expressions that use StringsSearch in their implementation.

Do you mind explaining, in the PR description, why they are disallowed?

Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some more comments, but overall looking better to me

on another note - I think we should consider adding a new error class (instead of relying on DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE here)

one reason for this is that UNEXPECTED_INPUT_TYPE will give an error message that looks something like:

"Parameter ... of function ... requires the STRING type, however ... has the type STRING COLLATE UNICODE_AI."

which I think is rather confusing

another reason is that this is indeed a special case, and shouldn't use such a generic error condition - instead, we should add a new one that offers a better explanation to the user

@HyukjinKwon HyukjinKwon changed the title [SPARK-49667] Disallowed CS_AI collators with expressions that use StringSearch [SPARK-49667][SQL[ Disallowed CS_AI collators with expressions that use StringSearch Sep 19, 2024
@HyukjinKwon HyukjinKwon changed the title [SPARK-49667][SQL[ Disallowed CS_AI collators with expressions that use StringSearch [SPARK-49667][SQL] Disallowed CS_AI collators with expressions that use StringSearch Sep 19, 2024
@vladanvasi-db
Copy link
Contributor Author

left some more comments, but overall looking better to me

on another note - I think we should consider adding a new error class (instead of relying on DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE here)

one reason for this is that UNEXPECTED_INPUT_TYPE will give an error message that looks something like:

"Parameter ... of function ... requires the STRING type, however ... has the type STRING COLLATE UNICODE_AI."

which I think is rather confusing

another reason is that this is indeed a special case, and shouldn't use such a generic error condition - instead, we should add a new one that offers a better explanation to the user
I partially agree, but I would still use this class instead of adding the new error class. This class is being used in Analyzer and is in deed thrown when the input types are not correct/supported. In this case, I think that adding a new error class would just mean adding more secondary changes that in this PR that would make it more complicated and are not directly related to the main point of it. I would propose to leave this PR as it is with DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE and if it is not suitable for users, we can refactor it in a followup PR. However, this PR is meant for ensuring the correct behavior of expressions.

@vladanvasi-db vladanvasi-db force-pushed the vladanvasi-db/cs-ai-collations-expressions-disablement branch from 123a2ff to 6292cd1 Compare September 19, 2024 06:52
Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, changes in this PR are ensuring core funcionality (block expressions for _CS_AI)
so I agree - if we decide to add a new error class, then we can do that in a follow-up

@vladanvasi-db vladanvasi-db force-pushed the vladanvasi-db/cs-ai-collations-expressions-disablement branch from 6292cd1 to d8bf03f Compare September 19, 2024 08:19
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in a060c23 Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants