-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-47776][SS] Disallow binary inequality collation be used in key schema of stateful operator #45951
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… schema of stateful operator
|
We can have O(1) lookup for string with collation values, using a sort key. WIP PR: #45929 |
|
What about disallowing this first? It's not only streaming aggregation. We will need to do the same approach for every stateful operator. Before we see the clear path forward, let's prevent the correctness issue for now. |
| storeProvider.getStore(version) | ||
| } | ||
|
|
||
| private def disallowBinaryInequalityColumn(schema: StructType): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use following utility function:
def isBinaryStable(dataType: DataType): Boolean = !dataType.existsRecursively {
case st: StringType => !CollationFactory.fetchCollation(st.collationId).supportsBinaryEquality
case _ => false
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done!
Yeah, let's go with this change. Once hash agg support change goes in we can try to enable the streaming code. Btw, thank you for adding the test! We will use that as the starting point. |
|
GA only failed in Docker integration which is unrelated. |
|
Merging to master. |
What changes were proposed in this pull request?
This PR proposes to disallow using binary inequality collation column in the key schema of stateful operator. Worth noting that changing the collation for the same string column during the query restart was already disallowed at the time of introduction of collation.
Why are the changes needed?
state store API is heavily relying on the fact that provider implementation performs O(1)-like get and put operation. While the actual implementation would be dependent on the state store provider, it is intuitive to assume that these providers only do lookup of the key based on binary format (implying binary equality).
That said, even though the column spec is case insensitive, state store API wouldn't take this into consideration, and could lead to produce the wrong result. e.g. Determiniing 'a' and 'A' differently while the column is case insensitive.
Does this PR introduce any user-facing change?
No, as it wasn't released yet.
How was this patch tested?
New UTs.
Was this patch authored or co-authored using generative AI tooling?
No.