Skip to content

Conversation

@HeartSaVioR
Copy link
Contributor

@HeartSaVioR HeartSaVioR commented Apr 9, 2024

What changes were proposed in this pull request?

This PR proposes to disallow using binary inequality collation column in the key schema of stateful operator. Worth noting that changing the collation for the same string column during the query restart was already disallowed at the time of introduction of collation.

Why are the changes needed?

state store API is heavily relying on the fact that provider implementation performs O(1)-like get and put operation. While the actual implementation would be dependent on the state store provider, it is intuitive to assume that these providers only do lookup of the key based on binary format (implying binary equality).

That said, even though the column spec is case insensitive, state store API wouldn't take this into consideration, and could lead to produce the wrong result. e.g. Determiniing 'a' and 'A' differently while the column is case insensitive.

Does this PR introduce any user-facing change?

No, as it wasn't released yet.

How was this patch tested?

New UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

@HeartSaVioR
Copy link
Contributor Author

HeartSaVioR commented Apr 9, 2024

@cloud-fan
Copy link
Contributor

We can have O(1) lookup for string with collation values, using a sort key. WIP PR: #45929

@HeartSaVioR
Copy link
Contributor Author

What about disallowing this first? It's not only streaming aggregation. We will need to do the same approach for every stateful operator. Before we see the clear path forward, let's prevent the correctness issue for now.

storeProvider.getStore(version)
}

private def disallowBinaryInequalityColumn(schema: StructType): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use following utility function:

  def isBinaryStable(dataType: DataType): Boolean = !dataType.existsRecursively {
    case st: StringType => !CollationFactory.fetchCollation(st.collationId).supportsBinaryEquality
    case _ => false
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done!

@dbatomic
Copy link
Contributor

dbatomic commented Apr 9, 2024

What about disallowing this first? It's not only streaming aggregation. We will need to do the same approach for every stateful operator. Before we see the clear path forward, let's prevent the correctness issue for now.

Yeah, let's go with this change. Once hash agg support change goes in we can try to enable the streaming code. Btw, thank you for adding the test! We will use that as the starting point.

@github-actions github-actions bot added the DOCS label Apr 9, 2024
@HeartSaVioR
Copy link
Contributor Author

GA only failed in Docker integration which is unrelated.

@HeartSaVioR
Copy link
Contributor Author

Merging to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants