[SPARK-53332][SS] Enable StateDataSource with state checkpoint v2 (only snapshotStartBatchId option)
### What changes were proposed in this pull request?
This PR enables StateDataSource support with state checkpoint v2 format for the `snapshotStartBatchId` and related options, completing the StateDataSource checkpoint v2 integration.
This PR changes the signature of the `replayStateFromSnapshot` method by adding two optional parameters, `snapshotVersionStateStoreCkptId` and `endVersionStateStoreCkptId`. Both are needed: `snapshotVersionStateStoreCkptId` is used when getting the snapshot, and `endVersionStateStoreCkptId` is used for calculating the full lineage from the final version.
Before
```
def replayStateFromSnapshot(
    snapshotVersion: Long, endVersion: Long, readOnly: Boolean = false): StateStore
```
After
```
def replayStateFromSnapshot(
    snapshotVersion: Long,
    endVersion: Long,
    readOnly: Boolean = false,
    snapshotVersionStateStoreCkptId: Option[String] = None,
    endVersionStateStoreCkptId: Option[String] = None): StateStore
```
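Because both new parameters default to `None`, existing call sites that pass only the versions should keep compiling. The following is a minimal, self-contained sketch of that behavior. `ReplayPlan` and `ReplaySignatureSketch` are hypothetical stand-ins invented for illustration; the real method returns a `StateStore` and lives in Spark's state store provider code.

```scala
// Hypothetical stand-in for the real return type (StateStore); illustration only.
final case class ReplayPlan(
    snapshotVersion: Long,
    endVersion: Long,
    readOnly: Boolean,
    snapshotCkptId: Option[String],
    endCkptId: Option[String])

object ReplaySignatureSketch {
  // Same parameter list as the "After" signature; the body is an assumption
  // that only records the arguments instead of loading any state.
  def replayStateFromSnapshot(
      snapshotVersion: Long,
      endVersion: Long,
      readOnly: Boolean = false,
      snapshotVersionStateStoreCkptId: Option[String] = None,
      endVersionStateStoreCkptId: Option[String] = None): ReplayPlan = {
    // Assumed sanity check: a snapshot cannot be newer than the target version.
    require(snapshotVersion <= endVersion, "snapshotVersion must be <= endVersion")
    ReplayPlan(
      snapshotVersion,
      endVersion,
      readOnly,
      snapshotVersionStateStoreCkptId,
      endVersionStateStoreCkptId)
  }
}

// Checkpoint v1 style call: no checkpoint IDs, both default to None.
val v1Call = ReplaySignatureSketch.replayStateFromSnapshot(3L, 5L)

// Checkpoint v2 style call: IDs passed by name (the ID strings are made up).
val v2Call = ReplaySignatureSketch.replayStateFromSnapshot(
  snapshotVersion = 3L,
  endVersion = 5L,
  readOnly = true,
  snapshotVersionStateStoreCkptId = Some("snapshot-ckpt-id"),
  endVersionStateStoreCkptId = Some("end-ckpt-id"))
```

Passing the IDs as named arguments keeps the two `Option[String]` values from being swapped by accident, since they have the same type.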
This is the final PR in the series following:
- #52047: Enable StateDataSource with state checkpoint v2 (only batchId option)
- #52148: Enable StateDataSource with state checkpoint v2 (only readChangeFeed)
NOTE: Reading checkpoint v2 state data sources requires `"spark.sql.streaming.stateStore.checkpointFormatVersion" -> 2`. It would be possible to read state data sources in whichever format the CommitLog indicates by relaxing the assertion checks, but this is left as a future change.
### Why are the changes needed?
State checkpoint v2 (`"spark.sql.streaming.stateStore.checkpointFormatVersion"`) introduces a new format for storing state metadata that includes unique identifiers in the file path for each state store. The existing StateDataSource implementation only worked with checkpoint v1 format, making it incompatible with streaming queries using the newer checkpoint format. Only `batchId` was implemented in #52047 and only `readChangeFeed` was implemented in #52148.
### Does this PR introduce _any_ user-facing change?
Yes.
The State Data Source now works with checkpoint v2 when the `snapshotStartBatchId` and related options are used.
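As a rough illustration of the user-facing side, the sketch below builds the reader options for a snapshot-based read. The option names `snapshotStartBatchId`, `snapshotPartitionId`, and `batchId` come from the State Data Source documentation; the helper `SnapshotReadOptions`, its validation rules, and the checkpoint path in the comment are assumptions made for this example and are not part of the PR.

```scala
// Hypothetical helper (not part of Spark): assembles State Data Source options
// for reading state by replaying from a specific snapshot.
object SnapshotReadOptions {
  def build(
      snapshotStartBatchId: Long,
      snapshotPartitionId: Int,
      batchId: Long): Map[String, String] = {
    // Assumed checks for this sketch: non-negative IDs, snapshot not after the target batch.
    require(snapshotStartBatchId >= 0, "snapshotStartBatchId must be non-negative")
    require(snapshotPartitionId >= 0, "snapshotPartitionId must be non-negative")
    require(snapshotStartBatchId <= batchId, "snapshotStartBatchId must be <= batchId")
    Map(
      "snapshotStartBatchId" -> snapshotStartBatchId.toString,
      "snapshotPartitionId" -> snapshotPartitionId.toString,
      "batchId" -> batchId.toString)
  }
}

val opts = SnapshotReadOptions.build(
  snapshotStartBatchId = 2L,
  snapshotPartitionId = 0,
  batchId = 5L)

// With a SparkSession `spark` in scope, the options would be used roughly as follows
// (the checkpoint path is a placeholder):
//
//   spark.conf.set("spark.sql.streaming.stateStore.checkpointFormatVersion", "2")
//   val stateDf = spark.read
//     .format("statestore")
//     .options(opts)
//     .load("/path/to/checkpoint")
```

The Spark calls are left as comments so the block runs without a Spark dependency; the configuration key is the one named in the NOTE of this PR description.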
### How was this patch tested?
In the previous PRs, test suites were added to parameterize the current tests with checkpoint v2. All of these tests are now added back. All tests that previously intentionally tested some feature of the State Data Source Reader with checkpoint v1 should now be parameterized with checkpoint v2 (including Python tests).
`RocksDBWithCheckpointV2StateDataSourceReaderSnapshotSuite` is added; it uses the golden file approach, similar to #46944, where `snapshotStartBatchId` was first added.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #52202 from dylanwong250/SPARK-53332.
Authored-by: Dylan Wong <[email protected]>
Signed-off-by: Anish Shrigondekar <[email protected]>
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StreamStreamJoinStatePartitionReader.scala
Lines changed: 34 additions & 10 deletions
@@ -24,7 +24,7 @@ import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, Par
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/operators/stateful/join/SymmetricHashJoinStateManager.scala