[SPARK-48742][SS] Virtual Column Family for RocksDB #47107
Conversation
```scala
val prefix = new Array[Byte](
  prefixKeyEncoded.length + 4 + offSetForColFamilyPrefix)
if (hasVirtualColFamilyPrefix) {
  Platform.putLong(prefix, Platform.BYTE_ARRAY_OFFSET, colFamilyId)
```
Short might be enough? 2 bytes is probably good enough for a 0-indexed number of column families.
Use `Short` instead of `Long` as the VCF id.
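For illustration, a minimal sketch of the agreed change, assuming a `VIRTUAL_COL_FAMILY_PREFIX_BYTES = 2` constant (names mirror the quoted snippet; this is not the exact PR code):

```scala
import org.apache.spark.unsafe.Platform

// Assumed constant: a virtual column family id occupies 2 bytes.
val VIRTUAL_COL_FAMILY_PREFIX_BYTES: Int = 2

// Write the 2-byte VCF id at the start of a pre-allocated key buffer.
// A Short covers up to 32767 ids, plenty for a 0-indexed count of
// column families per store, at a quarter of the per-key cost of a Long.
def putVcfPrefix(prefix: Array[Byte], colFamilyId: Short): Unit = {
  Platform.putShort(prefix, Platform.BYTE_ARRAY_OFFSET, colFamilyId)
}
```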
```diff
 }

-// Maintain mapping of column family name to handle
+// Maintain a set of column family name
```
Also mention where/why this is needed/used?
Discussed offline - we can probably remove this set entirely and move the relevant stuff to the provider layer.
```scala
  }
}

// TODO SPARK-48796 after restart state id will not be the same
```
This is the same as the situation in the integration test in TransformWithStateSuite. We cannot re-load the same col family id for the same col family.
Will be fixed once the state schema related changes are merged. Created a task for fixing this: https://issues.apache.org/jira/browse/SPARK-48796
yea thanks
anishshri-db
left a comment
LGTM
HeartSaVioR
left a comment
Only comments left on sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreSuite.scala.
```scala
def encodePrefixKey(prefixKey: UnsafeRow, vcfId: Option[Short]): Array[Byte]
def encodeKey(row: UnsafeRow, vcfId: Option[Short]): Array[Byte]
def decodeKey(keyBytes: Array[Byte]): UnsafeRow
def offSetForColFamilyPrefix: Int
```
nit: offset, not offSet.
Btw, I'd propose two things:
- encode(Prefix)Key and decodeKey are no longer symmetric: encode(Prefix)Key deals with vcfId, but in decodeKey there is no way for us to get the vcfId back. I see it could be less performant (though I feel like it doesn't matter much) if we always have to read the vcfId part and return it even when the caller already knows the vcfId. But it's probably better to have a new method whose name explicitly calls out that it skips the vcfId and assumes the caller already knows it.
- It looks like the column family prefix is applied across all the encoder implementations, which I think we can abstract better. This may be a good time to have a base (abstract) implementation of RocksDBKeyStateEncoder that handles the column family prefix.

Something like the following:
```scala
abstract class RocksDBKeyStateEncoderBase(useColumnFamilies: Boolean)
  extends RocksDBKeyStateEncoder {

  protected def encodeColumnFamilyPrefix(
      numBytes: Int,
      vcfId: Option[Short],
      useColumnFamilies: Boolean): (Array[Byte], Int) = {
    if (useColumnFamilies) {
      val encodedBytes = new Array[Byte](numBytes + VIRTUAL_COL_FAMILY_PREFIX_BYTES)
      Platform.putShort(encodedBytes, Platform.BYTE_ARRAY_OFFSET, vcfId.get)
      (encodedBytes, Platform.BYTE_ARRAY_OFFSET + VIRTUAL_COL_FAMILY_PREFIX_BYTES)
    } else {
      val encodedBytes = new Array[Byte](numBytes)
      (encodedBytes, Platform.BYTE_ARRAY_OFFSET)
    }
  }

  protected def decodeColumnFamilyPrefix(keyBytes: Array[Byte]): (Option[Short], Int) = {
    if (useColumnFamilies) {
      val vcfId = Platform.getShort(keyBytes, Platform.BYTE_ARRAY_OFFSET)
      (Some(vcfId), Platform.BYTE_ARRAY_OFFSET + VIRTUAL_COL_FAMILY_PREFIX_BYTES)
    } else {
      (None, Platform.BYTE_ARRAY_OFFSET)
    }
  }

  // Only if we want to skip over reading CF prefix...
  protected def decodeKeyStartOffset: Int = {
    if (useColumnFamilies) {
      Platform.BYTE_ARRAY_OFFSET + VIRTUAL_COL_FAMILY_PREFIX_BYTES
    } else Platform.BYTE_ARRAY_OFFSET
  }
}
```
(The method names can change to whatever is preferred; I'm not talented with naming.)
That said, I'll review the changes in the encoder file once the proposal is reflected, or once we decide not to do so.
Also the proposal can be combined with this proposal:
Refactored into RocksDBKeyStateEncoderBase. Added the virtual column family id as an input parameter; the API for RocksDBKeyStateEncoder.encodeKey/decodeKey remains unchanged, as we now only pass in the vcfId during encoder initialization.
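A hypothetical sketch of that shape, building on the `RocksDBKeyStateEncoderBase` proposal above (the class name, `numFields` parameter, and trait surface here are illustrative, not the exact PR code):

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.unsafe.Platform

// Illustrative encoder: the vcfId is bound once at construction time, so
// encodeKey/decodeKey keep their pre-PR signatures.
class ExampleNoPrefixKeyStateEncoder(
    numFields: Int,
    useColumnFamilies: Boolean,
    vcfId: Option[Short])
  extends RocksDBKeyStateEncoderBase(useColumnFamilies) {

  def encodeKey(row: UnsafeRow): Array[Byte] = {
    val rowBytes = row.getBytes
    // The base class allocates the buffer and writes the 2-byte prefix if needed.
    val (encoded, start) =
      encodeColumnFamilyPrefix(rowBytes.length, vcfId, useColumnFamilies)
    Platform.copyMemory(
      rowBytes, Platform.BYTE_ARRAY_OFFSET, encoded, start, rowBytes.length)
    encoded
  }

  def decodeKey(keyBytes: Array[Byte]): UnsafeRow = {
    // decodeKeyStartOffset skips the prefix without re-reading the vcfId.
    val dataLen = keyBytes.length - (decodeKeyStartOffset - Platform.BYTE_ARRAY_OFFSET)
    val row = new UnsafeRow(numFields)
    row.pointTo(keyBytes, decodeKeyStartOffset, dataLen)
    row
  }
}
```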
```scala
  (RocksDBKeyStateEncoder, RocksDBValueStateEncoder)]

private val colFamilyNameToIdMap = new java.util.concurrent.ConcurrentHashMap[String, Short]
// TODO SPARK-48796 load column family id from state schema when restarting
```
This sounds like we can't release before addressing SPARK-48796 - do I understand correctly? If so, I'd need to mark the ticket as a blocker for Spark 4.0.0.
Also do we have a path forward on storing column family id? Will SPARK-48796 address this altogether?
Yes, correct - this will be handled in that ticket, where we store the vcf id within the new state schema format.
```scala
def getVcfIdBytes(id: Short): Array[Byte] = {
  val encodedBytes = new Array[Byte](VIRTUAL_COL_FAMILY_PREFIX_BYTES)
  Platform.putShort(encodedBytes, Platform.BYTE_ARRAY_OFFSET, id)
```
We can just consider that the API spec of Platform: get/put require a starting offset, and for reading from a byte array they require Platform.BYTE_ARRAY_OFFSET as the starting offset.
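For illustration, a round-trip under that spec (plain Platform usage, nothing PR-specific):

```scala
import org.apache.spark.unsafe.Platform

// Platform addresses byte arrays from BYTE_ARRAY_OFFSET (the JVM's array-data
// offset as seen by Unsafe), not from index 0.
val buf = new Array[Byte](2)
Platform.putShort(buf, Platform.BYTE_ARRAY_OFFSET, 7.toShort)
assert(Platform.getShort(buf, Platform.BYTE_ARRAY_OFFSET) == 7)
```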
```scala
 * key, and the operator should not call prefixScan method in StateStore.
 * @param useColumnFamilies Whether the underlying state store uses a single or multiple column
 *                          families; by default we'll use virtual column family if this parameter
```
I feel like we shouldn't mention this - it's an implementation detail, and even specific to the RocksDB implementation. We define the "interface" here, and it's up to the provider implementation how to deal with column families.
```scala
  }
}

/* Column family related tests */
```
Are the tests added here copied from RocksDBSuite? If so, it'd be a great help for reviewers if you went through and commented wherever there is a difference worth looking at.
Added a few comments below to denote which suites are removed entirely and which are moved into RocksDBStateStoreSuite.

Also @jingz-db - the test failure seems related?
```scala
  }
}

testWithColumnFamilies(s"RocksDB: column family creation with invalid names",
```
These tests are removed from RocksDBSuite and moved into RocksDBStateStoreSuite because the verification functions were moved from RocksDB to RocksDBStateStoreProvider.
```scala
  }
}

testWithChangelogCheckpointingEnabled(
```
This suite is removed because the column family is no longer part of the input parameters in changelog v2.
```scala
}

/* Column family related tests */
testWithColumnFamilies("column family creation with invalid names",
```
The following suites are moved from RocksDBSuite. They are mostly the same as before; the only difference is that we test at the provider layer instead of the RocksDB instance layer.
```scala
  }
}

Seq(
```
The following suites are newly added.
We don't add any new suite for testing with the RangeScanEncoder because there are already lots of existing suites for it in RocksDBStateStoreSuite, while we don't have any existing NoPrefix- and PrefixScan-related suites with non-default column families.
HeartSaVioR
left a comment
Only minors, thanks for the patience!
```scala
  prefixKeyEncoded, Platform.BYTE_ARRAY_OFFSET, prefixKeyEncodedLen)

// Here we calculate the remainingKeyEncodedLen leveraging the length of keyBytes
val remainingKeyEncodedLen = keyBytes.length - 4 - prefixKeyEncodedLen -
```
This can also be calculated based on decodeKeyStartOffset: decodeKeyStartOffset + 4 + prefixKeyEncodedLen = the starting offset of the remaining key as encoded.
The reason we abstract the start offset for both encode and decode is to let the subclasses avoid dealing with the column family prefix directly.
Thanks for taking a close look :)
IIUC, the prefix key handling is part of the original implementation of PrefixKeyScanStateEncoder.decode (prefixKeyEncodedLen is the length of the key's own prefix, not the virtual column family prefix). So the virtual column family prefix is already dealt with in decodeKeyStartOffset.
Though for decodeKey, I found it hard to rid the subclass of dealing with the col family prefix completely - e.g. for remainingKeyEncodedLen here, we still need to subtract the length of the column family prefix.
Are you assuming the case of Platform.BYTE_ARRAY_OFFSET != 0 (on some random platform), hence the length of the byte array would be misaligned if we took Platform.BYTE_ARRAY_OFFSET into account? Awesome thought if intended :) Great attention to detail.
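To summarize the arithmetic in this thread, a sketch assuming the key layout [optional 2-byte vcf id][4-byte prefix length][prefix key][remaining key] and the decodeKeyStartOffset helper proposed earlier (the helper function itself is hypothetical):

```scala
import org.apache.spark.unsafe.Platform

// Derive the remaining-key length from decodeKeyStartOffset instead of
// subtracting the column family prefix width by hand. keyBytes.length is a
// plain array length, while the offsets include Platform.BYTE_ARRAY_OFFSET,
// so we convert back to an array index before subtracting.
def remainingKeyEncodedLen(
    keyBytes: Array[Byte],
    decodeKeyStartOffset: Int,
    prefixKeyEncodedLen: Int): Int = {
  val remainingKeyStart = decodeKeyStartOffset + 4 + prefixKeyEncodedLen
  keyBytes.length - (remainingKeyStart - Platform.BYTE_ARRAY_OFFSET)
}
```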
Thanks! Merging to master.
### What changes were proposed in this pull request?
Introducing a virtual column family to RocksDB. We attach a 2-byte id prefix as the column family identifier to each key row that is put into RocksDB. The encoding and decoding of the virtual column family prefix happens at the `RocksDBKeyEncoder` layer, as we can pre-allocate the extra 2 bytes and avoid an additional memcpy.
- Remove the physical column family related code, as it becomes potentially dead code until some caller starts using it.
- Remove `useColumnFamilies` from the `StateStoreChangelogV2` API.

### Why are the changes needed?
Currently, within the scope of the arbitrary stateful API v2 (transformWithState) project, each state variable is stored inside one [physical column family](https://github.com/facebook/rocksdb/wiki/Column-Families) within the RocksDB state store instance. Column families are also used to implement secondary indexes for various features. Each physical column family has its own memtables, creates its own SST files, and handles compaction independently on those independent SST files.

When the number of operations to RocksDB is relatively small and the number of column families is relatively large, the overhead of handling small SST files becomes high, especially since all of these have to be uploaded into the snapshot dir and referenced in the metadata file for the uploaded RocksDB snapshot. Using a prefix to manage different key spaces / virtual column families can reduce such overheads.

### Does this PR introduce _any_ user-facing change?
No. If `useColumnFamilies` is set to true in `StateStore.init()`, the virtual column family will be used.

### How was this patch tested?
Unit tests in `RocksDBStateStoreSuite`, and integration tests in `TransformWithStateSuite`. Moved test suites in `RocksDBSuite` into `RocksDBStateStoreSuite` because some previous verification functions are now in `RocksDBStateStoreProvider`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47107 from jingz-db/virtual-col-family.

Lead-authored-by: jingz-db <[email protected]>
Co-authored-by: Jing Zhan <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
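As an aside, a hypothetical illustration of why a fixed-width id prefix keeps per-family scans cheap: all keys of one virtual column family are contiguous in the keyspace, so a family scan is an iterator bounded by [id, id + 1). Big-endian encoding is assumed here purely for the illustration; it is not necessarily what the PR's encoder emits:

```scala
import java.nio.ByteBuffer

// Big-endian scan bounds covering every key of virtual column family `id`.
// Note: (id + 1) wraps at Short.MaxValue; a real implementation would treat
// the last id as an open upper bound instead.
def vcfScanBounds(id: Short): (Array[Byte], Array[Byte]) = {
  val lower = ByteBuffer.allocate(2).putShort(id).array()
  val upper = ByteBuffer.allocate(2).putShort((id + 1).toShort).array()
  (lower, upper)
}
```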
…g to use the correct number of version bytes

### What changes were proposed in this pull request?
There are currently two bugs:
- The NoPrefixKeyStateEncoder adds an extra version byte to each row when UnsafeRow encoding is used: #47107
- Rows written with Avro encoding do not include a version byte: #48401

**Neither of these bugs has been released, since they are only triggered with multiple column families, and only transformWithState uses those, which is going to be released in Spark 4.0.0.** This change fixes both of these bugs.

### Why are the changes needed?
These changes are needed in order to conform with the expected state row encoding format.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #49996 from ericm-db/SPARK-51249.

Lead-authored-by: Eric Marnadi <[email protected]>
Co-authored-by: Eric Marnadi <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>