-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[Spark-27406][SQL]UnsafeArrayData serialization breaks when two machines have different Oops size #24317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@rxin Can you please have a look? |
|
Same thing is required to handle for UnsafeMapData |
| */ | ||
|
|
||
| public final class UnsafeArrayData extends ArrayData { | ||
| public final class UnsafeArrayData extends ArrayData implements Externalizable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implement KryoSerializable also
|
can we create a common trait/utils to do serialization for all these 3 unsafe data structures (row, array, map)? |
|
@cloud-fan sounds reasonable. However, row/array/map structs are slightly different which needs different way to serialize. getBytes() may be put in the "UnsafeSerializationHelper/Utils". Do you have a better idea? Your comments will be appreciated. |
|
ok to test |
|
let's merge it first and think about code refactor later. |
|
Test build #104417 has finished for PR 24317 at commit
|
|
thanks, merging to master! |
|
can you send another PR to 2.4? it conflicts. Thanks! |
|
Okay, It will be done tonight. |
This PR is the branch-2.4 version for #24317 Closes #24324 from pengbo/SPARK-27406-branch-2.4. Authored-by: mingbo_pb <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request? Finish the rest work of #24317, #9030 a. Implement Kryo serialization for UnsafeArrayData b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size c. Move the duplicate code "getBytes()" to Utils. ## How was this patch tested? According Units has been added & tested Closes #24357 from pengbo/SPARK-27416_new. Authored-by: pengbo <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
Finish the rest work of apache#24317, apache#9030 a. Implement Kryo serialization for UnsafeArrayData b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size c. Move the duplicate code "getBytes()" to Utils. According Units has been added & tested Closes apache#24357 from pengbo/SPARK-27416_new. Authored-by: pengbo <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…erialization … This is a Spark 2.4.x backport of #24357 by pengbo. Original description follows below: --- ## What changes were proposed in this pull request? Finish the rest work of #24317, #9030 a. Implement Kryo serialization for UnsafeArrayData b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size c. Move the duplicate code "getBytes()" to Utils. ## How was this patch tested? According Units has been added & tested Closes #25223 from JoshRosen/SPARK-27416-2.4. Authored-by: pengbo <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
This PR is the branch-2.4 version for apache#24317 Closes apache#24324 from pengbo/SPARK-27406-branch-2.4. Authored-by: mingbo_pb <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
This PR is the branch-2.4 version for apache#24317 Closes apache#24324 from pengbo/SPARK-27406-branch-2.4. Authored-by: mingbo_pb <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
This PR is the branch-2.4 version for apache#24317 Closes apache#24324 from pengbo/SPARK-27406-branch-2.4. Authored-by: mingbo_pb <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
This PR is the branch-2.4 version for apache/spark#24317 Closes #24324 from pengbo/SPARK-27406-branch-2.4. Authored-by: mingbo_pb <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 3352803)
…erialization … This is a Spark 2.4.x backport of apache#24357 by pengbo. Original description follows below: --- ## What changes were proposed in this pull request? Finish the rest work of apache#24317, apache#9030 a. Implement Kryo serialization for UnsafeArrayData b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size c. Move the duplicate code "getBytes()" to Utils. ## How was this patch tested? According Units has been added & tested Closes apache#25223 from JoshRosen/SPARK-27416-2.4. Authored-by: pengbo <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…erialization … This is a Spark 2.4.x backport of apache#24357 by pengbo. Original description follows below: --- ## What changes were proposed in this pull request? Finish the rest work of apache#24317, apache#9030 a. Implement Kryo serialization for UnsafeArrayData b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size c. Move the duplicate code "getBytes()" to Utils. ## How was this patch tested? According Units has been added & tested Closes apache#25223 from JoshRosen/SPARK-27416-2.4. Authored-by: pengbo <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…erialization This is a Spark 2.3.x backport of a 2.4.x backport of apache#24357 by pengbo. Original description follows below: --- Finish the rest work of apache#24317, apache#9030 a. Implement Kryo serialization for UnsafeArrayData b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size c. Move the duplicate code "getBytes()" to Utils. According Units has been added & tested Authored-by: pengbo <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> Signed-off-by: Jason Altekruse <[email protected]>
This PR is the branch-2.4 version for apache/spark#24317 Closes #24324 from pengbo/SPARK-27406-branch-2.4. Authored-by: mingbo_pb <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
ApproxCountDistinctForIntervals holds the UnsafeArrayData data to initialize endpoints. When the UnsafeArrayData is serialized with Java serialization, the BYTE_ARRAY_OFFSET in memory can change if two machines have different pointer width (Oops in JVM).
This PR fixes this issue by using the same way in #9030
How was this patch tested?
Manual test has been done in our tpcds environment and regarding unit test case has been added as well