-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-10914] UnsafeRow serialization breaks when two machines have different Oops size. #9030
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ifferent Oops size.
The problem is that UnsafeRow contains 3 pieces of information when pointing to some data in memory (an object, a base offset, and length). When the row is serialized with Java/Kryo serialization, the object layout in memory can change if two machines have different pointer width (Oops in JVM).
To reproduce, launch Spark using
MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
And then run the following
scala> sql("select 1 xx").collect()
(cherry picked from commit 157b2a818d3993b1321cc41fb7b30407bd13490b)
Signed-off-by: Reynold Xin <[email protected]>
|
cc @davies and @JoshRosen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to revert this
|
LGTM, we have done this for UTF8String already (not support Kryo). @cloud-fan Should we also do this for UnsafeArrayData? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this just call writeExternal(out)?
|
I think we should apply this to unsafe array too. |
|
Test build #1860 has finished for PR 9030 at commit
|
|
Test build #43414 has finished for PR 9030 at commit
|
|
Test build #1861 has finished for PR 9030 at commit
|
|
test this please |
|
Test build #43434 has finished for PR 9030 at commit
|
|
Merging this in master & branch-1.5. |
…ifferent Oops size.
UnsafeRow contains 3 pieces of information when pointing to some data in memory (an object, a base offset, and length). When the row is serialized with Java/Kryo serialization, the object layout in memory can change if two machines have different pointer width (Oops in JVM).
To reproduce, launch Spark using
MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
And then run the following
scala> sql("select 1 xx").collect()
Author: Reynold Xin <[email protected]>
Closes #9030 from rxin/SPARK-10914.
(cherry picked from commit 84ea287)
Signed-off-by: Reynold Xin <[email protected]>
…ines have different Oops size ## What changes were proposed in this pull request? ApproxCountDistinctForIntervals holds the UnsafeArrayData data to initialize endpoints. When the UnsafeArrayData is serialized with Java serialization, the BYTE_ARRAY_OFFSET in memory can change if two machines have different pointer width (Oops in JVM). This PR fixes this issue by using the same way in #9030 ## How was this patch tested? Manual test has been done in our tpcds environment and regarding unit test case has been added as well Closes #24317 from pengbo/SPARK-27406. Authored-by: mingbo_pb <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request? Finish the rest work of #24317, #9030 a. Implement Kryo serialization for UnsafeArrayData b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size c. Move the duplicate code "getBytes()" to Utils. ## How was this patch tested? According Units has been added & tested Closes #24357 from pengbo/SPARK-27416_new. Authored-by: pengbo <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
Finish the rest work of apache#24317, apache#9030 a. Implement Kryo serialization for UnsafeArrayData b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size c. Move the duplicate code "getBytes()" to Utils. According Units has been added & tested Closes apache#24357 from pengbo/SPARK-27416_new. Authored-by: pengbo <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…erialization … This is a Spark 2.4.x backport of #24357 by pengbo. Original description follows below: --- ## What changes were proposed in this pull request? Finish the rest work of #24317, #9030 a. Implement Kryo serialization for UnsafeArrayData b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size c. Move the duplicate code "getBytes()" to Utils. ## How was this patch tested? According Units has been added & tested Closes #25223 from JoshRosen/SPARK-27416-2.4. Authored-by: pengbo <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…erialization … This is a Spark 2.4.x backport of apache#24357 by pengbo. Original description follows below: --- ## What changes were proposed in this pull request? Finish the rest work of apache#24317, apache#9030 a. Implement Kryo serialization for UnsafeArrayData b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size c. Move the duplicate code "getBytes()" to Utils. ## How was this patch tested? According Units has been added & tested Closes apache#25223 from JoshRosen/SPARK-27416-2.4. Authored-by: pengbo <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…erialization … This is a Spark 2.4.x backport of apache#24357 by pengbo. Original description follows below: --- ## What changes were proposed in this pull request? Finish the rest work of apache#24317, apache#9030 a. Implement Kryo serialization for UnsafeArrayData b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size c. Move the duplicate code "getBytes()" to Utils. ## How was this patch tested? According Units has been added & tested Closes apache#25223 from JoshRosen/SPARK-27416-2.4. Authored-by: pengbo <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…erialization This is a Spark 2.3.x backport of a 2.4.x backport of apache#24357 by pengbo. Original description follows below: --- Finish the rest work of apache#24317, apache#9030 a. Implement Kryo serialization for UnsafeArrayData b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size c. Move the duplicate code "getBytes()" to Utils. According Units has been added & tested Authored-by: pengbo <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> Signed-off-by: Jason Altekruse <[email protected]>
UnsafeRow contains 3 pieces of information when pointing to some data in memory (an object, a base offset, and length). When the row is serialized with Java/Kryo serialization, the object layout in memory can change if two machines have different pointer width (Oops in JVM).
To reproduce, launch Spark using
MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
And then run the following
scala> sql("select 1 xx").collect()