Skip to content

Conversation

@pengbo
Copy link
Contributor

@pengbo pengbo commented Apr 8, 2019

What changes were proposed in this pull request?

ApproxCountDistinctForIntervals holds the UnsafeArrayData data to initialize endpoints. When the UnsafeArrayData is serialized with Java serialization, the BYTE_ARRAY_OFFSET in memory can change if two machines have different pointer width (Oops in JVM).

This PR fixes this issue by using the same way in #9030

How was this patch tested?

Manual test has been done in our tpcds environment and regarding unit test case has been added as well

@pengbo
Copy link
Contributor Author

pengbo commented Apr 8, 2019

@rxin Can you please have a look?

@pengbo pengbo changed the title [Spark 27406][SQL]UnsafeArrayData serialization breaks when two machines have different Oops size [Spark-27406][SQL]UnsafeArrayData serialization breaks when two machines have different Oops size Apr 8, 2019
@sandeep-katta
Copy link
Contributor

Same thing is required to handle for UnsafeMapData

*/

public final class UnsafeArrayData extends ArrayData {
public final class UnsafeArrayData extends ArrayData implements Externalizable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implement KryoSerializable also

@sandeep-katta
Copy link
Contributor

cc @cloud-fan , @srowen @HyukjinKwon @rxin

@cloud-fan
Copy link
Contributor

cloud-fan commented Apr 8, 2019

can we create a common trait/utils to do serialization for all these 3 unsafe data structures (row, array, map)?

@pengbo
Copy link
Contributor Author

pengbo commented Apr 8, 2019

@cloud-fan sounds reasonable.

However, row/array/map structs are slightly different which needs different way to serialize. getBytes() may be put in the "UnsafeSerializationHelper/Utils".

Do you have a better idea? Your comments will be appreciated.

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Apr 9, 2019

Maybe reconsidering #24050 or picking up the approach #24050 to here and minimize the boundary of area (like restricting to getBytes())?

@cloud-fan
Copy link
Contributor

ok to test

@cloud-fan
Copy link
Contributor

let's merge it first and think about code refactor later.

@SparkQA
Copy link

SparkQA commented Apr 9, 2019

Test build #104417 has finished for PR 24317 at commit faec817.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class FileScanBuilder(

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan
Copy link
Contributor

can you send another PR to 2.4? it conflicts. Thanks!

@cloud-fan cloud-fan closed this in 3e4cfe9 Apr 9, 2019
@pengbo
Copy link
Contributor Author

pengbo commented Apr 9, 2019

Okay, It will be done tonight.
I will create another PR for UnsafeMapData and refactoring later this week.

cloud-fan pushed a commit that referenced this pull request Apr 10, 2019
This PR is the branch-2.4 version for #24317

Closes #24324 from pengbo/SPARK-27406-branch-2.4.

Authored-by: mingbo_pb <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Apr 17, 2019
## What changes were proposed in this pull request?
Finish the rest work of #24317, #9030
a. Implement Kryo serialization for UnsafeArrayData
b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size
c. Move the duplicate code "getBytes()" to Utils.

## How was this patch tested?
According Units has been added & tested

Closes #24357 from pengbo/SPARK-27416_new.

Authored-by: pengbo <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
JoshRosen pushed a commit to JoshRosen/spark that referenced this pull request Jul 22, 2019
Finish the rest work of apache#24317, apache#9030
a. Implement Kryo serialization for UnsafeArrayData
b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size
c. Move the duplicate code "getBytes()" to Utils.

According Units has been added & tested

Closes apache#24357 from pengbo/SPARK-27416_new.

Authored-by: pengbo <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Jul 22, 2019
…erialization …

This is a Spark 2.4.x backport of #24357 by pengbo. Original description follows below:

---

## What changes were proposed in this pull request?
Finish the rest work of #24317, #9030
a. Implement Kryo serialization for UnsafeArrayData
b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size
c. Move the duplicate code "getBytes()" to Utils.

## How was this patch tested?
According Units has been added & tested

Closes #25223 from JoshRosen/SPARK-27416-2.4.

Authored-by: pengbo <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
This PR is the branch-2.4 version for apache#24317

Closes apache#24324 from pengbo/SPARK-27406-branch-2.4.

Authored-by: mingbo_pb <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019
This PR is the branch-2.4 version for apache#24317

Closes apache#24324 from pengbo/SPARK-27406-branch-2.4.

Authored-by: mingbo_pb <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
This PR is the branch-2.4 version for apache#24317

Closes apache#24324 from pengbo/SPARK-27406-branch-2.4.

Authored-by: mingbo_pb <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
zhongjinhan pushed a commit to zhongjinhan/spark-1 that referenced this pull request Sep 3, 2019
This PR is the branch-2.4 version for apache/spark#24317

Closes #24324 from pengbo/SPARK-27406-branch-2.4.

Authored-by: mingbo_pb <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 3352803)
rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
…erialization …

This is a Spark 2.4.x backport of apache#24357 by pengbo. Original description follows below:

---

## What changes were proposed in this pull request?
Finish the rest work of apache#24317, apache#9030
a. Implement Kryo serialization for UnsafeArrayData
b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size
c. Move the duplicate code "getBytes()" to Utils.

## How was this patch tested?
According Units has been added & tested

Closes apache#25223 from JoshRosen/SPARK-27416-2.4.

Authored-by: pengbo <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019
…erialization …

This is a Spark 2.4.x backport of apache#24357 by pengbo. Original description follows below:

---

## What changes were proposed in this pull request?
Finish the rest work of apache#24317, apache#9030
a. Implement Kryo serialization for UnsafeArrayData
b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size
c. Move the duplicate code "getBytes()" to Utils.

## How was this patch tested?
According Units has been added & tested

Closes apache#25223 from JoshRosen/SPARK-27416-2.4.

Authored-by: pengbo <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
jaltekruse added a commit to jaltekruse/spark that referenced this pull request Dec 10, 2019
…erialization

This is a Spark 2.3.x backport of a 2.4.x backport of apache#24357 by pengbo. Original description follows below:

---

Finish the rest work of apache#24317, apache#9030
a. Implement Kryo serialization for UnsafeArrayData
b. fix UnsafeMapData Java/Kryo Serialization issue when two machines have different Oops size
c. Move the duplicate code "getBytes()" to Utils.

According Units has been added & tested

Authored-by: pengbo <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

Signed-off-by: Jason Altekruse <[email protected]>
yoock pushed a commit to yoock/spark-apache that referenced this pull request Jan 14, 2020
This PR is the branch-2.4 version for apache/spark#24317

Closes #24324 from pengbo/SPARK-27406-branch-2.4.

Authored-by: mingbo_pb <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants