@dongjoon-hyun dongjoon-hyun commented Jan 24, 2023

What changes were proposed in this pull request?

This PR aims to register partitioned-table-related classes with KryoSerializer.
Specifically, CREATE TABLE and MSCK REPAIR TABLE use these classes.

Why are the changes needed?

To support partitioned tables more easily with KryoSerializer. Previously, with spark.kryo.registrationRequired=true, these commands failed with errors like the following.

java.lang.IllegalArgumentException: Class is not registered:
org.apache.spark.util.HadoopFSUtils$SerializableBlockLocation
java.lang.IllegalArgumentException: Class is not registered:
org.apache.spark.util.HadoopFSUtils$SerializableFileStatus
java.lang.IllegalArgumentException: Class is not registered:
org.apache.spark.sql.execution.command.PartitionStatistics
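
For readers on a build without this change, a possible user-side workaround is to list the missing classes in spark.kryo.classesToRegister. The sketch below only assembles the settings as plain Scala values and does not start Spark. The object and method names (KryoConfSketch, toConfArgs) are hypothetical. It has not been verified that these three names are sufficient, since the PR diff also registers an array form of SerializableBlockLocation.

```scala
// A minimal sketch, not part of this PR: build Kryo-related Spark settings
// as plain values. KryoConfSketch and toConfArgs are hypothetical names.
object KryoConfSketch {
  // Class names copied from the "Class is not registered" errors above.
  val missing: Seq[String] = Seq(
    "org.apache.spark.util.HadoopFSUtils$SerializableBlockLocation",
    "org.apache.spark.util.HadoopFSUtils$SerializableFileStatus",
    "org.apache.spark.sql.execution.command.PartitionStatistics"
  )

  // spark.kryo.classesToRegister takes a comma-separated list of class names.
  val settings: Map[String, String] = Map(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrationRequired" -> "true",
    "spark.kryo.classesToRegister" -> missing.mkString(",")
  )

  // Render the settings as "--conf key=value" argument pairs, sorted by key.
  def toConfArgs(settings: Map[String, String]): Seq[String] =
    settings.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

When passing such values on a shell command line, the $ in the nested class names would need quoting.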

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs and manual tests.

TEST TABLE

$ tree /tmp/t
/tmp/t
├── p=1
│   └── users.orc
├── p=10
│   └── users.orc
├── p=11
│   └── users.orc
├── p=2
│   └── users.orc
├── p=3
│   └── users.orc
├── p=4
│   └── users.orc
├── p=5
│   └── users.orc
├── p=6
│   └── users.orc
├── p=7
│   └── users.orc
├── p=8
│   └── users.orc
└── p=9
    └── users.orc

CREATE PARTITIONED TABLES AND RECOVER PARTITIONS

$ bin/spark-shell -c spark.kryo.registrationRequired=true -c spark.serializer=org.apache.spark.serializer.KryoSerializer -c spark.sql.sources.parallelPartitionDiscovery.threshold=1

scala> sql("CREATE TABLE t USING ORC LOCATION '/tmp/t'").show()
++
||
++
++


scala> sql("MSCK REPAIR TABLE t").show()
++
||
++
++

@github-actions github-actions bot added the CORE label Jan 24, 2023
@dongjoon-hyun dongjoon-hyun marked this pull request as draft January 24, 2023 02:19
@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review January 24, 2023 02:27
Seq(
"org.apache.spark.util.HadoopFSUtils$SerializableBlockLocation",
"[Lorg.apache.spark.util.HadoopFSUtils$SerializableBlockLocation;",
"org.apache.spark.util.HadoopFSUtils$SerializableFileStatus",
Member Author

Since these are private case classes of HadoopFSUtils, they cannot be referenced with classOf from here, so we need to register them by class name like this.
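
To illustrate the comment above, here is a minimal sketch of resolving private nested classes by their JVM names, using only the standard library. It is not the code from this PR: FsUtilsDemo, Location and RegisterByNameDemo are hypothetical stand-ins, and the actual registration call on Kryo is left out. The array name uses the JVM form "[L<class name>;", the same shape as the second string in the diff.

```scala
// Hypothetical stand-in for an object that holds a private nested case class,
// similar in shape to HadoopFSUtils and its Serializable* classes.
object FsUtilsDemo {
  private case class Location(host: String)

  // The private class cannot be named from outside, so expose only its JVM name
  // (the outer name, then "$", then the nested class name).
  val locationClassName: String = classOf[Location].getName
}

object RegisterByNameDemo {
  // The class itself, plus the one-dimensional array of it: "[L<name>;".
  val names: Seq[String] = Seq(
    FsUtilsDemo.locationClassName,
    s"[L${FsUtilsDemo.locationClassName};"
  )

  // Resolve each name reflectively; names that cannot be loaded are skipped.
  def resolve(names: Seq[String]): Seq[Class[_]] =
    names.flatMap { n =>
      try Some(Class.forName(n))
      catch { case _: ClassNotFoundException => None }
    }
}
```

Each resolved Class value could then be handed to a serializer's registration method. Working from strings avoids a compile-time reference to the private type.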

@dongjoon-hyun
Member Author

Could you review this, @viirya ?

@dongjoon-hyun
Member Author

Thank you, @viirya ! Merged to master.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-42164 branch January 24, 2023 05:39