Commit 7546b44

[SPARK-42164][CORE] Register partitioned-table-related classes to KryoSerializer
### What changes were proposed in this pull request?

This PR aims to register partitioned-table-related classes to `KryoSerializer`. Specifically, `CREATE TABLE` and `MSCK REPAIR TABLE` use these classes.

### Why are the changes needed?

To support partitioned tables more easily with `KryoSerializer`. Previously, it failed like the following:

```
java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.util.HadoopFSUtils$SerializableBlockLocation
```

```
java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.util.HadoopFSUtils$SerializableFileStatus
```

```
java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.execution.command.PartitionStatistics
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs and manual tests.

**TEST TABLE**

```
$ tree /tmp/t
/tmp/t
├── p=1
│   └── users.orc
├── p=10
│   └── users.orc
├── p=11
│   └── users.orc
├── p=2
│   └── users.orc
├── p=3
│   └── users.orc
├── p=4
│   └── users.orc
├── p=5
│   └── users.orc
├── p=6
│   └── users.orc
├── p=7
│   └── users.orc
├── p=8
│   └── users.orc
└── p=9
    └── users.orc
```

**CREATE PARTITIONED TABLES AND RECOVER PARTITIONS**

```
$ bin/spark-shell -c spark.kryo.registrationRequired=true -c spark.serializer=org.apache.spark.serializer.KryoSerializer -c spark.sql.sources.parallelPartitionDiscovery.threshold=1

scala> sql("CREATE TABLE t USING ORC LOCATION '/tmp/t'").show()
++
||
++
++

scala> sql("MSCK REPAIR TABLE t").show()
++
||
++
++
```

Closes apache#39713 from dongjoon-hyun/SPARK-42164.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
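On Spark releases without this fix, the errors above can be worked around by registering the affected classes explicitly. A sketch for `spark-defaults.conf`, assuming Spark's standard `spark.kryo.classesToRegister` option (the class list simply mirrors the error messages above):

```
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired  true
spark.kryo.classesToRegister     org.apache.spark.util.HadoopFSUtils$SerializableBlockLocation,org.apache.spark.util.HadoopFSUtils$SerializableFileStatus,org.apache.spark.sql.execution.command.PartitionStatistics
```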
1 parent 934d14d commit 7546b44

File tree

1 file changed (+5, −0 lines)


core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala

Lines changed: 5 additions & 0 deletions
```diff
@@ -510,6 +510,10 @@ private[serializer] object KryoSerializer {
   // SQL / ML / MLlib classes once and then re-use that filtered list in newInstance() calls.
   private lazy val loadableSparkClasses: Seq[Class[_]] = {
     Seq(
+      "org.apache.spark.util.HadoopFSUtils$SerializableBlockLocation",
+      "[Lorg.apache.spark.util.HadoopFSUtils$SerializableBlockLocation;",
+      "org.apache.spark.util.HadoopFSUtils$SerializableFileStatus",
+      "[Lorg.apache.spark.util.HadoopFSUtils$SerializableFileStatus;",
       "org.apache.spark.sql.catalyst.expressions.BoundReference",
       "org.apache.spark.sql.catalyst.expressions.SortOrder",
       "[Lorg.apache.spark.sql.catalyst.expressions.SortOrder;",
@@ -536,6 +540,7 @@ private[serializer] object KryoSerializer {
       "org.apache.spark.sql.types.DecimalType",
       "org.apache.spark.sql.types.Decimal$DecimalAsIfIntegral$",
       "org.apache.spark.sql.types.Decimal$DecimalIsFractional$",
+      "org.apache.spark.sql.execution.command.PartitionStatistics",
       "org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult",
       "org.apache.spark.sql.execution.joins.EmptyHashedRelation$",
       "org.apache.spark.sql.execution.joins.LongHashedRelation",
```
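The entries above go into `loadableSparkClasses`, which `core` resolves reflectively so it does not hard-depend on the SQL module; names absent from the classpath are skipped. A minimal sketch of that pattern (the `LoadableClasses` object here is hypothetical, not Spark's actual helper):

```scala
// Hypothetical sketch of reflective class-name resolution: look up each
// name with Class.forName and silently skip names not on the classpath.
// JVM array notation such as "[Ljava.lang.String;" resolves to the array
// class, which is why the diff registers both "Foo" and "[LFoo;".
object LoadableClasses {
  def loadable(names: Seq[String]): Seq[Class[_]] =
    names.flatMap { name =>
      try Some(Class.forName(name))
      catch { case _: ClassNotFoundException => None }
    }
}
```

Kryo treats an array type as a distinct class from its element type, so registering `org.apache.spark.util.HadoopFSUtils$SerializableBlockLocation` alone would not cover arrays of it; the `[L...;` entry handles that case.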
