Skip to content

Commit 4ffc27c

Browse files
committed
[SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for interoperability and backwards-compatibility
This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support. And this one fixes the read path. Now Spark SQL is expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]). ### Major changes 1. `CatalystConverter` class hierarchy refactoring - Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`. Now instead of extending the original `CatalystConverter` trait, every converter class accepts an updater which is responsible for propagating the converted value to some parent container. For example, appending array elements to a parent array buffer, appending a key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row. Root converter doesn't have a parent and thus uses a `NoopUpdater`. This simplifies the design since converters don't need to care about details of their parent converters anymore. - Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter` Specifically, now all row objects are represented by `SpecificMutableRow` during conversion. - Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter` `CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal. The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way. - Implements backwards-compatibility rules in `CatalystArrayConverter` When Parquet records are being converted, schema of Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`. 2. Requested columns handling When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` which contains all requested columns. This is not preferable when taking compatibility and interoperability into consideration. Because the actual Parquet file may have different physical structure from the converted schema. In this PR, the schema for requested columns is constructed using the following method: - For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column. - For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`. - Unions all single-field `MessageType`s into a full schema containing all requested fields With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-files. ### Testing This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build time code generation and adding extra complexity to the build system, Java code generated from testing Thrift schema and Avro IDL is also checked in. [1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1 [2]: https://issues.apache.org/jira/browse/SPARK-6774 [3]: https://issues.apache.org/jira/browse/SPARK-6123 [4]: https://issues.apache.org/jira/browse/SPARK-8848 Author: Cheng Lian <[email protected]> Closes #7231 from liancheng/spark-6776 and squashes the following commits: 360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite c6fbc06 [Cheng Lian] Removes WIP file committed by mistake b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa 598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift 926af87 [Cheng Lian] Simplifies Parquet compatibility test suites 7946ee1 [Cheng Lian] Fixes Scala styling issues 3d7ab36 [Cheng Lian] Fixes .rat-excludes a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation 1d390aa [Cheng Lian] Adds parquet-thrift compatibility test 440f7b3 [Cheng Lian] Adds generated files to .rat-excludes 13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite 06cfe9d [Cheng Lian] Adds comments about TimestampType handling a099d3e [Cheng Lian] More comments 0cc1b37 [Cheng Lian] Fixes MiMa checks 884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes 802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns 38fe1e7 [Cheng Lian] Adds explicit return type 7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change 1781dff [Cheng Lian] Adds test case for SPARK-8811 6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals a74fb2c [Cheng Lian] More comments 0525346 [Cheng Lian] Removes old Parquet record converters 03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules
1 parent 5687f76 commit 4ffc27c

File tree

27 files changed

+5984
-919
lines changed

27 files changed

+5984
-919
lines changed

.rat-excludes

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -91,3 +91,5 @@ help/*
9191
html/*
9292
INDEX
9393
.lintr
94+
gen-java.*
95+
.*avpr

pom.xml

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -161,6 +161,7 @@
161161
<fasterxml.jackson.version>2.4.4</fasterxml.jackson.version>
162162
<snappy.version>1.1.1.7</snappy.version>
163163
<netlib.java.version>1.1.2</netlib.java.version>
164+
<thrift.version>0.9.2</thrift.version>
164165
<!-- For maven shade plugin (see SPARK-8819) -->
165166
<create.dependency.reduced.pom>false</create.dependency.reduced.pom>
166167

@@ -179,6 +180,8 @@
179180
<hbase.deps.scope>compile</hbase.deps.scope>
180181
<hive.deps.scope>compile</hive.deps.scope>
181182
<parquet.deps.scope>compile</parquet.deps.scope>
183+
<parquet.test.deps.scope>test</parquet.test.deps.scope>
184+
<thrift.test.deps.scope>test</thrift.test.deps.scope>
182185

183186
<!--
184187
Overridable test home. So that you can call individual pom files directly without
@@ -270,6 +273,18 @@
270273
<enabled>false</enabled>
271274
</snapshots>
272275
</repository>
276+
<!-- For transitive dependencies brougt by parquet-thrift -->
277+
<repository>
278+
<id>twttr-repo</id>
279+
<name>Twttr Repository</name>
280+
<url>http://maven.twttr.com</url>
281+
<releases>
282+
<enabled>true</enabled>
283+
</releases>
284+
<snapshots>
285+
<enabled>false</enabled>
286+
</snapshots>
287+
</repository>
273288
<!-- TODO: This can be deleted after Spark 1.4 is posted -->
274289
<repository>
275290
<id>spark-1.4-staging</id>
@@ -1101,6 +1116,24 @@
11011116
<version>${parquet.version}</version>
11021117
<scope>${parquet.deps.scope}</scope>
11031118
</dependency>
1119+
<dependency>
1120+
<groupId>org.apache.parquet</groupId>
1121+
<artifactId>parquet-avro</artifactId>
1122+
<version>${parquet.version}</version>
1123+
<scope>${parquet.test.deps.scope}</scope>
1124+
</dependency>
1125+
<dependency>
1126+
<groupId>org.apache.parquet</groupId>
1127+
<artifactId>parquet-thrift</artifactId>
1128+
<version>${parquet.version}</version>
1129+
<scope>${parquet.test.deps.scope}</scope>
1130+
</dependency>
1131+
<dependency>
1132+
<groupId>org.apache.thrift</groupId>
1133+
<artifactId>libthrift</artifactId>
1134+
<version>${thrift.version}</version>
1135+
<scope>${thrift.test.deps.scope}</scope>
1136+
</dependency>
11041137
<dependency>
11051138
<groupId>org.apache.flume</groupId>
11061139
<artifactId>flume-ng-core</artifactId>

project/MimaExcludes.scala

Lines changed: 2 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -62,21 +62,8 @@ object MimaExcludes {
6262
"org.apache.spark.ml.classification.LogisticCostFun.this"),
6363
// SQL execution is considered private.
6464
excludePackage("org.apache.spark.sql.execution"),
65-
// NanoTime and CatalystTimestampConverter is only used inside catalyst,
66-
// not needed anymore
67-
ProblemFilters.exclude[MissingClassProblem](
68-
"org.apache.spark.sql.parquet.timestamp.NanoTime"),
69-
ProblemFilters.exclude[MissingClassProblem](
70-
"org.apache.spark.sql.parquet.timestamp.NanoTime$"),
71-
ProblemFilters.exclude[MissingClassProblem](
72-
"org.apache.spark.sql.parquet.CatalystTimestampConverter"),
73-
ProblemFilters.exclude[MissingClassProblem](
74-
"org.apache.spark.sql.parquet.CatalystTimestampConverter$"),
75-
// SPARK-6777 Implements backwards compatibility rules in CatalystSchemaConverter
76-
ProblemFilters.exclude[MissingClassProblem](
77-
"org.apache.spark.sql.parquet.ParquetTypeInfo"),
78-
ProblemFilters.exclude[MissingClassProblem](
79-
"org.apache.spark.sql.parquet.ParquetTypeInfo$")
65+
// Parquet support is considered private.
66+
excludePackage("org.apache.spark.sql.parquet")
8067
) ++ Seq(
8168
// SPARK-8479 Add numNonzeros and numActives to Matrix.
8269
ProblemFilters.exclude[MissingMethodProblem](

sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -17,11 +17,12 @@
1717

1818
package org.apache.spark.sql.types
1919

20+
import scala.util.Try
2021
import scala.util.parsing.combinator.RegexParsers
2122

22-
import org.json4s._
2323
import org.json4s.JsonAST.JValue
2424
import org.json4s.JsonDSL._
25+
import org.json4s._
2526
import org.json4s.jackson.JsonMethods._
2627

2728
import org.apache.spark.annotation.DeveloperApi
@@ -82,6 +83,9 @@ abstract class DataType extends AbstractDataType {
8283

8384

8485
object DataType {
86+
private[sql] def fromString(raw: String): DataType = {
87+
Try(DataType.fromJson(raw)).getOrElse(DataType.fromCaseClassString(raw))
88+
}
8589

8690
def fromJson(json: String): DataType = parseDataType(parse(json))
8791

sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -311,6 +311,11 @@ object StructType extends AbstractDataType {
311311

312312
private[sql] override def simpleString: String = "struct"
313313

314+
private[sql] def fromString(raw: String): StructType = DataType.fromString(raw) match {
315+
case t: StructType => t
316+
case _ => throw new RuntimeException(s"Failed parsing StructType: $raw")
317+
}
318+
314319
def apply(fields: Seq[StructField]): StructType = StructType(fields.toArray)
315320

316321
def apply(fields: java.util.List[StructField]): StructType = {

sql/core/pom.xml

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -101,9 +101,45 @@
101101
<version>9.3-1102-jdbc41</version>
102102
<scope>test</scope>
103103
</dependency>
104+
<dependency>
105+
<groupId>org.apache.parquet</groupId>
106+
<artifactId>parquet-avro</artifactId>
107+
<scope>test</scope>
108+
</dependency>
109+
<dependency>
110+
<groupId>org.apache.parquet</groupId>
111+
<artifactId>parquet-thrift</artifactId>
112+
<scope>test</scope>
113+
</dependency>
114+
<dependency>
115+
<groupId>org.apache.thrift</groupId>
116+
<artifactId>libthrift</artifactId>
117+
<scope>test</scope>
118+
</dependency>
104119
</dependencies>
105120
<build>
106121
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
107122
<testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
123+
<plugins>
124+
<plugin>
125+
<groupId>org.codehaus.mojo</groupId>
126+
<artifactId>build-helper-maven-plugin</artifactId>
127+
<executions>
128+
<execution>
129+
<id>add-scala-test-sources</id>
130+
<phase>generate-test-sources</phase>
131+
<goals>
132+
<goal>add-test-source</goal>
133+
</goals>
134+
<configuration>
135+
<sources>
136+
<source>src/test/scala</source>
137+
<source>src/test/gen-java</source>
138+
</sources>
139+
</configuration>
140+
</execution>
141+
</executions>
142+
</plugin>
143+
</plugins>
108144
</build>
109145
</project>

0 commit comments

Comments
 (0)