
Conversation

@liancheng
Contributor

This PR can be quite challenging to review. I'm trying to give a detailed description of the problem as well as its solution here.

When reading Parquet files, we need to specify a potentially nested Parquet schema (of type MessageType) as the requested schema for column pruning. This Parquet schema is translated from a Catalyst schema (of type StructType), which is generated by the query planner and represents all requested columns. However, this translation can be fairly complicated, for several reasons:

  1. The requested schema must conform to the real schema of the physical file to be read.

This means we have to tailor the actual file schema of every individual physical Parquet file to be read according to the given Catalyst schema. Fortunately, we are already doing this in Spark 1.5 by pushing requested schema conversion to the executor side in PR #7231.
  2. Support for schema merging.

A single Parquet dataset may consist of multiple physical Parquet files that come with different but compatible schemas. This means we may request a column path that doesn't exist in a particular physical Parquet file. All requested column paths can be nested. For example, given a Parquet file schema

message root {
  required group f0 {
    required group f00 {
      required int32 f000;
      required binary f001 (UTF8);
    }
  }
}

we may request column paths defined in the following schema:

message root {
  required group f0 {
    required group f00 {
      required binary f001 (UTF8);
      required float f002;
    }
  }

  optional double f1;
}

Notice that we pruned column path f0.f00.f000, but added f0.f00.f002 and f1.

The good news is that Parquet handles non-existing column paths properly and always returns null for them.
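
To make this concrete, here is a minimal usage sketch (not part of this PR; the dataset path is hypothetical) showing how such a merged dataset is read with the Spark 1.5 DataFrame reader:

// Enable Parquet schema merging so per-file schemas are merged into one
// logical schema; paths absent from a given file simply read as null.
val df = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("hdfs:///data/events")

// f0.f00.f002 and f1 need not exist in every file; missing values are null.
df.select("f0.f00.f002", "f1").show()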
  3. The mapping from StructType to MessageType is one-to-many.

This is the most unfortunate part.

Due to historical reasons (dark histories!), schemas of Parquet files generated by different libraries have different "flavors". For example, to handle a schema with a single non-nullable column, whose type is an array of non-nullable integers, parquet-protobuf generates the following Parquet schema:

message m0 {
  repeated int32 f;
}

while parquet-avro generates another version:

message m1 {
  required group f (LIST) {
    repeated int32 array;
  }
}

and parquet-thrift spits out this:

message m2 {
  required group f (LIST) {
    repeated int32 f_tuple;
  }
}

All of them can be mapped to the following unique Catalyst schema:

StructType(
  StructField(
    "f",
    ArrayType(IntegerType, containsNull = false),
    nullable = false))

This greatly complicates Parquet requested schema construction, since the path of a given column varies in different cases. To read the array elements from files with the above schemas, we must use f for m0, f.array for m1, and f.f_tuple for m2.

In earlier Spark versions, we didn't try to fix this issue properly. Spark 1.4 and prior versions simply translate the Catalyst schema in a way more or less compatible with parquet-hive and parquet-avro, but this is broken in many other cases. Earlier revisions of Spark 1.5 only try to tailor the Parquet file schema at the first level, and ignore nested ones. This caused SPARK-10301 as well as SPARK-10005. In PR #8228, I tried to avoid the hard part of the problem and made a minimal change in CatalystRowConverter to fix SPARK-10005. However, when taking SPARK-10301 into consideration, continuing to hack CatalystRowConverter doesn't seem to be a good idea. So this PR is an attempt to fix the problem in a proper way.

For a given physical Parquet file with schema ps and a compatible Catalyst requested schema cs, we use the following algorithm to tailor ps into the resulting Parquet requested schema ps':

For each leaf column path c in cs:

  • if a corresponding Parquet column path c' can be found in ps, c' should be included in ps';
  • otherwise, we convert c to a Parquet column path c" using CatalystSchemaConverter, and include c" in ps';
  • no other column paths should exist in ps'.

Then comes the most tedious part:

Given cs, ps, and c, how do we locate c' in ps?

Unfortunately, there's no quick answer, and we have to enumerate all possible structures defined in the parquet-format spec. They are:

  1. the standard structure of nested types, and
  2. cases defined in all backwards-compatibility rules for LIST and MAP.

The core part of this PR is CatalystReadSupport.clipParquetType(), which tailors a given Parquet file schema according to a requested schema in its Catalyst form. Backwards-compatibility rules of LIST and MAP are covered in clipParquetListType() and clipParquetMapType() respectively. The column path selection algorithm is implemented in clipParquetGroupFields().
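
As a rough illustration, here is a simplified sketch of the field selection step, modeled on clipParquetGroupFields() (details may differ from the merged code):

import scala.collection.JavaConverters._
import org.apache.parquet.schema.{GroupType, Type}
import org.apache.spark.sql.types.StructType

// For each requested Catalyst field, reuse the matching field from the file
// schema when it exists (tailoring it recursively via clipParquetType), and
// otherwise synthesize a Parquet field with CatalystSchemaConverter.
private def clipParquetGroupFields(
    parquetRecord: GroupType, structType: StructType): Seq[Type] = {
  val parquetFieldMap = parquetRecord.getFields.asScala.map(f => f.getName -> f).toMap
  val toParquet = new CatalystSchemaConverter()
  structType.map { f =>
    parquetFieldMap
      .get(f.name)                           // the column path c exists in ps
      .map(clipParquetType(_, f.dataType))   // tailor it recursively into c'
      .getOrElse(toParquet.convertField(f))  // otherwise convert c to c"
  }
}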

With this PR, we no longer need to do schema tailoring in CatalystReadSupport and CatalystRowConverter. Another benefit is that we can now also read Parquet datasets consisting of files that have different physical Parquet schemas but share the same logical schema, for example, files generated by different Parquet libraries. This situation is illustrated by this test case.

Contributor Author

This is an irrelevant change, added to stop IntelliJ IDEA highlighting errors in ScalaDoc.

@liancheng force-pushed the spark-10301/fix-parquet-requested-schema branch from f32bedf to 88ab2a9 on August 28, 2015 16:48
@liancheng
Contributor Author

@rxin Considering this is a pretty major change and SPARK-10301 isn't a blocker, I'm not quite sure whether we should include this in 1.5 at this moment. Another thing to note is that, to the best of my knowledge, most (if not all) existing Parquet libraries suffer from this issue.

@SparkQA

SparkQA commented Aug 28, 2015

Test build #41749 has finished for PR 8509 at commit a53cf34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 28, 2015

Test build #41750 has finished for PR 8509 at commit 88ab2a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

This seems risky to include in branch-1.5 given how far along in the process we are. I'd propose we instead merge a small patch that checks that the things being zipped are the same size, and if not, throws an error asking the user to turn on schema merging (the Parquet error is very confusing). We can merge this into master.
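
As a hedged sketch of that quick fix (the names parquetRequestedSchema and catalystRequestedSchema are illustrative; the actual patch is in #8515):

import org.apache.spark.sql.AnalysisException

// Fail fast with an actionable message instead of letting Parquet's confusing
// zip error surface when per-file schemas have mismatched field counts.
if (parquetRequestedSchema.getFieldCount != catalystRequestedSchema.length) {
  throw new AnalysisException(
    "Parquet file schema and requested Catalyst schema have different numbers of fields. " +
      "Consider enabling schema merging by setting spark.sql.parquet.mergeSchema to true.")
}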

@liancheng
Contributor Author

The quick fix @marmbrus mentioned has been added as part of #8515 (yhuai@b509bee).

@liancheng force-pushed the spark-10301/fix-parquet-requested-schema branch from 5004365 to f21d88e on August 30, 2015 10:12
@SparkQA

SparkQA commented Aug 30, 2015

Test build #41809 has finished for PR 8509 at commit 38644d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaTrainValidationSplitExample
    • class KMeans @Since("1.5.0") (
    • class GaussianMixtureModel @Since("1.3.0") (
    • class KMeansModel @Since("1.1.0") (@Since("1.0.0") val clusterCenters: Array[Vector])
    • class PowerIterationClusteringModel @Since("1.3.0") (
    • class StreamingKMeansModel @Since("1.2.0") (
    • class StreamingKMeans @Since("1.2.0") (
    • class ChiSqSelectorModel @Since("1.3.0") (
    • class ChiSqSelector @Since("1.3.0") (
    • class ElementwiseProduct @Since("1.4.0") (
    • class IDF @Since("1.2.0") (@Since("1.2.0") val minDocFreq: Int)
    • class Normalizer @Since("1.1.0") (p: Double) extends VectorTransformer
    • class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int)
    • class StandardScaler @Since("1.1.0") (withMean: Boolean, withStd: Boolean) extends Logging
    • class StandardScalerModel @Since("1.3.0") (
    • class PoissonGenerator @Since("1.1.0") (
    • class ExponentialGenerator @Since("1.3.0") (
    • class GammaGenerator @Since("1.3.0") (
    • class LogNormalGenerator @Since("1.3.0") (
    • case class Rating @Since("0.8.0") (
    • class MatrixFactorizationModel @Since("0.8.0") (
    • abstract class GeneralizedLinearModel @Since("1.0.0") (
    • class IsotonicRegressionModel @Since("1.3.0") (
    • case class LabeledPoint @Since("1.0.0") (
    • class LassoModel @Since("1.1.0") (
    • class LinearRegressionModel @Since("1.1.0") (
    • class RidgeRegressionModel @Since("1.1.0") (
    • class MultivariateGaussian @Since("1.3.0") (
    • case class BoostingStrategy @Since("1.4.0") (
    • class Strategy @Since("1.3.0") (
    • class DecisionTreeModel @Since("1.0.0") (
    • class Node @Since("1.2.0") (
    • class Predict @Since("1.2.0") (
    • class RandomForestModel @Since("1.2.0") (
    • class GradientBoostedTreesModel @Since("1.2.0") (
    • case class LimitNode(limit: Int, child: LocalNode) extends UnaryLocalNode
    • case class UnionNode(children: Seq[LocalNode]) extends LocalNode

@SparkQA

SparkQA commented Aug 30, 2015

Test build #41806 has finished for PR 8509 at commit f21d88e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LimitNode(limit: Int, child: LocalNode) extends UnaryLocalNode
    • case class UnionNode(children: Seq[LocalNode]) extends LocalNode

@liancheng
Contributor Author

Merging to master.

@asfgit closed this in 391e6be on Sep 1, 2015
@liancheng deleted the spark-10301/fix-parquet-requested-schema branch on September 1, 2015 10:52
liancheng added a commit to liancheng/spark that referenced this pull request Sep 3, 2015
Contributor

What about UDT?

Contributor

Looks like for a UDT, we need to call isPrimitiveCatalystType on the sqlType of this UDT?

Contributor

After chatting with @liancheng offline, we agreed that we should not handle UDT here (leave it as it is).

Contributor

Then, let's add a comment here to explain the reason.

@yhuai
Contributor

yhuai commented Sep 4, 2015

Also, if there are any changes in https://github.com/apache/spark/pull/8583/files that are not in this one, let's have a follow-up for our master branch.

Contributor

Should we also call clipParquetType for parquetKeyType? What will happen if the key is a complex type?

Contributor Author

We don't allow map key types to be complex types in Spark SQL. This is consistent with Hive.

Contributor Author

Actually, although complex map keys are not allowed while using HiveQL in Spark SQL, they are allowed otherwise, and we can read/write them from/to Parquet successfully. So we do need to handle complex map keys here.
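
For illustration, a hedged sketch of what clipping the key type as well could look like, modeled on this PR's clipParquetMapType() (the actual change landed in #8583):

import org.apache.parquet.schema.{GroupType, Types}
import org.apache.spark.sql.types.DataType

// A Parquet MAP is a group containing a repeated key/value group; clip the
// key type as well as the value type, so complex keys are pruned correctly.
private def clipParquetMapType(
    parquetMap: GroupType, keyType: DataType, valueType: DataType): GroupType = {
  val repeatedGroup = parquetMap.getType(0).asGroupType()
  val clippedRepeatedGroup = Types
    .repeatedGroup()
    .as(repeatedGroup.getOriginalType)
    .addField(clipParquetType(repeatedGroup.getType(0), keyType))   // clip the key too
    .addField(clipParquetType(repeatedGroup.getType(1), valueType))
    .named(repeatedGroup.getName)

  Types
    .buildGroup(parquetMap.getRepetition)
    .as(parquetMap.getOriginalType)
    .addField(clippedRepeatedGroup)
    .named(parquetMap.getName)
}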

Contributor Author

Added support for this in #8583.

@liancheng
Contributor Author

Since this PR has already been merged, I'm addressing all the comments in #8583, which backports this PR to branch-1.5. Will send out a separate PR later to address these issues for master.

Contributor

should be f01ElementType?

Contributor Author

Yes, this error has been fixed in #8583.

Contributor

Should we add an assert here to make sure parquetType matches catalystType?

Contributor Author

At first I thought it would be too complicated to add this assertion here, since there can be multiple Parquet representations of a single Catalyst type, and some of them may even conflict with each other. But I just realized that we can simply resort to CatalystSchemaConverter to convert parquetType to a Catalyst type and see whether the result matches catalystType. This works because the mapping from Catalyst types to Parquet types is one-to-many, so the reverse direction is unambiguous.
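
A minimal sketch of that round-trip check, assuming CatalystSchemaConverter.convert(MessageType) is available for the Parquet-to-Catalyst direction (parquetType and catalystType are the values discussed above):

import org.apache.parquet.schema.MessageType

// Wrap the single Parquet field in a message, convert it back to a Catalyst
// type, and compare. Parquet -> Catalyst is many-to-one, so this direction
// is unambiguous.
val converter = new CatalystSchemaConverter()
val convertedType = converter.convert(new MessageType("root", parquetType))
assert(
  convertedType.fields.head.dataType == catalystType,
  s"Parquet type $parquetType is not compatible with Catalyst type $catalystType")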

Contributor Author

I found that adding this assertion is still a pretty big change. Since it's only defensive and doesn't affect correctness, I'd like to do it in a separate PR.

@liancheng
Contributor Author

Now all the comments are addressed in #8583.

asfgit pushed a commit that referenced this pull request Sep 9, 2015
…or nested structs

We used to work around SPARK-10301 with a quick fix in branch-1.5 (PR #8515), but it doesn't cover the case described in SPARK-10428. So this PR backports PR #8509, which had once been considered too big a change to merge into branch-1.5 at the last minute, to fix both SPARK-10301 and SPARK-10428 for Spark 1.5. It also adds more test cases for SPARK-10428.

This PR looks big, but the essential change is only ~200 LOC. All other changes are for testing. In particular, PR #8454 is also backported here because the `ParquetInteroperabilitySuite` introduced in PR #8515 depends on it. This should be safe since #8454 only touches testing code.

Author: Cheng Lian <[email protected]>

Closes #8583 from liancheng/spark-10301/for-1.5.
liancheng added a commit to liancheng/spark that referenced this pull request Sep 9, 2015
asfgit pushed a commit that referenced this pull request Sep 10, 2015
…8509 for master

Author: Cheng Lian <[email protected]>

Closes #8670 from liancheng/spark-10301/address-pr-comments.
ashangit pushed a commit to ashangit/spark that referenced this pull request Oct 19, 2016

(cherry picked from commit fca16c5)

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala