[SPARK-10301] [SQL] Fixes schema merging for nested structs #8509
Conversation
This is an irrelevant change, added to stop IntelliJ IDEA highlighting errors in ScalaDoc.
@rxin Considering this is a pretty major change and SPARK-10301 isn't a blocker, I'm not quite sure whether we should include this in 1.5 at this point. Another thing to note is that, to the best of my knowledge, most (if not all) existing Parquet libraries suffer from this issue.
Test build #41749 has finished for PR 8509 at commit
Test build #41750 has finished for PR 8509 at commit
This seems risky to include in branch-1.5 given how far along in the release process we are. I'd propose we instead merge a small patch that checks that the things being zipped are the same size and, if not, throws an error asking the user to turn on schema merging (the Parquet error is very confusing). We can merge this into master.
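A rough sketch of the kind of guard being proposed (the function and field names here are hypothetical, not the actual patch):

```scala
// Fail fast with a readable message instead of letting Parquet throw a
// confusing error when file fields and requested fields are zipped together.
def checkFieldCount(parquetFields: Seq[String], catalystFields: Seq[String]): Unit =
  require(
    parquetFields.size == catalystFields.size,
    s"Parquet file has ${parquetFields.size} fields but ${catalystFields.size} were " +
      "requested. Try enabling schema merging via the 'mergeSchema' option.")
```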
The quick fix @marmbrus mentioned has been added as part of #8515 (yhuai@b509bee).
Test build #41809 has finished for PR 8509 at commit
Test build #41806 has finished for PR 8509 at commit
Merging to master.
What about UDT?
Looks like for a UDT, we need to call `isPrimitiveCatalystType` on the `sqlType` of this UDT?
After chatting with @liancheng offline, we agreed that we should not handle UDT here (leave it as it is).
Then let's add a comment here to explain the reason.
Also, if there are any changes in https://github.com/apache/spark/pull/8583/files that are not in this one, let's have a follow-up PR for our master branch.
Should we also call `clipParquetType` for `parquetKeyType`? What will happen if the key is a complex type?
We don't allow the map key to be a complex type in Spark SQL. This is consistent with Hive.
Actually, although complex map keys are not allowed when using HiveQL in Spark SQL, they are allowed otherwise, and we can read/write them from/to Parquet successfully. So we do need to handle complex map keys here.
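For example (a sketch using the public `org.apache.spark.sql.types` API; the particular schema is hypothetical), such a type is easy to construct through the DataFrame API even though HiveQL DDL can't express it:

```scala
import org.apache.spark.sql.types._

// A map whose key is a struct: not expressible in HiveQL DDL, but a
// valid Catalyst type that can round-trip through Parquet.
val mapWithStructKey = MapType(
  keyType = StructType(Seq(StructField("a", IntegerType))),
  valueType = StringType)
```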
Added support for this in #8583.
Since this PR has already been merged, I'm addressing all the comments in #8583, which backports this PR to branch-1.5. I will send out a separate PR later to address these issues for master.
Should be `f01ElementType`?
Yes, this error has been fixed in #8583.
Should we add an assert here to make sure `parquetType` matches `catalystType`?
At first I thought it would be too complicated to add this assertion here, since there can be multiple Parquet representations for a single Catalyst type, and some of them may even conflict with each other. But I just realized that we can simply resort to `CatalystSchemaConverter` to convert `parquetType` to a Catalyst type and check whether the result matches `catalystType`. This works because the mapping from Catalyst types to Parquet types is one-to-many, so the reverse conversion is unambiguous.
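A hedged sketch of that check, assuming Spark SQL's internal `CatalystSchemaConverter` and its `convertField` method (which maps a Parquet `Type` back to a Catalyst `DataType`); the exact constructor and method signatures are assumptions here:

```scala
// Inside Spark SQL's parquet package: convert the Parquet type back to
// Catalyst and compare against the requested Catalyst type.
val converter = new CatalystSchemaConverter()
assert(
  converter.convertField(parquetType) == catalystType,
  s"Parquet type $parquetType does not match requested Catalyst type $catalystType")
```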
I found that adding this assertion is still a pretty big change. Since it's only defensive and doesn't affect correctness, I'd like to do it in a separate PR.
Now all the comments are addressed in #8583.
…or nested structs

We used to work around SPARK-10301 with a quick fix in branch-1.5 (PR #8515), but it doesn't cover the case described in SPARK-10428. So this PR backports PR #8509, which had once been considered too big a change to merge into branch-1.5 at the last minute, to fix both SPARK-10301 and SPARK-10428 for Spark 1.5. Also added more test cases for SPARK-10428.

This PR looks big, but the essential change is only ~200 loc. All other changes are for testing. In particular, PR #8454 is also backported here because the `ParquetInteroperabilitySuite` introduced in PR #8515 depends on it. This should be safe since #8454 only touches testing code.

Author: Cheng Lian <[email protected]>

Closes #8583 from liancheng/spark-10301/for-1.5.
…8509 for master

Author: Cheng Lian <[email protected]>

Closes #8670 from liancheng/spark-10301/address-pr-comments.
This PR can be quite challenging to review. I'm trying to give a detailed description of the problem as well as its solution here.
When reading Parquet files, we need to specify a potentially nested Parquet schema (of type `MessageType`) as the requested schema for column pruning. This Parquet schema is translated from a Catalyst schema (of type `StructType`), which is generated by the query planner and represents all requested columns. However, this translation can be fairly complicated for several reasons:

1. The requested schema must conform to the actual schema of each physical file being read.

This means we have to tailor the actual file schema of every individual physical Parquet file to be read according to the given Catalyst schema. Fortunately, we are already doing this in Spark 1.5 by pushing requested schema conversion to the executor side in PR #7231.
2. Support for schema merging.

A single Parquet dataset may consist of multiple physical Parquet files with different but compatible schemas. This means we may request a column path that doesn't exist in a particular physical Parquet file, and all requested column paths can be nested.
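For example, consider a Parquet file schema along the lines of the following sketch (the field types here are assumed for illustration; only the column paths matter):

```
message root {
  required group f0 {
    required group f00 {
      optional int32 f000;
      optional binary f001 (UTF8);
    }
  }
}
```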
Given that file schema, we may request column paths defined in the following schema:
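Again a sketch with assumed types, consistent with the column paths discussed next:

```
message root {
  required group f0 {
    required group f00 {
      optional binary f001 (UTF8);
      optional float f002;
    }
  }
  optional double f1;
}
```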
Notice that we pruned column path `f0.f00.f000`, but added `f0.f00.f002` and `f1`. The good news is that Parquet handles non-existent column paths properly and always returns null for them.
3. The mapping from `StructType` to `MessageType` is one-to-many.

This is the most unfortunate part.
Due to historical reasons (dark histories!), schemas of Parquet files generated by different libraries have different "flavors". For example, to handle a schema with a single non-nullable column, whose type is an array of non-nullable integers, parquet-protobuf generates the following Parquet schema:
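A sketch of the protobuf-style flavor, where the repeated field sits directly at the top level (consistent with the plain `f` path used for `m0` below):

```
message m0 {
  repeated int32 f;
}
```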
while parquet-avro generates another version:
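A sketch of the avro-style flavor, where the repeated element field is named `array` (matching the `f.array` path below):

```
message m1 {
  required group f (LIST) {
    repeated int32 array;
  }
}
```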
and parquet-thrift spills this:
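And a sketch of the thrift-style flavor, with the repeated element field named `f_tuple` (matching the `f.f_tuple` path below):

```
message m2 {
  required group f (LIST) {
    repeated int32 f_tuple;
  }
}
```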
All of them can be mapped to the following unique Catalyst schema:
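That is, a single non-nullable column holding an array of non-nullable integers. In Scala, using the public `org.apache.spark.sql.types` API:

```scala
import org.apache.spark.sql.types._

val catalystSchema =
  StructType(Seq(
    StructField(
      "f",
      ArrayType(IntegerType, containsNull = false),
      nullable = false)))
```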
This greatly complicates Parquet requested schema construction, since the path of a given column varies across these flavors: to read the array elements from files with the above schemas, we must use `f` for `m0`, `f.array` for `m1`, and `f.f_tuple` for `m2`.

In earlier Spark versions, we didn't try to fix this issue properly. Spark 1.4 and prior versions simply translate the Catalyst schema in a way more or less compatible with parquet-hive and parquet-avro, which is broken in many other cases. Earlier revisions of Spark 1.5 only tailor the Parquet file schema at the first level and ignore nested ones. This caused SPARK-10301 as well as SPARK-10005. In PR #8228, I tried to avoid the hard part of the problem and made a minimal change in `CatalystRowConverter` to fix SPARK-10005. However, taking SPARK-10301 into consideration, continuing to hack `CatalystRowConverter` doesn't seem to be a good idea. So this PR is an attempt to fix the problem in a proper way.

For a given physical Parquet file with schema `ps` and a compatible Catalyst requested schema `cs`, we use the following algorithm to tailor `ps` into the resulting Parquet requested schema `ps'`. For each leaf column path `c` in `cs`:

- if `c` exists in `cs` and a corresponding Parquet column path `c'` can be found in `ps`, then `c'` should be included in `ps'`;
- otherwise, we convert `c` to a Parquet column path `c"` using `CatalystSchemaConverter` and include `c"` in `ps'`;
- no other column paths should exist in `ps'`.

Then comes the most tedious part: deciding whether a given Parquet column path in `ps` actually corresponds to a requested Catalyst column path.
Unfortunately, there's no quick answer, and we have to enumerate all possible structures defined in the parquet-format spec: the standard structures for nested types, plus all the cases covered by the backwards-compatibility rules for `LIST` and `MAP`.

The core part of this PR is `CatalystReadSupport.clipParquetType()`, which tailors a given Parquet file schema according to a requested schema in its Catalyst form. Backwards-compatibility rules of `LIST` and `MAP` are covered in `clipParquetListType()` and `clipParquetMapType()` respectively. The column path selection algorithm is implemented in `clipParquetGroupFields()`.
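To make the selection algorithm concrete, here is a self-contained toy sketch of the idea behind `clipParquetGroupFields()`. It is not Spark's actual implementation: real clipping works on Parquet `Type` objects and must also honor the `LIST`/`MAP` backwards-compatibility rules above, while this sketch models both schemas with one tiny ADT:

```scala
sealed trait Field { def name: String }
case class Leaf(name: String) extends Field
case class Group(name: String, fields: Seq[Field]) extends Field

// For each requested field, prefer the shape found in the file schema;
// groups are clipped recursively, and fields missing from the file schema
// fall back to their requested (Catalyst-converted) form.
def clipFields(fileFields: Seq[Field], requestedFields: Seq[Field]): Seq[Field] =
  requestedFields.map { requested =>
    fileFields.find(_.name == requested.name) match {
      case Some(Group(n, fs)) =>
        requested match {
          case Group(_, reqFs) => Group(n, clipFields(fs, reqFs))
          case _ => requested // shape mismatch: fall back to the requested form
        }
      case Some(leaf: Leaf) => leaf
      case None => requested // missing column path: Parquet returns nulls for it
    }
  }

// Example from item 2 above: prune f0.f00.f000, keep f0.f00.f001,
// and add f0.f00.f002 and f1.
val fileSchema = Seq(
  Group("f0", Seq(Group("f00", Seq(Leaf("f000"), Leaf("f001"))))))
val requestedSchema = Seq(
  Group("f0", Seq(Group("f00", Seq(Leaf("f001"), Leaf("f002"))))),
  Leaf("f1"))

println(clipFields(fileSchema, requestedSchema))
```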
CatalystReadSupportandCatalystRowConverter. Another benefit is that, now we can also read Parquet datasets consist of files with different physical Parquet schema but share the same logical schema, for example, files generated by different Parquet libraries. This situation is illustrated by this test case.