[SPARK-8838][SQL] Add config to enable/disable merging part-files when merging parquet schema #7238
Conversation
Test build #36573 has finished for PR 7238 at commit
Test build #36586 has finished for PR 7238 at commit
Test build #36671 has finished for PR 7238 at commit
Test build #36674 has finished for PR 7238 at commit
Test build #36676 has finished for PR 7238 at commit
I ran a simple benchmark for this: by disabling merging of part-files when merging the Parquet schema, we can reduce the data loading time to 1/10.
ping @liancheng @marmbrus
…erge Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
Test build #37902 has finished for PR 7238 at commit
ping @liancheng
I'd propose renaming this configuration to spark.sql.parquet.respectSummaryFiles.
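For context, a minimal usage sketch with the proposed name, assuming the SQLContext API of the Spark 1.x line this PR targets (the exact name and default are settled later in this thread):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ??? // an existing SQLContext for the application

// Trust summary files instead of reading every part-file footer.
// The config name follows the rename proposed above; treat it as illustrative.
sqlContext.setConf("spark.sql.parquet.respectSummaryFiles", "true")

val df = sqlContext.read.parquet("hdfs:///path/to/parquet/table")
df.printSchema()
```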
Test build #38077 has finished for PR 7238 at commit
Test build #1166 has finished for PR 7238 at commit
retest this please.
Test build #38154 has finished for PR 7238 at commit
Test build #71 has finished for PR 7238 at commit
@liancheng The test failure looks related to #7421 (comment). Can you look at it? Thanks.
@viirya Yeah, this issue has been causing random build failures recently. I'm planning to look into it right after the 1.5 code freeze deadline (early next week). Let's just retest this PR for now.
retest this please. |
…erge Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
Test build #38346 has finished for PR 7238 at commit
Test build #38348 has finished for PR 7238 at commit
Instead of finding the partitions of the summary files, we now just take the parent paths of the summary files. Then we parse the partitions of the part-files and single out the partition paths that are not covered by any summary file's parent path (i.e., they have no summary file alongside them). Only the part-files under those paths still need to be read for schema merging; a sketch follows below.
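A simplified sketch of that filtering, with hypothetical names (`summaryFiles`, `partFiles`, `partFilesToMerge`) standing in for the PR's internals:

```scala
import org.apache.hadoop.fs.Path

// Given the summary files and part-files discovered under the input paths,
// pick the part-files whose footers still need to be read for schema merging.
def partFilesToMerge(summaryFiles: Seq[Path], partFiles: Seq[Path]): Seq[Path] = {
  // Directories that contain a summary file (_metadata / _common_metadata).
  val dirsWithSummaries: Set[Path] = summaryFiles.map(_.getParent).toSet

  // Part-files living in a directory covered by a summary file can be
  // skipped; only the uncovered ones are read and merged.
  partFiles.filterNot(f => dirsWithSummaries.contains(f.getParent))
}
```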
ping @liancheng Any other comments?
…erge Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
Test build #38517 has finished for PR 7238 at commit
It should be an unrelated failure.
retest this please.
Test build #112 has finished for PR 7238 at commit
Test build #38530 has finished for PR 7238 at commit
ping @liancheng Is this ready to merge now? Thanks.
@viirya Sorry, I was a little busy with other stuff. I'm going through a final check now.
Ok, @liancheng, many thanks.
This is still problematic. For a non-partitioned table like this:

```
base/
  _metadata
  _common_metadata
  file-1
  file-2
  ...
```

parsePartitions always returns an empty PartitionSpec containing no Partitions, thus dataPathsWithoutSummaries is always empty, and we always end up merging all part-files, which is not the expected behavior.
However, as I suggested in the other comment, we can probably just remove this method.
(Edited for typos and some rewording.) Hey @viirya, sorry that I lied, this might not be the FINAL check yet...

Actually, I've begun to regret one of my previous decisions, namely merging part-files which don't have corresponding summary files. This is mostly because there are too many cases to consider if we assume summary files may be missing, which makes the behavior of this configuration pretty unintuitive. Parquet summary files can be missing under various corner cases (I can easily name at least 5 of them right now); the behavior is hard to track and explain, and may confuse Spark users who are not familiar with Parquet implementation details. The key problem here is that Parquet summary files are not written/accessed in an atomic manner. And that's one of the most important reasons why the Parquet team is actually trying to get rid of the summary file entirely.

Since the configuration is named "respectSummaryFiles", it seems more natural and intuitive to assume that summary files are ALWAYS properly generated for ALL Parquet write jobs when this configuration is turned on. To be more specific: given one or more Parquet input paths, we may find one or more summary files. The metadata gathered by merging all these summary files should then reflect the real schema of the given Parquet dataset. Only in this case do we really "respect" existing summary files.

So my suggestion here is that, when the "respectSummaryFiles" configuration is turned on, we only collect the summary files, merge the schemas read from them, and just use the merged schema as the final result schema. And of course, this configuration should still be turned off by default. We also need to document this configuration carefully and add an "expert only" tag to it.

I still consider this configuration quite useful, because even if you have a dirty Parquet dataset without summary files or with incorrect summary files at hand, you can still repair it quite easily. Essentially you only need to call

What do you think? Again, sorry for my late review and your extra effort implementing all those intermediate versions...
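For concreteness, a minimal sketch of the suggested read path under that assumption (summary files are always present when the flag is on); `readSchema` and `mergeSchemas` are hypothetical stand-ins injected in place of the real footer-reading and schema-merging code:

```scala
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.types.StructType

def schemaFromSummariesOnly(
    allFiles: Seq[FileStatus],
    readSchema: Path => StructType,                       // read one footer's schema
    mergeSchemas: (StructType, StructType) => StructType  // merge two schemas
  ): StructType = {
  val summaries = allFiles.filter { f =>
    val name = f.getPath.getName
    name == "_metadata" || name == "_common_metadata"
  }
  require(summaries.nonEmpty,
    "respectSummaryFiles is on, but no summary files were found")
  // Merge schemas read from summary files only; part-files are never touched.
  summaries.map(f => readSchema(f.getPath)).reduceLeft(mergeSchemas)
}
```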
@liancheng Thank you for clarifying. That's no problem for me. In fact, at the beginning of this PR, the proposal was to only merge summary files and skip all part-files. This configuration exists to gain better performance when reading a lot of Parquet files. It is disabled by default, and we document it so that users are very sure of what it means before turning it on. I also agree that we can't take care of all the cases where summary files are missing alongside part-files. So your suggestion works better for this PR. I will update it soon.
Test build #38970 has finished for PR 7238 at commit
…erge Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala
Test build #39023 has finished for PR 7238 at commit
retest this please.
Test build #39031 has finished for PR 7238 at commit
Test build #160 has finished for PR 7238 at commit
Thanks! Merging to master.
JIRA: https://issues.apache.org/jira/browse/SPARK-8838
Currently, all part-files are merged when merging the Parquet schema. However, when there are many part-files, we can often be sure that all of them have the same schema as their summary file. For such cases, this PR provides a configuration to disable merging part-files when merging the Parquet schema.
In short, we need to merge the Parquet schema because different summary files may contain different schemas, whereas the part-files are confirmed to have the same schema as their summary files.
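To illustrate the intended workflow end to end, a hedged sketch: `parquet.enable.summary-metadata` is parquet-mr's Hadoop-side flag for emitting summary files, and the Spark config name follows the rename discussed in this thread.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ??? // an existing SQLContext for the application

// Ask parquet-mr to write _metadata / _common_metadata summary files.
sqlContext.sparkContext.hadoopConfiguration
  .set("parquet.enable.summary-metadata", "true")

sqlContext.range(0, 1000).write.parquet("hdfs:///tmp/events")

// On read, take the merged schema from the summary files alone instead of
// touching every part-file footer, which is where the speedup comes from.
sqlContext.setConf("spark.sql.parquet.respectSummaryFiles", "true")
val reloaded = sqlContext.read.parquet("hdfs:///tmp/events")
reloaded.printSchema()
```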