[SPARK-8838][SQL] Add config to enable/disable merging part-files when merging parquet schema #7238
Conversation
Test build #36573 has finished for PR 7238 at commit
Test build #36586 has finished for PR 7238 at commit
Test build #36671 has finished for PR 7238 at commit
Test build #36674 has finished for PR 7238 at commit
Test build #36676 has finished for PR 7238 at commit
I ran a simple benchmark for this: by disabling merging of part-files when merging the Parquet schema, we can reduce the data loading time to 1/10.
ping @liancheng @marmbrus
…erge Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
Test build #37902 has finished for PR 7238 at commit
ping @liancheng
I'd propose renaming this configuration to spark.sql.parquet.respectSummaryFiles.
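For context, a minimal usage sketch with the proposed name, assuming the SQLContext API of the Spark 1.x line this PR targets (the exact name and default are settled later in this thread):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ??? // an existing SQLContext for the application

// Trust summary files instead of reading every part-file footer.
// The config name follows the rename proposed above; treat it as illustrative.
sqlContext.setConf("spark.sql.parquet.respectSummaryFiles", "true")

val df = sqlContext.read.parquet("hdfs:///path/to/parquet/table")
df.printSchema()
```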
Test build #38077 has finished for PR 7238 at commit
Test build #1166 has finished for PR 7238 at commit
retest this please.
Test build #38154 has finished for PR 7238 at commit
Test build #71 has finished for PR 7238 at commit
@liancheng The test failure looks related to #7421 (comment). Can you look at it? Thanks.
@viirya Yeah, this issue has been causing random build failures recently. I'm planning to look into it right after the 1.5 code freeze deadline (early next week). Let's just retest this PR for now.
retest this please. |
…erge Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
Test build #38346 has finished for PR 7238 at commit
Test build #38348 has finished for PR 7238 at commit
Instead of finding the partitions of the summary files, we now just take the parent paths of the summary files. Then we parse the partitions of the part-files and single out the partition paths that are not covered by any summary file's parent path (i.e., they have no summary file alongside them). Only the part-files under those paths still need to be read for schema merging; a sketch follows below.
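A simplified sketch of that filtering, with hypothetical names (`summaryFiles`, `partFiles`, `partFilesToMerge`) standing in for the PR's internals:

```scala
import org.apache.hadoop.fs.Path

// Given the summary files and part-files discovered under the input paths,
// pick the part-files whose footers still need to be read for schema merging.
def partFilesToMerge(summaryFiles: Seq[Path], partFiles: Seq[Path]): Seq[Path] = {
  // Directories that contain a summary file (_metadata / _common_metadata).
  val dirsWithSummaries: Set[Path] = summaryFiles.map(_.getParent).toSet

  // Part-files living in a directory covered by a summary file can be
  // skipped; only the uncovered ones are read and merged.
  partFiles.filterNot(f => dirsWithSummaries.contains(f.getParent))
}
```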
ping @liancheng Any other comments?
…erge Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
Test build #38517 has finished for PR 7238 at commit
It should be an unrelated failure.
retest this please.
Test build #112 has finished for PR 7238 at commit
Test build #38530 has finished for PR 7238 at commit
ping @liancheng Is this ready to merge now? Thanks.
@viirya Sorry, I was a little busy with other stuff. I'm going through a final check now.
Ok, @liancheng, many thanks.
This is still problematic. For a non-partitioned table like this:

```
base/
  _metadata
  _common_metadata
  file-1
  file-2
  ...
```

parsePartitions always returns an empty PartitionSpec containing no Partitions, thus dataPathsWithoutSummaries is always empty, and we always end up merging all part-files, which is not the expected behavior.
However, as I suggested in the other comment, we can probably just remove this method.
(Edited for typos and some rewording.) Hey @viirya, sorry that I lied, this might not be the FINAL check yet...

Actually, I've begun to regret one of my previous decisions, namely merging part-files which don't have corresponding summary files. This is mostly because there are too many cases to consider if we assume summary files may be missing, which makes the behavior of this configuration pretty unintuitive. Parquet summary files can be missing under various corner cases (I can easily name at least 5 of them right now); the behavior is hard to track and explain, and may confuse Spark users who are not familiar with Parquet implementation details. The key problem here is that Parquet summary files are not written/accessed in an atomic manner. And that's one of the most important reasons why the Parquet team is actually trying to get rid of the summary file entirely.

Since the configuration is named "respectSummaryFiles", it seems more natural and intuitive to assume that summary files are ALWAYS properly generated for ALL Parquet write jobs when this configuration is turned on. To be more specific: given one or more Parquet input paths, we may find one or more summary files. The metadata gathered by merging all these summary files should then reflect the real schema of the given Parquet dataset. Only in this case do we really "respect" existing summary files.

So my suggestion here is that, when the "respectSummaryFiles" configuration is turned on, we only collect the summary files, merge the schemas read from them, and just use the merged schema as the final result schema. And of course, this configuration should still be turned off by default. We also need to document this configuration carefully and add an "expert only" tag to it.

I still consider this configuration quite useful, because even if you have a dirty Parquet dataset without summary files or with incorrect summary files at hand, you can still repair it quite easily. Essentially you only need to call

What do you think? Again, sorry for my late review and your extra effort implementing all those intermediate versions...
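For concreteness, a minimal sketch of the suggested read path under that assumption (summary files are always present when the flag is on); `readSchema` and `mergeSchemas` are hypothetical stand-ins injected in place of the real footer-reading and schema-merging code:

```scala
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.types.StructType

def schemaFromSummariesOnly(
    allFiles: Seq[FileStatus],
    readSchema: Path => StructType,                       // read one footer's schema
    mergeSchemas: (StructType, StructType) => StructType  // merge two schemas
  ): StructType = {
  val summaries = allFiles.filter { f =>
    val name = f.getPath.getName
    name == "_metadata" || name == "_common_metadata"
  }
  require(summaries.nonEmpty,
    "respectSummaryFiles is on, but no summary files were found")
  // Merge schemas read from summary files only; part-files are never touched.
  summaries.map(f => readSchema(f.getPath)).reduceLeft(mergeSchemas)
}
```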
@liancheng Thank you for clarifying. That's no problem for me. In fact, at the beginning of this PR, the proposal was to only merge summary files and skip all part-files. This configuration exists to gain better performance when reading a lot of Parquet files. It is disabled by default, and we document it so that users are very sure of what it means before turning it on. I also agree that we can't take care of all the cases where summary files are missing alongside part-files. So your suggestion works better for this PR. I will update it soon.
Test build #38970 has finished for PR 7238 at commit
…erge Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala
Test build #39023 has finished for PR 7238 at commit
retest this please.
Test build #39031 has finished for PR 7238 at commit
Test build #160 has finished for PR 7238 at commit
Thanks! Merging to master.
JIRA: https://issues.apache.org/jira/browse/SPARK-8838
Currently, all part-files are merged when merging the Parquet schema. However, when there are many part-files, we can often be sure that all of them have the same schema as their summary file. For such cases, this PR provides a configuration to disable merging part-files when merging the Parquet schema.
In short, we need to merge the Parquet schema because different summary files may contain different schemas, whereas the part-files are confirmed to have the same schema as their summary files.
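To illustrate the intended workflow end to end, a hedged sketch: `parquet.enable.summary-metadata` is parquet-mr's Hadoop-side flag for emitting summary files, and the Spark config name follows the rename discussed in this thread.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ??? // an existing SQLContext for the application

// Ask parquet-mr to write _metadata / _common_metadata summary files.
sqlContext.sparkContext.hadoopConfiguration
  .set("parquet.enable.summary-metadata", "true")

sqlContext.range(0, 1000).write.parquet("hdfs:///tmp/events")

// On read, take the merged schema from the summary files alone instead of
// touching every part-file footer, which is where the speedup comes from.
sqlContext.setConf("spark.sql.parquet.respectSummaryFiles", "true")
val reloaded = sqlContext.read.parquet("hdfs:///tmp/events")
reloaded.printSchema()
```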