Conversation

@viirya
Member

@viirya viirya commented Jul 6, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-8838

Currently, all part-files are read when merging the Parquet schema. However, when there are many part-files, we may be able to guarantee that every part-file has the same schema as its summary file. For that case, this PR provides a configuration to disable merging part-files when merging the Parquet schema.

In short, we need to merge Parquet schemas because different summary files may contain different schemas; the part-files, however, are confirmed to have the same schema as their summary files, so they can be skipped.

@SparkQA

SparkQA commented Jul 6, 2015

Test build #36573 has finished for PR 7238 at commit bbd4ce7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ScalaUDF(
    • case class CurrentDate() extends LeafExpression
    • case class CurrentTimestamp() extends LeafExpression
    • case class Hex(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class UnHex(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class ShiftLeft(left: Expression, right: Expression)
    • case class ShiftRight(left: Expression, right: Expression)
    • case class ShiftRightUnsigned(left: Expression, right: Expression)
    • case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
    • case class Ascii(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class Base64(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class UnBase64(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class Decode(bin: Expression, charset: Expression)
    • case class Encode(value: Expression, charset: Expression)
    • case class UserDefinedFunction protected[sql] (

@SparkQA

SparkQA commented Jul 6, 2015

Test build #36586 has finished for PR 7238 at commit 8bbebcb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 7, 2015

Test build #36671 has finished for PR 7238 at commit 4bdd7e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 7, 2015

Test build #36674 has finished for PR 7238 at commit 3b6be5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 7, 2015

Test build #36676 has finished for PR 7238 at commit 0e734e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 9, 2015

I ran a simple benchmark for this: by disabling part-file merging during Parquet schema merging, the data loading time drops to about 1/10 of the original.

@viirya
Member Author

viirya commented Jul 11, 2015

ping @liancheng @marmbrus

…erge

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
@SparkQA

SparkQA commented Jul 21, 2015

Test build #37902 has finished for PR 7238 at commit 4caf293.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait ExpectsInputTypes extends Expression
    • trait ImplicitCastInputTypes extends ExpectsInputTypes
    • trait Unevaluable extends Expression
    • trait Nondeterministic extends Expression
    • trait CodegenFallback extends Expression
    • case class Hex(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Unhex(child: Expression) extends UnaryExpression with ImplicitCastInputTypes

@viirya
Member Author

viirya commented Jul 21, 2015

ping @liancheng

Contributor

I'd propose renaming this configuration to spark.sql.parquet.respectSummaryFiles.
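For reference, if that rename were adopted, enabling the option might look like this in spark-defaults.conf (a sketch based on the name proposed above; the final key name could differ):

```
spark.sql.parquet.respectSummaryFiles  true
```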

@SparkQA

SparkQA commented Jul 22, 2015

Test build #38077 has finished for PR 7238 at commit ea8f6e5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 22, 2015

Test build #1166 has finished for PR 7238 at commit ea8f6e5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 23, 2015

retest this please.

@SparkQA

SparkQA commented Jul 23, 2015

Test build #38154 has finished for PR 7238 at commit ea8f6e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 23, 2015

Test build #71 has finished for PR 7238 at commit ea8f6e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 23, 2015

@liancheng The test failure looks related to #7421 (comment). Could you take a look at it? Thanks.

@liancheng
Contributor

@viirya Yeah, this issue has been causing random build failures recently. I'm planning to look into it right after the 1.5 code freeze deadline (early next week). Let's just retest this PR for now.

@viirya
Member Author

viirya commented Jul 23, 2015

retest this please.

…erge

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
@SparkQA

SparkQA commented Jul 24, 2015

Test build #38346 has finished for PR 7238 at commit 4eb2f00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 24, 2015

Test build #38348 has finished for PR 7238 at commit d4ed7e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 24, 2015

Instead of finding the partitions of the summary files, we now just take the parent paths of the summary files. We then parse the partitions of the part-files and filter them, keeping the partition paths that are not covered by any summary-file parent path (i.e., they have no summary file alongside them). The part-files under those paths are the ones that still need to be read for schema merging.
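As a rough illustration of this filtering, here is a plain-Python sketch with a hypothetical helper name (not Spark's actual implementation):

```python
import os

def part_files_needing_merge(summary_files, part_files):
    """Part-files whose directory has no summary file must still be
    read for schema merging; the rest can be skipped."""
    # Parent directories that contain a summary file.
    summary_dirs = {os.path.dirname(p) for p in summary_files}
    # Keep only part-files living in directories without a summary file.
    return [p for p in part_files if os.path.dirname(p) not in summary_dirs]

summaries = ["base/key=1/_metadata"]
parts = ["base/key=1/part-00000.parquet", "base/key=2/part-00000.parquet"]
# Only the part-file under key=2 (no summary file) needs to be read.
print(part_files_needing_merge(summaries, parts))
# -> ['base/key=2/part-00000.parquet']
```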

@viirya
Member Author

viirya commented Jul 26, 2015

ping @liancheng

Any other comments?

…erge

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
@SparkQA

SparkQA commented Jul 27, 2015

Test build #38517 has finished for PR 7238 at commit afc2fa1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 27, 2015

It should be an unrelated failure.

@viirya
Member Author

viirya commented Jul 27, 2015

retest this please.

@SparkQA

SparkQA commented Jul 27, 2015

Test build #112 has finished for PR 7238 at commit afc2fa1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2015

Test build #38530 has finished for PR 7238 at commit afc2fa1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 28, 2015

ping @liancheng Is this ready to merge now? Thanks.

@liancheng
Contributor

@viirya Sorry, I was a little bit busy with other stuff. I'm going through a final check now.

@viirya
Member Author

viirya commented Jul 29, 2015

Ok, @liancheng, many thanks.

Contributor

This is still problematic. For a non-partitioned table like this:

base/
  _metadata
  _common_metadata
  file-1
  file-2
  ...

parsePartitions always returns an empty PartitionSpec containing no Partitions, thus dataPathsWithoutSummaries is always empty, and we always merge all part-files, which is not expected behavior.

However, as I suggested in the other comment, we can probably just remove this method.
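The failure mode described above can be sketched in a few lines of plain Python (hypothetical names loosely mirroring parsePartitions and dataPathsWithoutSummaries, not the real Scala code):

```python
def data_paths_without_summaries(partition_paths, summary_parent_dirs):
    # Partition directories that have no summary file alongside them.
    return [p for p in partition_paths if p not in summary_parent_dirs]

# Partitioned table: key=2 has no summary file and is correctly detected.
print(data_paths_without_summaries(["base/key=1", "base/key=2"],
                                   {"base/key=1"}))
# -> ['base/key=2']

# Non-partitioned table: parsePartitions finds no partitions at all, so the
# input list is empty and the result is always empty -- the partition-based
# check cannot detect summary-less part-files in this layout.
print(data_paths_without_summaries([], {"base"}))  # -> []
```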

@liancheng
Contributor

(Edited for typos and some rewording.)

Hey @viirya, sorry that I lied, this might not be the FINAL check yet...

Actually I began to regret one of my previous decisions, namely merging part-files that don't have corresponding summary files. This is mostly because there are too many cases to consider if we assume summary files may be missing, which makes the behavior of this configuration pretty unintuitive. Parquet summary files can be missing under various corner cases (I can easily name at least five of them right now), so it's hard to track and explain the behavior, and it may confuse Spark users who are not familiar with Parquet implementation details. The key problem here is that Parquet summary files are not written or accessed in an atomic manner. That's one of the most important reasons why the Parquet team is actually trying to get rid of the summary files entirely.

Since the configuration is named "respectSummaryFiles", it seems more natural and intuitive to assume that summary files are ALWAYS properly generated by ALL Parquet write jobs when this configuration is turned on. To be more specific, given one or more Parquet input paths, we may find one or more summary files, and the metadata gathered by merging all these summary files should reflect the real schema of the given Parquet dataset. Only in this case can we really "respect" existing summary files.

So my suggestion here is that, when the "respectSummaryFiles" configuration is turned on, we only collect the summary files, merge the schemas read from them, and use the merged schema as the final result. Of course, this configuration should still be turned off by default. We also need to document it carefully and add an "expert only" tag to it.

I still consider this configuration quite useful, because even if you have a dirty Parquet dataset without summary files, or with incorrect summary files, you can still repair it quite easily. Essentially you only need to call ParquetOutputFormat.writeMetaDataFile(), which either generates correct summary files for the entire dataset or deletes the ill-formed summary files if it fails to merge all user-defined key-value metadata.

What do you think? Again, sorry for my late review and for the extra effort you spent implementing all those intermediate versions...
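The suggested behavior can be sketched as follows (a toy model in plain Python where schemas are {column: type} dicts; real Parquet footers and Spark's schema-merging logic are considerably more involved):

```python
def merge_summary_schemas(files_with_schemas):
    """Read only summary files and merge their schemas into the final
    result, skipping part-files entirely (toy sketch, not Spark code)."""
    SUMMARY_NAMES = {"_metadata", "_common_metadata"}
    merged = {}
    for path, schema in files_with_schemas:
        name = path.rsplit("/", 1)[-1]
        if name not in SUMMARY_NAMES:  # respectSummaryFiles: skip part-files
            continue
        for col, typ in schema.items():
            # A summary file disagreeing on a column's type is an error.
            if merged.get(col, typ) != typ:
                raise ValueError(f"conflicting type for column {col}")
            merged[col] = typ
    return merged

files = [
    ("base/_metadata", {"a": "int"}),
    ("base/_common_metadata", {"a": "int", "b": "string"}),
    ("base/part-00000.parquet", {"c": "double"}),  # never read
]
print(merge_summary_schemas(files))  # -> {'a': 'int', 'b': 'string'}
```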

@viirya
Member Author

viirya commented Jul 30, 2015

@liancheng Thank you for clarifying.

No problem at all. In fact, at the beginning of this PR, the proposal was to only merge summary files and skip all part-files. This configuration exists to gain better performance when reading a lot of Parquet files. It is disabled by default, and we document it so that users should be very sure what it means before turning it on. I also agree that we can't take care of all the cases where summary files are missing alongside part-files. So your suggestion works better for this PR.

I will update this soon.

@SparkQA

SparkQA commented Jul 30, 2015

Test build #38970 has finished for PR 7238 at commit 8816f44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…erge

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala
@SparkQA

SparkQA commented Jul 30, 2015

Test build #39023 has finished for PR 7238 at commit 71d5b5f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 30, 2015

retest this please.

@SparkQA

SparkQA commented Jul 30, 2015

Test build #39031 has finished for PR 7238 at commit 71d5b5f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 30, 2015

Test build #160 has finished for PR 7238 at commit 71d5b5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

Thanks! Merging to master.

@asfgit asfgit closed this in 6175d6c Jul 30, 2015
@viirya viirya deleted the option_partfile_merge branch December 27, 2023 18:18