[SPARK-8756][SQL] Keep cached information and avoid re-calculating footers in ParquetRelation2 #7154
Conversation
Test build #36245 has finished for PR 7154 at commit
cc @liancheng
Thanks for contributing this patch! I have two high-level comments here:
In general, this PR can be a good complement to #7396.
@liancheng thanks for commenting. I've noticed #7396 and I think it will bring a great improvement to Parquet performance. In fact, I have also submitted another PR, #7238, to improve Parquet schema merging performance; in my tests it improves performance considerably. #7238 can be a complement to #7396 as well, but it may conflict heavily with #7396, so I will refactor it after #7396 is merged.
I will add the check for
@liancheng I've added the check you suggested. Please take a look when you have time. Thanks.
Test build #37812 has finished for PR 7154 at commit
#7396 has been merged.
@marmbrus thanks, I'm going to resolve the conflicts.
…quet_relation Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
Test build #37899 has finished for PR 7154 at commit
I think it should be an unrelated failure.
retest this please. |
Test build #39 has finished for PR 7154 at commit
ping @liancheng |
Should we do a deep copy here? Currently it's OK because cachedLeafStatuses() always returns a new instance of Set[FileStatus], but it's possible that we will use a mutable set object in the future. In that case, the != predicate above would no longer detect changes reliably.
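To make the aliasing pitfall concrete, here is a minimal self-contained Scala sketch. `FileStatusLike` is a hypothetical stand-in for Hadoop's `FileStatus`, and the whole snippet is an illustration, not the actual ParquetRelation2 code:

```scala
import scala.collection.mutable

// Hypothetical stand-in for org.apache.hadoop.fs.FileStatus.
case class FileStatusLike(path: String, len: Long, modTime: Long)

object AliasingPitfall {
  def main(args: Array[String]): Unit = {
    val listed = mutable.Set(FileStatusLike("part-0", 10L, 1L))

    // Pitfall: the "cache" is just a reference to the same mutable set.
    val cachedByReference = listed
    // Fix: the cache is a defensive copy taken at caching time.
    val cachedCopy = listed.toSet

    // A new file appears in the directory listing.
    listed += FileStatusLike("part-1", 20L, 2L)

    // The aliased cache can never observe a difference: both names point
    // at the very same (already mutated) set, so the change is invisible.
    assert(cachedByReference == listed)
    // The defensive copy correctly reports that something changed.
    assert(cachedCopy != listed.toSet)
  }
}
```

This is why the comparison currently works: `cachedLeafStatuses()` happens to hand back a fresh instance each time, which behaves like the defensive copy above.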
…nd newly loaded one.
Test build #38062 has finished for PR 7154 at commit
Unfortunately there's still a bug here :) We also need to remove those files that only exist in the old cache (namely removed files). This happens when an existing directory is overwritten.
I think we can first keep both the old cache and the result of cachedLeafStatuses(), then filter out the updated and new files, and finally update the old FileStatus cache.
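The update scheme suggested above could be sketched as follows. The names (`diff`, `FileStatusLike`) and the flat record are assumptions for illustration, not the actual Spark code:

```scala
// Hypothetical stand-in for org.apache.hadoop.fs.FileStatus.
case class FileStatusLike(path: String, len: Long, modTime: Long)

object CacheDiff {
  /** Compare a fresh listing against the old cache: return the files whose
    * footers must be re-read (new or updated), plus the refreshed cache,
    * from which removed files have disappeared automatically. */
  def diff(
      oldCache: Map[String, FileStatusLike],
      freshListing: Seq[FileStatusLike])
    : (Seq[FileStatusLike], Map[String, FileStatusLike]) = {
    val toReRead = freshListing.filter { f =>
      oldCache.get(f.path) match {
        case Some(old) => old != f // updated (size or mod time changed)
        case None      => true     // newly added file
      }
    }
    // The new cache is exactly the fresh listing, so files present only in
    // the old cache (i.e. removed files) are dropped.
    val newCache = freshListing.map(f => f.path -> f).toMap
    (toReRead, newCache)
  }

  def main(args: Array[String]): Unit = {
    val old = Map(
      "a" -> FileStatusLike("a", 1L, 1L),
      "b" -> FileStatusLike("b", 2L, 1L),
      "d" -> FileStatusLike("d", 5L, 1L)) // "d" will be removed
    val fresh = Seq(
      FileStatusLike("a", 1L, 1L),        // unchanged
      FileStatusLike("b", 3L, 2L),        // overwritten
      FileStatusLike("c", 4L, 2L))        // newly added
    val (reRead, cache) = diff(old, fresh)
    assert(reRead.map(_.path) == Seq("b", "c"))
    assert(cache.keySet == Set("a", "b", "c"))
  }
}
```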
I guess you are trying to keep only the FileStatuses of those files that need to be touched during schema merging here. Actually, the FileStatus cache must stay consistent with the files stored on the file system, because we also inject the cache into ParquetInputFormat in ParquetRelation2.buildScan to avoid calling listStatus repeatedly there.
Hmm, I am thinking: if we only read the footers of the updated and newly added files, the merged schema may be incorrect. The same goes for removed files. If we don't re-merge the schemas from all footers, the result may not be correct.
So this PR should check whether we need to re-read all footers and re-merge the schema, based on whether the FileStatuses have changed.
This is a good point. After reading footers and merging schemas of new files and updated files, we also need to merge the resulting schema with the old schema, because some columns may be missing in new and/or updated files.
Actually I found it might be difficult to define the "correctness" of the merged schema. Take the following scenario as an example:
- Initially there is file `f0`, which comes with a single column `c0`. Merged schema: `c0`.
- File `f1` is added, which contains a single column `c1`. Merged schema: `c0`, `c1`.
- `f0` is removed. Which is the "correct" merged schema now? (a) `c0`, `c1`, or (b) `c1`?
I tend to use (a), because removing existing columns can be dangerous and may confuse downstream systems. But currently Spark SQL uses (b). Also, we need to take the metastore schema into account for Parquet relations converted from metastore Parquet tables.
I think this issue is too complicated to be fixed in this PR. I agree with you that we should keep this PR simple and just re-read all the footers for now. It's already strictly better than the current implementation, not to mention that schema merging has been significantly accelerated by #7396.
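Modelling a schema as just a set of column names (a deliberate simplification; real Parquet schema merging also reconciles types and nesting), the scenario above can be replayed as:

```scala
object SchemaMergeScenario {
  type Schema = Set[String]

  // Merging simplified schemas is just a set union of column names.
  def merge(schemas: Seq[Schema]): Schema =
    schemas.foldLeft(Set.empty[String])(_ union _)

  def main(args: Array[String]): Unit = {
    val f0: Schema = Set("c0")
    val f1: Schema = Set("c1")

    // Step 1: only f0 exists.
    assert(merge(Seq(f0)) == Set("c0"))

    // Step 2: f1 is added.
    val before = merge(Seq(f0, f1))
    assert(before == Set("c0", "c1"))

    // Step 3: f0 is removed.
    // Option (b), what Spark SQL currently does: re-merge surviving files.
    assert(merge(Seq(f1)) == Set("c1"))
    // Option (a): also fold in the previously merged schema, so columns
    // that were once visible never silently disappear.
    assert(merge(Seq(before, f1)) == Set("c0", "c1"))
  }
}
```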
Test build #38188 has finished for PR 7154 at commit
Test build #38212 has finished for PR 7154 at commit
Test build #38229 has finished for PR 7154 at commit
ping @liancheng |
Please revert these two indentation changes.
Fixed. Thanks.
LGTM except for a minor styling issue. Will merge this to master once it's fixed. Thanks for working on this!
Merged to master. |
Ok. Thanks.
Test build #38345 has finished for PR 7154 at commit
JIRA: https://issues.apache.org/jira/browse/SPARK-8756
Currently, in ParquetRelation2, footers are re-read every time refresh() is called. But reading all footers is expensive when there are many partitions, so we can first check whether anything has possibly changed before doing the reading. This PR fixes this by keeping some cached information (the leaf FileStatuses) and using it for that check.
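The overall idea can be sketched in a few lines of Scala. All names here (`FooterCacheSketch`, `listLeafStatuses`, `readAndMergeFooters`, `FileStatusLike`) are hypothetical, chosen for illustration rather than taken from the actual ParquetRelation2 implementation:

```scala
// Hypothetical stand-in for org.apache.hadoop.fs.FileStatus:
// path, length, and modification time are what the check compares.
case class FileStatusLike(path: String, len: Long, modTime: Long)

class FooterCacheSketch(listLeafStatuses: () => Set[FileStatusLike]) {
  private var cachedStatuses: Set[FileStatusLike] = Set.empty
  var footerReads: Int = 0 // exposed only so the sketch is observable

  // Stands in for the expensive footer reading + schema merging step.
  private def readAndMergeFooters(): Unit = { footerReads += 1 }

  /** Re-read all footers only when the file listing actually changed. */
  def refresh(): Unit = {
    val current = listLeafStatuses()
    if (current != cachedStatuses) {
      cachedStatuses = current
      readAndMergeFooters()
    }
  }
}

object FooterCacheSketch {
  def main(args: Array[String]): Unit = {
    var files = Set(FileStatusLike("part-0", 10L, 1L))
    val rel = new FooterCacheSketch(() => files)
    rel.refresh() // first refresh reads footers
    rel.refresh() // nothing changed: footers are not re-read
    assert(rel.footerReads == 1)
    files += FileStatusLike("part-1", 20L, 2L)
    rel.refresh() // a new file appeared: footers are re-read once more
    assert(rel.footerReads == 2)
  }
}
```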