[SPARK-8756][SQL] Keep cached information and avoid re-calculating footers in ParquetRelation2 #7154
Conversation
Test build #36245 has finished for PR 7154 at commit
cc @liancheng
Thanks for contributing this patch! I have two high-level comments here:
In general, this PR can be a good complement to #7396.
@liancheng thanks for commenting. I've noticed #7396 and I think it will bring a great improvement to Parquet performance. In fact, I have also submitted another PR, #7238, to improve Parquet schema merging performance; in my tests it improves performance considerably. #7238 can be a complement to #7396 as well, but it may conflict heavily with #7396, so I will refactor it after #7396 is merged.
I will add the check for
@liancheng I've added the check you suggested. Please take a look when you have time. Thanks.
Test build #37812 has finished for PR 7154 at commit
#7396 has been merged.
@marmbrus thanks, I'm going to resolve the conflicts.
…quet_relation Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
Test build #37899 has finished for PR 7154 at commit
I think it should be an unrelated failure.
retest this please. |
Test build #39 has finished for PR 7154 at commit
ping @liancheng |
Should we do a deep copy here? Currently it's OK because cachedLeafStatuses() always returns a new instance of Set[FileStatus], but it's possible that we will use a mutable set object in the future. In that case, the != predicate above would no longer detect changes reliably.
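To make the aliasing pitfall concrete, here is a minimal self-contained Scala sketch. `FileStatusLike` is a hypothetical stand-in for Hadoop's `FileStatus`, and the whole snippet is an illustration, not the actual ParquetRelation2 code:

```scala
import scala.collection.mutable

// Hypothetical stand-in for org.apache.hadoop.fs.FileStatus.
case class FileStatusLike(path: String, len: Long, modTime: Long)

object AliasingPitfall {
  def main(args: Array[String]): Unit = {
    val listed = mutable.Set(FileStatusLike("part-0", 10L, 1L))

    // Pitfall: the "cache" is just a reference to the same mutable set.
    val cachedByReference = listed
    // Fix: the cache is a defensive copy taken at caching time.
    val cachedCopy = listed.toSet

    // A new file appears in the directory listing.
    listed += FileStatusLike("part-1", 20L, 2L)

    // The aliased cache can never observe a difference: both names point
    // at the very same (already mutated) set, so the change is invisible.
    assert(cachedByReference == listed)
    // The defensive copy correctly reports that something changed.
    assert(cachedCopy != listed.toSet)
  }
}
```

This is why the comparison currently works: `cachedLeafStatuses()` happens to hand back a fresh instance each time, which behaves like the defensive copy above.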
…nd newly loaded one.
Test build #38062 has finished for PR 7154 at commit
Unfortunately there's still a bug here :) We also need to remove those files that only exist in the old cache (namely removed files). This happens when an existing directory is overwritten.
I think we can first keep both the old cache and the result of cachedLeafStatuses(), then filter out the updated and new files, and finally update the old FileStatus cache.
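The update scheme suggested above could be sketched as follows. The names (`diff`, `FileStatusLike`) and the flat record are assumptions for illustration, not the actual Spark code:

```scala
// Hypothetical stand-in for org.apache.hadoop.fs.FileStatus.
case class FileStatusLike(path: String, len: Long, modTime: Long)

object CacheDiff {
  /** Compare a fresh listing against the old cache: return the files whose
    * footers must be re-read (new or updated), plus the refreshed cache,
    * from which removed files have disappeared automatically. */
  def diff(
      oldCache: Map[String, FileStatusLike],
      freshListing: Seq[FileStatusLike])
    : (Seq[FileStatusLike], Map[String, FileStatusLike]) = {
    val toReRead = freshListing.filter { f =>
      oldCache.get(f.path) match {
        case Some(old) => old != f // updated (size or mod time changed)
        case None      => true     // newly added file
      }
    }
    // The new cache is exactly the fresh listing, so files present only in
    // the old cache (i.e. removed files) are dropped.
    val newCache = freshListing.map(f => f.path -> f).toMap
    (toReRead, newCache)
  }

  def main(args: Array[String]): Unit = {
    val old = Map(
      "a" -> FileStatusLike("a", 1L, 1L),
      "b" -> FileStatusLike("b", 2L, 1L),
      "d" -> FileStatusLike("d", 5L, 1L)) // "d" will be removed
    val fresh = Seq(
      FileStatusLike("a", 1L, 1L),        // unchanged
      FileStatusLike("b", 3L, 2L),        // overwritten
      FileStatusLike("c", 4L, 2L))        // newly added
    val (reRead, cache) = diff(old, fresh)
    assert(reRead.map(_.path) == Seq("b", "c"))
    assert(cache.keySet == Set("a", "b", "c"))
  }
}
```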
I guess you are trying to keep only the FileStatuses of those files that need to be touched during schema merging here. Actually, the FileStatus cache must stay consistent with the files stored on the file system, because we also inject the cache into ParquetInputFormat in ParquetRelation2.buildScan to avoid calling listStatus repeatedly there.
Hmm, I am thinking: if we only read the footers of the updated and newly added files, the merged schema may be incorrect. The same goes for removed files. If we don't re-merge the schemas from all footers, the result may not be correct.
So this PR should check whether we need to re-read all footers and re-merge the schema, based on whether the FileStatuses have changed.
This is a good point. After reading footers and merging schemas of new files and updated files, we also need to merge the resulting schema with the old schema, because some columns may be missing in new and/or updated files.
Actually I found it might be difficult to define the "correctness" of the merged schema. Take the following scenario as an example:
- Initially there is file `f0`, which comes with a single column `c0`. Merged schema: `c0`.
- File `f1` is added, which contains a single column `c1`. Merged schema: `c0`, `c1`.
- `f0` is removed. Which is the "correct" merged schema now? (a) `c0`, `c1`, or (b) `c1`?
I tend to use (a), because removing existing columns can be dangerous and may confuse downstream systems. But currently Spark SQL uses (b). Also, we need to take the metastore schema into account for Parquet relations converted from metastore Parquet tables.
I think this issue is too complicated to be fixed in this PR. I agree with you that we should keep this PR simple and just re-read all the footers for now. It's already strictly better than the current implementation, not to mention that schema merging has been significantly accelerated by #7396.
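Modelling a schema as just a set of column names (a deliberate simplification; real Parquet schema merging also reconciles types and nesting), the scenario above can be replayed as:

```scala
object SchemaMergeScenario {
  type Schema = Set[String]

  // Merging simplified schemas is just a set union of column names.
  def merge(schemas: Seq[Schema]): Schema =
    schemas.foldLeft(Set.empty[String])(_ union _)

  def main(args: Array[String]): Unit = {
    val f0: Schema = Set("c0")
    val f1: Schema = Set("c1")

    // Step 1: only f0 exists.
    assert(merge(Seq(f0)) == Set("c0"))

    // Step 2: f1 is added.
    val before = merge(Seq(f0, f1))
    assert(before == Set("c0", "c1"))

    // Step 3: f0 is removed.
    // Option (b), what Spark SQL currently does: re-merge surviving files.
    assert(merge(Seq(f1)) == Set("c1"))
    // Option (a): also fold in the previously merged schema, so columns
    // that were once visible never silently disappear.
    assert(merge(Seq(before, f1)) == Set("c0", "c1"))
  }
}
```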
Test build #38188 has finished for PR 7154 at commit
Test build #38212 has finished for PR 7154 at commit
Test build #38229 has finished for PR 7154 at commit
ping @liancheng |
Please revert these two indentation changes.
Fixed. Thanks.
LGTM except for a minor styling issue. Will merge this to master once it's fixed. Thanks for working on this!
Merged to master. |
Ok. Thanks.
Test build #38345 has finished for PR 7154 at commit
JIRA: https://issues.apache.org/jira/browse/SPARK-8756
Currently, in ParquetRelation2, footers are re-read every time refresh() is called. But reading all footers is expensive when there are many partitions, so we can first check whether anything has possibly changed before doing the reading. This PR fixes this by keeping some cached information (the leaf FileStatuses) and using it for that check.
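The overall idea can be sketched in a few lines of Scala. All names here (`FooterCacheSketch`, `listLeafStatuses`, `readAndMergeFooters`, `FileStatusLike`) are hypothetical, chosen for illustration rather than taken from the actual ParquetRelation2 implementation:

```scala
// Hypothetical stand-in for org.apache.hadoop.fs.FileStatus:
// path, length, and modification time are what the check compares.
case class FileStatusLike(path: String, len: Long, modTime: Long)

class FooterCacheSketch(listLeafStatuses: () => Set[FileStatusLike]) {
  private var cachedStatuses: Set[FileStatusLike] = Set.empty
  var footerReads: Int = 0 // exposed only so the sketch is observable

  // Stands in for the expensive footer reading + schema merging step.
  private def readAndMergeFooters(): Unit = { footerReads += 1 }

  /** Re-read all footers only when the file listing actually changed. */
  def refresh(): Unit = {
    val current = listLeafStatuses()
    if (current != cachedStatuses) {
      cachedStatuses = current
      readAndMergeFooters()
    }
  }
}

object FooterCacheSketch {
  def main(args: Array[String]): Unit = {
    var files = Set(FileStatusLike("part-0", 10L, 1L))
    val rel = new FooterCacheSketch(() => files)
    rel.refresh() // first refresh reads footers
    rel.refresh() // nothing changed: footers are not re-read
    assert(rel.footerReads == 1)
    files += FileStatusLike("part-1", 20L, 2L)
    rel.refresh() // a new file appeared: footers are re-read once more
    assert(rel.footerReads == 2)
  }
}
```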