[SPARK-6910] [WiP] Reduce number of operations to the cluster. #7049
Conversation
…er of poked files drastically when a table has a lot of files and/or partitions
Removed in order to prevent eager evaluation of footers. However, I'm not even sure this code is helpful. Could someone provide a real-world scenario where it would be useful? There is no performance hit in our workload.
@oliviertoupin thanks for sharing your patch. But I think SPARK-6910 is about avoiding the Hive metastore call that fetches all partitions. Can you open a separate JIRA for your patch? I think both are good optimizations, though.
As for getAllPartitions(), my tests show that it's not so slow, and I think what users are reporting in this bug (long latency) is actually caused by the reading of the footers. At least, that's what we experienced. With this patch, the latency issues are gone.
@oliviertoupin I have a table with 1.6M partitions in production and I cannot even query it because it gets stuck in the Hive metastore call. This JIRA originally came out of a conversation on the dev mailing list.
Yeah, it seems this is more closely related to https://issues.apache.org/jira/browse/SPARK-8125. /cc @liancheng
Can one of the admins verify this patch?
Can we close this now that #7396 is merged?
I think so. I will try #7396 and reopen a PR if the issue is still present in our use case.
Here is my workaround for SPARK-6910.
It doesn't prune early; it takes a different approach. Our investigation showed that during a query with HiveContext on a partitioned Parquet table with many partitions and many files per partition (totalling ~40K files), the metastore was poked a lot, but more importantly the datanodes were solicited A LOT. It turns out that when running a query, Spark would read the footers of ALL Parquet files in order to build the schema. This PR reads the footers lazily, and only if needed.
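As a rough illustration of the idea (a minimal sketch under assumed names, not the actual diff in this PR), the footer read can be hidden behind a `lazy val`, so parquet-mr only touches a file's footer when something actually asks for it:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileStatus
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.metadata.ParquetMetadata
import org.apache.parquet.schema.MessageType

// Hypothetical wrapper: the footer is only read from HDFS on first access.
class LazyFooter(conf: Configuration, file: FileStatus) {
  lazy val footer: ParquetMetadata =
    ParquetFileReader.readFooter(conf, file.getPath, ParquetMetadataConverter.NO_FILTER)
}

// If the schema can be taken from a single representative file (or from cached
// metadata), the footers of the remaining files are never materialized.
def schemaFromOneFooter(conf: Configuration, files: Seq[FileStatus]): MessageType = {
  val footers = files.map(f => new LazyFooter(conf, f))
  // Only the first footer is actually read; the other LazyFooters stay untouched.
  footers.head.footer.getFileMetaData.getSchema
}
```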
Also:
`readAllFootersInParallelUsingSummaryFiles` will end up reading ALL the footers for that table anyway: when there is no summary file, parquet-mr reverts to `readAllFootersInParallel`.
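For reference, a sketch of the eager call path this refers to (the surrounding method and variable names are illustrative, assuming the parquet-mr `ParquetFileReader` API of that era):

```scala
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileStatus
import org.apache.parquet.hadoop.{Footer, ParquetFileReader}

// When no _metadata/_common_metadata summary file exists, this degrades to
// readAllFootersInParallel and opens the footer of every data file, i.e. tens of
// thousands of extra reads against the datanodes for a ~40K-file table.
def eagerFooters(conf: Configuration, partFiles: Seq[FileStatus]): Seq[Footer] =
  ParquetFileReader
    .readAllFootersInParallelUsingSummaryFiles(conf, partFiles.asJava, /* skipRowGroups = */ true)
    .asScala
    .toSeq
```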
Improvement:
One of our benchmark queries that used to take 40s on the first call, and then a few seconds once all the metadata was cached, now takes 8s consistently. This is not a big query; it's just that the target table has ~40K files.
I'm not sure whether it's mergeable as-is; this PR fits our requirements and we use it in our build, but it may not fit those of the whole community.