
Conversation

@oliviertoupin

Here is my workaround for SPARK-6910.

It doesn't prune early; it takes a different approach. Our investigation showed that during a query with HiveContext on a partitioned Parquet table with many partitions and many files per partition (~40K files total), the metastore was poked a lot, but more importantly the datanodes were solicited A LOT. It turns out that when running a query, Spark would read the footers of ALL Parquet files in order to build the schema. This PR reads the footers lazily, only when they are actually needed.
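
For illustration, here is a minimal Scala sketch of the lazy-footer idea. This is not the actual code in the PR: `LazyFooters` is a made-up name, and the parquet-mr 1.7+ package names are assumed.

```scala
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileStatus
import org.apache.parquet.hadoop.{Footer, ParquetFileReader}

// Illustrative only: defer reading footers until something actually needs them
// (e.g. schema merging), instead of reading all ~40K footers when the relation
// is first resolved.
class LazyFooters(conf: Configuration, files: Seq[FileStatus]) {
  // `lazy val`: footers are read on first access, not at construction time.
  lazy val footers: Seq[Footer] =
    ParquetFileReader.readAllFootersInParallel(conf, files.asJava).asScala
}
```

With a plain `val` (or an eager call at construction time), every query against the table would pay the full footer-reading cost up front.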

Also:

  • The metastore already contains the schema, so we use it when available (always when using metastore tables?) to avoid reading the footers.
  • We don't need to poke that many files to figure out the schema unless we do schema merging, and skipping footers is the normal path (w/o schema merging). However, there is a bug, or at least a big caveat, in parquet-mr: when Spark uses readAllFootersInParallelUsingSummaryFiles, it still ends up reading ALL the footers for the table, because when there is no summary file parquet-mr reverts to readAllFootersInParallel. (See the sketch below.)
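
To make that decision concrete, here is a hypothetical sketch; the function and parameter names are mine, not Spark's internals. The idea is to prefer the metastore schema when it exists and schema merging is off, and only fall back to the expensive footer scan otherwise.

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical helper, not actual Spark code.
def resolveSchema(
    metastoreSchema: Option[StructType],        // schema already stored in the Hive metastore
    mergeSchema: Boolean,                       // whether schema merging was requested
    readFooterSchemas: () => Seq[StructType],   // expensive: touches Parquet file footers
    mergeAll: Seq[StructType] => StructType     // schema-merging logic (left abstract here)
): StructType = metastoreSchema match {
  case Some(schema) if !mergeSchema =>
    schema                                      // no footer reads at all
  case _ =>
    // Schema merging (or no metastore schema): footers must be read. Beware the
    // parquet-mr caveat above: readAllFootersInParallelUsingSummaryFiles falls
    // back to readAllFootersInParallel when no summary file exists, so this can
    // still read every footer in the table.
    mergeAll(readFooterSchemas())
}
```

The key point is that the footer scan stays behind a function value, so it is only invoked on the branch that genuinely needs it.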

Improvement:

One of our benchmark queries used to take 40s on the first call and then a few seconds once all the metadata was cached; it now takes 8s consistently. It is not a big query; the target table just has ~40K files.

I'm not sure if it's mergeable as-is: this PR fits our requirements and we use it in our build, but maybe not those of the whole community.

Olivier Toupin added 2 commits June 26, 2015 17:08
…er of poked files drastically when a table has a lot of files and/or partitions
Author

Removed in order to prevent eager evaluation of footers. However, I'm not even sure this code is helpful. Could someone provide a real-world scenario where it would be useful? There is no performance hit in our workload.

NAVER - http://www.naver.com/

The mail you sent to [email protected] <Re: [spark] [SPARK-6910] [WiP] Reduce number of operations to the cluster. (#7049)> could not be delivered for the following reason:

The recipient has blocked your mail.

@piaozhexiu

@oliviertoupin thanks for sharing your patch.

But I think SPARK-6910 is about (1) avoiding the getAllPartitions() call to the Hive metastore, not about (2) improving file listing.

Can you open a separate jira for your patch? I think both are good optimizations though.

@oliviertoupin
Author

As for getAllPartitions(), my tests show that it's not so slow, and I think the long latency users are reporting in this bug is actually caused by reading the footers. At least, that's what we experienced. With this patch, the latency issues are gone.

@piaozhexiu

@oliviertoupin I have a table with 1.6M partitions in production, and I cannot even query it because it gets stuck in the Hive metastore call.

This JIRA originally comes from the following conversation on the dev mailing list:
http://goo.gl/2DHPqb

@marmbrus
Contributor

Yeah, it seems this is more closely related to https://issues.apache.org/jira/browse/SPARK-8125

/cc @liancheng

@AmplabJenkins

Can one of the admins verify this patch?

@marmbrus
Contributor

Can we close this now that #7396 is merged?

@oliviertoupin
Author

I think so. I will try #7396 and reopen a PR if the issue is still present in our use case.
