[SPARK-22913][SQL] Improved Hive Partition Pruning #20100

ameent · 2017-12-28T03:29:59Z

Adding support for Timestamp and Fractional column types. The pruning
of partitions of these types is being put behind default options
that are set to false, as it's not clear which hive metastore
implementations support predicates on these types of columns.

The AWS Glue Catalog http://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html
does support filters on timestamp and fractional columns and pushing these filters
down to it has significant performance improvements in our use cases.

As part of this change the hive pruning suite is renamed (a TODO) and 2
ignored tests are added that will validate the functionality of partition
pruning through integration tests. The tests are ignored since the integration
test setup uses a Hive client that throws errors when it sees partition column
filters on non-integral and non-string columns.

Unit tests are added to validate filtering, which are active.

What changes were proposed in this pull request?

See https://issues.apache.org/jira/browse/SPARK-22913

This change addresses the JIRA. I'm looking for feedback on the change itself and whether the config values I added make sense. I was not able to find official Hive specification on which filters a metastore needs to support and as such, feel hesitant to turn on this behavior by default. Piggybacking on top of "advancedPartitionPruning" option felt wrong because that config toggles whether "in (...)" queries are expanded in a series of "ors" and I don't want people to be forced to turn off that behavior alongside not pushing timestamp predicates.

How was this patch tested?

This change is tested via unit tests, modified integration tests (that are ignored) and manual tests on EMR 5.10 running against AWS Glue Catalog as the Hive metastore.

Adding support for Timestamp and Fractional column types. The pruning of partitions of these types is being put behind default options that are set to false, as it's not clear which hive metastore implementations support predicates on these types of columns. The AWS Glue Catalog http://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html does support filters on timestamp and fractional columns and pushing these filters down to it has significant performance improvements in our use cases. As part of this change the hive pruning suite is renamed (a TODO) and 2 ignored tests are added that will validate the functionality of partition pruning through integration tests. The tests are ignored since the integration test setup uses a Hive client that throws errors when it sees partition column filters on non-integral and non-string columns. Unit tests are added to validate filtering, which are active.

ameent · 2018-01-11T05:29:20Z

Any updates on this?

ameent · 2018-01-11T05:32:10Z

@srowen can you please help find someone to review this?

ameent · 2018-01-17T15:26:11Z

CCing @cloud-fan @tdas @HyukjinKwon @xubo245 I need help finding someone who can provide feedback on this pull request.

This change reduces run-time of one of our use cases from 6 minutes to around 11 seconds. We have tables with large # of partitions (over 1 million) and retrieving all partitions over the wire to the master node does add considerable amount of time.

cloud-fan · 2018-01-17T16:16:31Z

instead of having configs, we should delegate the partition pruning logic to HiveShim and only support these types for certain hive versions.

ameent · 2018-01-17T17:47:55Z

Thanks @cloud-fan. Do you propose that we model "AWS Glue" as its own Hive version?

AmplabJenkins · 2018-06-09T00:16:29Z

Can one of the admins verify this patch?

cloud-fan · 2018-06-09T01:29:48Z

Spark doesn't officially support Glue, I think Glue is plugged into Spark by pretending itself as a certain hive version, and that hive version should support timestamp and fraction.

HyukjinKwon · 2018-07-13T07:06:21Z

Sorry for a late response. I am now checking PRs queued in my list.
I agree with @cloud-fan's for now and I think we should better leave this closed.

HyukjinKwon · 2018-07-13T07:07:27Z

@ameent BTW, we can't directly close this. I'd appreciate it if you manually close this.

srowen mentioned this pull request Jul 18, 2018

[INFRA] Close stale PR #21781

Closed

asfgit closed this in 1a4fda8 Jul 19, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-22913][SQL] Improved Hive Partition Pruning #20100

[SPARK-22913][SQL] Improved Hive Partition Pruning #20100

Uh oh!

ameent commented Dec 28, 2017

Uh oh!

ameent commented Jan 11, 2018

Uh oh!

ameent commented Jan 11, 2018 •

edited

Loading

Uh oh!

ameent commented Jan 17, 2018

Uh oh!

cloud-fan commented Jan 17, 2018

Uh oh!

ameent commented Jan 17, 2018

Uh oh!

AmplabJenkins commented Jun 9, 2018

Uh oh!

cloud-fan commented Jun 9, 2018

Uh oh!

HyukjinKwon commented Jul 13, 2018

Uh oh!

HyukjinKwon commented Jul 13, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[SPARK-22913][SQL] Improved Hive Partition Pruning #20100

[SPARK-22913][SQL] Improved Hive Partition Pruning #20100

Uh oh!

Conversation

ameent commented Dec 28, 2017

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

ameent commented Jan 11, 2018

Uh oh!

ameent commented Jan 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ameent commented Jan 17, 2018

Uh oh!

cloud-fan commented Jan 17, 2018

Uh oh!

ameent commented Jan 17, 2018

Uh oh!

AmplabJenkins commented Jun 9, 2018

Uh oh!

cloud-fan commented Jun 9, 2018

Uh oh!

HyukjinKwon commented Jul 13, 2018

Uh oh!

HyukjinKwon commented Jul 13, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ameent commented Jan 11, 2018 •

edited

Loading