[SPARK-33509][SQL] List partition by names from a V2 table which supports partition management #30452

MaxGekk · 2020-11-21T15:11:47Z

What changes were proposed in this pull request?

Add a new method listPartitionByNames to the SupportsPartitionManagement interface. It allows listing partitions by partition names and their values.
Implement the new method in InMemoryPartitionTable, which is used in DSv2 tests.

Why are the changes needed?

Currently, the SupportsPartitionManagement interface exposes only listPartitionIdentifiers, which lists partitions by partition values and requires values for all partition schema fields to be specified in the prefix. This restriction makes it impossible to list partitions by only some of the partition names.

For example, the table tableA is partitioned by two columns, year and month:

CREATE TABLE tableA (price int, year int, month int)
USING _
PARTITIONED BY (year, month)

and has the following partitions:

PARTITION(year = 2015, month = 1)
PARTITION(year = 2015, month = 2)
PARTITION(year = 2016, month = 2)
PARTITION(year = 2016, month = 3)

If we want to list all partitions with month = 2, we have to specify year for listPartitionIdentifiers(), which is not always possible because we don't know all year values in advance. The new method listPartitionByNames() allows specifying partition values only for month, and returns two partitions:

PARTITION(year = 2015, month = 2)
PARTITION(year = 2016, month = 2)
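
The matching rule in this example can be sketched in plain Java without any Spark dependency. This is only an illustration under stated assumptions: the class PartitionFilterSketch, the fixed SCHEMA array, and the use of int[] rows in place of InternalRow are hypothetical and are not the code added by this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a partitioned table; not Spark's InMemoryPartitionTable.
class PartitionFilterSketch {
    // Partition schema of the example table: (year, month).
    static final String[] SCHEMA = {"year", "month"};

    // Returns every partition whose values at the given names equal the given values.
    // Empty names and values match every partition.
    static List<int[]> listPartitionByNames(List<int[]> partitions, String[] names, int[] values) {
        if (names.length != values.length) {
            throw new IllegalArgumentException("names and values must have the same length");
        }
        // Resolve each name to its position in the partition schema.
        int[] indexes = new int[names.length];
        for (int i = 0; i < names.length; i++) {
            int idx = Arrays.asList(SCHEMA).indexOf(names[i]);
            if (idx < 0) {
                throw new IllegalArgumentException("Unknown partition column: " + names[i]);
            }
            indexes[i] = idx;
        }
        // Keep the partitions that agree with the requested values on those positions.
        List<int[]> result = new ArrayList<>();
        for (int[] partition : partitions) {
            boolean matches = true;
            for (int i = 0; i < indexes.length; i++) {
                if (partition[indexes[i]] != values[i]) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                result.add(partition);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> partitions = Arrays.asList(
            new int[]{2015, 1}, new int[]{2015, 2}, new int[]{2016, 2}, new int[]{2016, 3});
        // List partitions by month only, without knowing the year values.
        for (int[] p : listPartitionByNames(partitions, new String[]{"month"}, new int[]{2})) {
            System.out.println("PARTITION(year = " + p[0] + ", month = " + p[1] + ")");
        }
    }
}
```

In this sketch the names are resolved to schema positions once, and each partition is then compared only on those positions, so any subset of the partition columns can be used as the filter.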

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running the affected test suite SupportsPartitionManagementSuite.

MaxGekk · 2020-11-21T15:16:28Z

@cloud-fan Could you take a look at this PR, please?

SparkQA · 2020-11-21T15:54:20Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36079/

SparkQA · 2020-11-21T16:28:26Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36079/

SparkQA · 2020-11-21T19:44:45Z

Test build #131472 has finished for PR 30452 at commit 3f20ee8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-11-23T01:36:26Z

...talyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java

+     * @param ident a partition identifier values.
+     * @return an array of Identifiers for the partitions
+     */
+    InternalRow[] listPartitionByNames(String[] names, InternalRow ident);


Is a user supposed to access the Table directly? It looks a bit odd that the listPartitionByNames interface is added but not used on the Spark internal side.

Yea we should remove the old API as it's not released yet.

The method listPartitionIdentifiers() is used only in partitionExists() and in tests. Let me remove it and replace its usage by listPartitionByNames().

How about renaming listPartitionByNames() to just listPartitions()?

It looks a bit odd that the listPartitionByNames interface is added but not used on the Spark internal side.

@HyukjinKwon listPartitionByNames() will be used by V2 commands like SHOW PARTITIONS (in #30398) and SHOW TABLE EXTENDED.

Yea we should remove the old API as it's not released yet.

Frankly speaking, I would remove listPartitionIdentifiers() separately as this requires unrelated changes to list partition by names.

I think the final API name should still be listPartitionIdentifiers

We can remove the old one in your next PR.

I think the final API name should still be listPartitionIdentifiers

Agree with this.

cloud-fan · 2020-11-23T05:34:57Z

The partition can be a transform like year(ts_col), shall we just use the partition index in the API instead?

MaxGekk · 2020-11-23T06:04:11Z

The changes are related to listPartitionIdentifiers() added by #28617. @stczwd @rdblue @RussellSpitzer @emkornfield @dongjoon-hyun May I ask you to review this PR?

MaxGekk · 2020-11-23T06:22:51Z

sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryPartitionTable.scala

+    val indexes = names.map(schema.fieldIndex)
+    val dataTypes = names.map(schema(_).dataType)


The names should be normalized after #30454, so we don't need to care about case sensitivity here.

MaxGekk · 2020-11-23T06:52:29Z

... shall we just use the partition index in the API instead?

@cloud-fan Sure. I see at least two variants of the implementation:

Pass an array of indexes, or
a BitSet

The second one could save some memory, I guess.
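
To make the two variants concrete, here is a minimal sketch of how partition names could be turned into either representation. The class PartitionIndexSketch and its two helper methods are hypothetical names chosen for illustration; only java.util.BitSet is a real library class, and nothing here is part of the PR.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical helpers comparing the two proposed parameter shapes.
class PartitionIndexSketch {
    // Variant 1: resolve names to schema positions, keeping the caller's order.
    static int[] toIndexArray(String[] schema, String[] names) {
        List<String> fields = Arrays.asList(schema);
        int[] indexes = new int[names.length];
        for (int i = 0; i < names.length; i++) {
            int idx = fields.indexOf(names[i]);
            if (idx < 0) {
                throw new IllegalArgumentException("Unknown partition column: " + names[i]);
            }
            indexes[i] = idx;
        }
        return indexes;
    }

    // Variant 2: mark the selected schema positions as bits; the caller's order is lost.
    static BitSet toBitSet(String[] schema, String[] names) {
        BitSet bits = new BitSet(schema.length);
        for (int idx : toIndexArray(schema, names)) {
            bits.set(idx);
        }
        return bits;
    }
}
```

One design consequence visible in the sketch: an index array preserves the order in which the caller listed the columns, while a BitSet only records which positions are selected, so the accompanying values would have to follow schema order.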

MaxGekk · 2020-11-23T10:22:23Z

...src/test/scala/org/apache/spark/sql/connector/catalog/SupportsPartitionManagementSuite.scala

+      (Array("part0", "part1"), InternalRow(0, "abc")) -> Set(InternalRow(0, "abc")),
+      (Array("part0"), InternalRow(0)) -> Set(InternalRow(0, "abc"), InternalRow(0, "def")),
+      (Array("part1"), InternalRow("abc")) -> Set(InternalRow(0, "abc"), InternalRow(1, "abc")),
+      (Array.empty[String], InternalRow.empty) ->


This is a special case that lists all partitions.

cloud-fan · 2020-11-23T12:01:03Z

Since SupportsPartitionManagement already has the API partitionSchema, which means that the implementations will pick a name for partition transforms, I think it's OK to use String[] in the listPartitionIdentifiers API parameter.

MaxGekk · 2020-11-25T11:44:34Z

@cloud-fan @HyukjinKwon Can we continue with this PR or do you have some objections?

cloud-fan · 2020-11-25T12:41:02Z

I'm merging it to unblock the following work. @stczwd @rdblue please leave comments if you have any concerns, so that we can address them.

cloud-fan · 2020-11-25T12:41:47Z

merging to master, thanks!

jackylee-ch · 2020-11-26T01:26:30Z

Since SupportsPartitionManagement already has the API partitionSchema, which means that the implementations will pick a name for partition transforms, I think it's OK to use String[] in the listPartitionIdentifiers API parameter.

Yeah, I prefer extending listPartitionIdentifiers instead of adding a new API listPartitionsByNames.

rdblue · 2020-11-26T01:28:05Z

The partition can be a transform like year(ts_col), shall we just use the partition index in the API instead?

If I remember correctly, there should be a schema exposed by the table that describes these. We should get the name from that schema.

MaxGekk · 2020-11-26T08:52:14Z

I removed listPartitionIdentifiers() and renamed listPartitionsByNames() to listPartitionIdentifiers(): #30514

MaxGekk added 6 commits November 21, 2020 14:27

Add listPartitionByNames() to the SupportsPartitionManagement interface

2d42120

First implementation of listPartitionByNames()

d1cbc92

Add a test

20d9121

Gets all partitions

75c4903

Add asserts

373e22e

Nothing matches to parameters

3f20ee8

github-actions bot added the SQL label Nov 21, 2020

MaxGekk mentioned this pull request Nov 21, 2020

[SPARK-33452][SQL] Support v2 SHOW PARTITIONS #30398

Closed

HyukjinKwon reviewed Nov 23, 2020

MaxGekk commented Nov 23, 2020

cloud-fan closed this in 2c5cc36 Nov 25, 2020

MaxGekk deleted the column-names-listPartitionIdentifiers branch February 19, 2021 15:03
