Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Nov 21, 2020

What changes were proposed in this pull request?

  1. Add new method listPartitionByNames to the SupportsPartitionManagement interface. It allows to list partitions by partition names and their values.
  2. Implement new method in InMemoryPartitionTable which is used in DSv2 tests.

Why are the changes needed?

Currently, the SupportsPartitionManagement interface exposes only listPartitionIdentifiers which allows to list partitions by partition values. And it requires to specify all values for partition schema fields in the prefix. This restriction does not allow to list partitions by some of partition names (not all of them).

For example, the table tableA is partitioned by two column year and month

CREATE TABLE tableA (price int, year int, month int)
USING _
partitioned by (year, month)

and has the following partitions:

PARTITION(year = 2015, month = 1)
PARTITION(year = 2015, month = 2)
PARTITION(year = 2016, month = 2)
PARTITION(year = 2016, month = 3)

If we want to list all partitions with month = 2, we have to specify year for listPartitionIdentifiers() which not always possible as we don't know all year values in advance. New method listPartitionByNames() allows to specify partition values only for month, and get two partitions:

PARTITION(year = 2015, month = 2)
PARTITION(year = 2016, month = 2)

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running the affected test suite SupportsPartitionManagementSuite.

@github-actions github-actions bot added the SQL label Nov 21, 2020
@MaxGekk
Copy link
Member Author

MaxGekk commented Nov 21, 2020

@cloud-fan Could you take a look at this PR, please.

@SparkQA
Copy link

SparkQA commented Nov 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36079/

@SparkQA
Copy link

SparkQA commented Nov 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36079/

@SparkQA
Copy link

SparkQA commented Nov 21, 2020

Test build #131472 has finished for PR 30452 at commit 3f20ee8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @param ident a partition identifier values.
* @return an array of Identifiers for the partitions
*/
InternalRow[] listPartitionByNames(String[] names, InternalRow ident);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a user supposed to access to Table directly? Looks a bit odd that listPartitionByNames interface is added but not used in the Spark internal side.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea we should remove the old API as it's not released yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method listPartitionIdentifiers() is used only in partitionExists() and in tests. Let me remove it and replace its usage by listPartitionByNames().

How about renaming listPartitionByNames() to just listPartitions()?

Copy link
Member Author

@MaxGekk MaxGekk Nov 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks a bit odd that listPartitionByNames interface is added but not used in the Spark internal side.

@HyukjinKwon listPartitionByNames() will be used by V2 commands like SHOW PARTITIONS (in #30398) and SHOW TABLE EXTENDED.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea we should remove the old API as it's not released yet.

Frankly speaking, I would remove listPartitionIdentifiers() separately as this requires unrelated changes to list partition by names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the final API name should still be listPartitionIdentifiers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove the old one in your next PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the final API name should still be listPartitionIdentifiers

Agree with this.

@cloud-fan
Copy link
Contributor

The partition can be a transform like year(ts_col), shall we just partition index in the API instead?

@MaxGekk
Copy link
Member Author

MaxGekk commented Nov 23, 2020

As the changes are related to listPartitionIdentifiers() added by #28617. @stczwd @rdblue @RussellSpitzer @emkornfield @dongjoon-hyun May I ask you to review this PR.

Comment on lines +111 to +112
val indexes = names.map(schema.fieldIndex)
val dataTypes = names.map(schema(_).dataType)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The names should be normalized after #30454, so, we shouldn't care of case sensitivity here.

@MaxGekk
Copy link
Member Author

MaxGekk commented Nov 23, 2020

... shall we just partition index in the API instead?

@cloud-fan Sure. I see at least two variants of the implementation:

  • Pass an array of indexes, or
  • a BitSet

The second one could save some memory, I guess.

(Array("part0", "part1"), InternalRow(0, "abc")) -> Set(InternalRow(0, "abc")),
(Array("part0"), InternalRow(0)) -> Set(InternalRow(0, "abc"), InternalRow(0, "def")),
(Array("part1"), InternalRow("abc")) -> Set(InternalRow(0, "abc"), InternalRow(1, "abc")),
(Array.empty[String], InternalRow.empty) ->
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a special case which allows to list all partitions.

@cloud-fan
Copy link
Contributor

Since SupportsPartitionManagement already have the API partitionSchema, which means that the implementations will pick a name for partition transforms, I think it's OK to use String[] in the listPartitionIdentifiers API parameter.

@MaxGekk
Copy link
Member Author

MaxGekk commented Nov 25, 2020

@cloud-fan @HyukjinKwon Can we continue with this PR or do you have some objections?

@cloud-fan
Copy link
Contributor

I'm merging it to unblock the following work. @stczwd @rdblue please leave comments if you have any concerns, so that we can address them.

@cloud-fan
Copy link
Contributor

merging to master, thanks!

@cloud-fan cloud-fan closed this in 2c5cc36 Nov 25, 2020
@jackylee-ch
Copy link
Contributor

Since SupportsPartitionManagement already have the API partitionSchema, which means that the implementations will pick a name for partition transforms, I think it's OK to use String[] in the listPartitionIdentifiers API parameter.

Yeah, I prefer extend the listPartitionIdentifiers instead of add new API listPartitionsByNames.

@rdblue
Copy link
Contributor

rdblue commented Nov 26, 2020

The partition can be a transform like year(ts_col), shall we just partition index in the API instead?

If I remember correctly, there should be a schema exposed by the table that describes these. We should get the name from that schema.

@MaxGekk
Copy link
Member Author

MaxGekk commented Nov 26, 2020

I removed listPartitionIdentifiers() and renamed listPartitionsByNames() to listPartitionIdentifiers(): #30514

@MaxGekk MaxGekk deleted the column-names-listPartitionIdentifiers branch February 19, 2021 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants