[SPARK-33509][SQL] List partition by names from a V2 table which supports partition management #30452
|
@cloud-fan Could you take a look at this PR, please. |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #131472 has finished for PR 30452 at commit
|
 * @param ident a partition identifier values.
 * @return an array of Identifiers for the partitions
 */
InternalRow[] listPartitionByNames(String[] names, InternalRow ident);
Is a user supposed to access Table directly? Looks a bit odd that listPartitionByNames interface is added but not used in the Spark internal side.
Yea we should remove the old API as it's not released yet.
The method listPartitionIdentifiers() is used only in partitionExists() and in tests. Let me remove it and replace its usage by listPartitionByNames().
How about renaming listPartitionByNames() to just listPartitions()?
Looks a bit odd that listPartitionByNames interface is added but not used in the Spark internal side.
@HyukjinKwon listPartitionByNames() will be used by V2 commands like SHOW PARTITIONS (in #30398) and SHOW TABLE EXTENDED.
Yea we should remove the old API as it's not released yet.
Frankly speaking, I would remove listPartitionIdentifiers() in a separate PR, as removing it requires changes unrelated to listing partitions by names.
I think the final API name should still be listPartitionIdentifiers
We can remove the old one in your next PR.
I think the final API name should still be listPartitionIdentifiers
Agree with this.
|
The partition can be a transform like |
|
As the changes are related to |
val indexes = names.map(schema.fieldIndex)
val dataTypes = names.map(schema(_).dataType)
The names should be normalized after #30454, so we shouldn't need to care about case sensitivity here.
@cloud-fan Sure. I see at least two variants of the implementation:
The second one could save some memory, I guess. |
(Array("part0", "part1"), InternalRow(0, "abc")) -> Set(InternalRow(0, "abc")),
(Array("part0"), InternalRow(0)) -> Set(InternalRow(0, "abc"), InternalRow(0, "def")),
(Array("part1"), InternalRow("abc")) -> Set(InternalRow(0, "abc"), InternalRow(1, "abc")),
(Array.empty[String], InternalRow.empty) ->
This is a special case that allows listing all partitions.
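To make the test expectations above concrete, here is a minimal Java sketch of name-based matching over an in-memory partition list. It is not the code from this PR: the class NamedPartitionListing, the use of plain List<Object> rows instead of InternalRow, and the three-row partition set (only the rows visible in the test) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a partitioned table; it is not Spark's InMemoryPartitionTable.
final class NamedPartitionListing {
    // Partition schema field names in declaration order (names taken from the test).
    static final List<String> SCHEMA = Arrays.asList("part0", "part1");

    // Assumed partition set: only the rows visible in the test expectations.
    static final List<List<Object>> PARTITIONS = Arrays.asList(
        Arrays.<Object>asList(0, "abc"),
        Arrays.<Object>asList(0, "def"),
        Arrays.<Object>asList(1, "abc"));

    // Returns every partition whose value for names.get(i) equals values.get(i).
    static List<List<Object>> listPartitionByNames(List<String> names, List<Object> values) {
        if (names.size() != values.size()) {
            throw new IllegalArgumentException("names and values must have the same length");
        }
        // Resolve each name to its position in the partition schema.
        int[] indexes = new int[names.size()];
        for (int i = 0; i < indexes.length; i++) {
            indexes[i] = SCHEMA.indexOf(names.get(i));
            if (indexes[i] < 0) {
                throw new IllegalArgumentException("unknown partition field: " + names.get(i));
            }
        }
        List<List<Object>> result = new ArrayList<>();
        for (List<Object> partition : PARTITIONS) {
            boolean matches = true;
            for (int i = 0; i < indexes.length && matches; i++) {
                matches = partition.get(indexes[i]).equals(values.get(i));
            }
            // With no names there is nothing to compare, so every partition matches.
            if (matches) {
                result.add(partition);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [[0, abc], [0, def]]
        System.out.println(listPartitionByNames(
            Arrays.asList("part0"), Arrays.<Object>asList(0)));
        // prints [[0, abc], [1, abc]]
        System.out.println(listPartitionByNames(
            Arrays.asList("part1"), Arrays.<Object>asList("abc")));
        // prints [[0, abc], [0, def], [1, abc]]
        System.out.println(listPartitionByNames(
            Collections.<String>emptyList(), Collections.<Object>emptyList()));
    }
}
```

In this sketch the empty-names case needs no dedicated branch: the comparison loop never runs, so all partitions are returned, which matches the special case described in the comment.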
|
Since |
|
@cloud-fan @HyukjinKwon Can we continue with this PR or do you have some objections? |
|
I'm merging it to unblock the following work. @stczwd @rdblue please leave comments if you have any concerns, so that we can address them. |
|
merging to master, thanks! |
Yeah, I prefer extend the |
If I remember correctly, there should be a schema exposed by the table that describes these. We should get the name from that schema. |
|
I removed |
What changes were proposed in this pull request?
1. Add listPartitionByNames to the SupportsPartitionManagement interface. It lists partitions by partition names and their values.
2. Implement it in InMemoryPartitionTable, which is used in DSv2 tests.
Why are the changes needed?
Currently, the SupportsPartitionManagement interface exposes only listPartitionIdentifiers, which lists partitions by partition values. It requires values for all partition schema fields in the prefix, so partitions cannot be listed by only some of the partition names.
For example, the table tableA is partitioned by two columns, year and month, and has the following partitions:
If we want to list all partitions with month = 2, we have to specify year for listPartitionIdentifiers(), which is not always possible because we don't know all year values in advance. The new method listPartitionByNames() allows specifying a partition value only for month and returns two partitions:
Does this PR introduce any user-facing change?
No
How was this patch tested?
By running the affected test suite SupportsPartitionManagementSuite.
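The year/month motivation in the description can be sketched as a contrast between prefix-style and name-style lookup. This is a simplified model rather than Spark code: the class PrefixVersusNames, both method names, and the four partition values are hypothetical, chosen only so that month = 2 matches two partitions as the description states. The prefix semantics shown are one reading of the restriction described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of tableA; the partition values below are illustrative only.
final class PrefixVersusNames {
    static final List<String> SCHEMA = Arrays.asList("year", "month");
    static final int[][] PARTITIONS = {{2015, 1}, {2015, 2}, {2016, 2}, {2016, 3}};

    // Prefix-style lookup: values bind to the leading schema fields, in order.
    static List<String> listByPrefix(int... prefix) {
        if (prefix.length > SCHEMA.size()) {
            throw new IllegalArgumentException("prefix is longer than the partition schema");
        }
        List<String> result = new ArrayList<>();
        for (int[] partition : PARTITIONS) {
            boolean matches = true;
            for (int i = 0; i < prefix.length && matches; i++) {
                matches = partition[i] == prefix[i];
            }
            if (matches) {
                result.add(render(partition));
            }
        }
        return result;
    }

    // Name-style lookup: each value binds to the schema field with the given name.
    static List<String> listByNames(String[] names, int[] values) {
        if (names.length != values.length) {
            throw new IllegalArgumentException("names and values must have the same length");
        }
        List<String> result = new ArrayList<>();
        for (int[] partition : PARTITIONS) {
            boolean matches = true;
            for (int i = 0; i < names.length && matches; i++) {
                int index = SCHEMA.indexOf(names[i]);
                if (index < 0) {
                    throw new IllegalArgumentException("unknown partition field: " + names[i]);
                }
                matches = partition[index] == values[i];
            }
            if (matches) {
                result.add(render(partition));
            }
        }
        return result;
    }

    static String render(int[] partition) {
        return "year=" + partition[0] + "/month=" + partition[1];
    }

    public static void main(String[] args) {
        // A lone 2 binds to year under prefix semantics, so nothing matches: prints []
        System.out.println(listByPrefix(2));
        // Naming the field targets month directly:
        // prints [year=2015/month=2, year=2016/month=2]
        System.out.println(listByNames(new String[] {"month"}, new int[] {2}));
    }
}
```

With prefix-only lookup, reaching the month = 2 partitions would require one call per known year, which is the limitation the description points out.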