[SPARK-24722][SQL] pivot() with Column type argument #21699
|
Test build #92541 has finished for PR 21699 at commit
|
 * @since 2.4.0
 */
def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = {
def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
To make diffs smaller, can you move this under the signature def pivot(pivotColumn: String, values: Seq[Any])?
|
|
|
cc: @rxin @gatorsmile |
|
Test build #92561 has finished for PR 21699 at commit
|
|
jenkins, retest this, please |
|
cc @aray |
|
This mostly looks good, but I'd like to ask a few things first:
|
|
Test build #92568 has finished for PR 21699 at commit
|
The methods have been added already. @rednaxelafx Please, look at the lines:
I added this test case: https://github.com/apache/spark/pull/21699/files#diff-50aa7d3b7b7934a7df6f414396e74c3cR271 . |
|
Test build #92577 has finished for PR 21699 at commit
|
Purpose of the PR is to make pivot API consistent to |
|
@maryannxue Please, have a look at the PR. |
|
Test build #92719 has finished for PR 21699 at commit
|
|
@MaxGekk Yes, it was caused by my previous PR. The change in my PR was a workaround for an existing problem with struct-type columns in either Aggregate or PivotFirst (I suspect it's Aggregate). The change itself worked as designed because Pivot SQL support wouldn't allow any function (like "lowercase") in the pivot column. However, it broke your PR because yours aims to allow any expression.
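To make "any expression in the pivot column" concrete, here is a minimal sketch of pivoting on a function of a column through the Column-based overload. The object name, method name, and sample rows are illustrative assumptions, not code from this PR; it assumes Spark 2.4.0 or later on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{lower, sum}

// Illustrative sketch; the object name and sample rows are made up for this example.
object PivotOnExpression {

  // Pivots on lower(course), an expression rather than a plain column name.
  def pivotByLowercasedCourse(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val sales = Seq(
      ("dotNET", 2012, 10000),
      ("Java", 2012, 20000),
      ("dotNET", 2013, 48000),
      ("Java", 2013, 30000)
    ).toDF("course", "year", "earnings")

    sales
      .groupBy($"year")
      // pivot(Column, Seq[Any]) is the overload proposed in this PR.
      .pivot(lower($"course"), Seq("dotnet", "java"))
      .agg(sum($"earnings"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-on-expression")
      .getOrCreate()
    try pivotByLowercasedCourse(spark).orderBy("year").show()
    finally spark.stop()
  }
}
```

With the String-based overload, the same result would need an extra withColumn step to materialize the lowercased value before pivoting.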
|
|
retest this please |
|
Test build #93824 has finished for PR 21699 at commit
|
|
retest this please |
|
Test build #93826 has finished for PR 21699 at commit
|
|
Test build #93832 has finished for PR 21699 at commit
|
|
@MaxGekk LGTM, but one more thing to consider: |
@maryannxue Supported and tested. Please, have a look at: https://github.com/MaxGekk/spark-1/blob/5da5a2c94a1e99cc3edd920080470b3d17cfc699/sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala#L312-L342 |
import org.apache.spark.sql.functions.struct
groupType match {
case RelationalGroupedDataset.GroupByType =>
val pivotValues = values.map {
Hm, wait. @maryannxue, I think we shouldn't do this, at least not here.
This PR just proposes pivot(column), an overloaded version of pivot(string). It's not necessary to fix other things together. Also, I guess it needs another review iteration: for instance, do array, map, and nested struct values work? Let's take out this change for now.
@HyukjinKwon Should I revert the last commit and propose it as a separate PR? I think it makes sense to discuss possible API alternatives in the JIRA ticket.
I hope the last commit is reverted and we go ahead orthogonally if @maryannxue is happy with that too.
|
Test build #93848 has finished for PR 21699 at commit
|
|
Thank you for the change, @MaxGekk! |
Actually, I am mostly worried about the |
It depends on whether we'd like to add extra interfaces for multiple columns. I don't have a preference between reusing this interface for multiple pivot columns or adding new ones. And we can always decide later. |
|
If this PR proposes a different API than an overloaded version of I would prefer to have For the current status, I would let the multiple |
}
}

test("SPARK-24722: pivoting nested columns") {
I usually leave a JIRA number for one specific regression test when it's a bug since that's going to explicitly point out which case was broken .. but not a big deal though.
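For readers who have not opened the suite, the following is a rough sketch of what pivoting nested columns can look like. It is not the body of the test under review: the object name, the trainingSales data, and the struct layout are assumptions made for illustration, and it assumes Spark 2.4.0 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{lower, struct, sum}

// Illustrative sketch; names and data are hypothetical, not taken from DataFramePivotSuite.
object PivotNestedColumns {

  // Groups, pivots, and aggregates on fields nested inside a struct column named "sales".
  def pivotNestedCourse(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val trainingSales = Seq(
      ("Experts", "dotNET", 2012, 10000),
      ("Experts", "Java", 2012, 20000),
      ("Dummies", "dotNET", 2012, 5000),
      ("Experts", "dotNET", 2013, 48000),
      ("Dummies", "Java", 2013, 30000)
    ).toDF("training", "course", "year", "earnings")
      // Wrap the flat columns into one struct so the pivot column is a nested field.
      .select($"training", struct($"course", $"year", $"earnings").as("sales"))

    trainingSales
      .groupBy($"sales.year")
      .pivot(lower($"sales.course"), Seq("dotnet", "java"))
      .agg(sum($"sales.earnings"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-nested-columns")
      .getOrCreate()
    try pivotNestedCourse(spark).show()
    finally spark.stop()
  }
}
```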
This reverts commit 5da5a2c.
|
@HyukjinKwon I reverted the last changes. Please take a look at the PR again.
|
Test build #94109 has finished for PR 21699 at commit
|
|
Jenkins, retest this, please |
|
Test build #94123 has finished for PR 21699 at commit
|
|
Merged to master. |
 * @since 2.4.0
 */
def pivot(pivotColumn: String, values: java.util.List[Any]): RelationalGroupedDataset = {
def pivot(pivotColumn: Column, values: java.util.List[Any]): RelationalGroupedDataset = {
I think it's a bad idea to use Any in the API. For the existing ones we can't remove, but we should not add new ones using Any.
In Spark 3.0 we should audit all the APIs in functions.scala and make them consistent (e.g. only use Column in the API).
If there's a plan for auditing it in 3.0.0, I am okay with going ahead with Column, but the thing is, we should deprecate them first.
For the current status, I think the problem here is that this is an overloaded version of pivot, so it wouldn't necessarily make a difference.
I used pivot heavily at my previous company, and I am pretty sure pivot(String, actual values) has been a common pattern and usage so far.
What changes were proposed in this pull request?
In the PR, I propose a column-based API for the pivot() function. It allows any column expression to be used as the pivot column. This also makes it consistent with how groupBy() works.
How was this patch tested?
I added new tests to DataFramePivotSuite and updated the PySpark examples for the pivot() function.
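As a small illustration of the consistency with groupBy() mentioned in the description, the sketch below calls the existing String-based overload and the proposed Column-based overload side by side. The object name, helper methods, and sample data are assumptions for this example only; it assumes Spark 2.4.0 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Illustrative sketch; the object, helpers, and data are hypothetical.
object PivotOverloads {

  private def sales(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("dotNET", 2012, 10000),
      ("Java", 2012, 20000),
      ("dotNET", 2013, 48000),
      ("Java", 2013, 30000)
    ).toDF("course", "year", "earnings")
  }

  // Existing style: grouping and pivot columns are given by name.
  def byName(spark: SparkSession): DataFrame =
    sales(spark).groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")

  // Column style: groupBy and pivot both take Column arguments.
  def byColumn(spark: SparkSession): DataFrame =
    sales(spark).groupBy(col("year")).pivot(col("course"), Seq("dotNET", "Java")).sum("earnings")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-overloads")
      .getOrCreate()
    try {
      byName(spark).orderBy("year").show()
      byColumn(spark).orderBy("year").show()
    } finally spark.stop()
  }
}
```

For a plain column reference, both calls are expected to produce the same rows; the Column overload only matters once the pivot column is an expression.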