Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Jul 2, 2018

What changes were proposed in this pull request?

In this PR, I propose a column-based API for the pivot() function. It allows any column expression to be used as the pivot column, and it makes pivot() consistent with how groupBy() works.
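The shape of the proposed change can be sketched with stub types. `Column` and `GroupedData` below are simplified stand-ins, not the real Spark classes; the actual change lives in RelationalGroupedDataset:

```scala
// Minimal, self-contained sketch of the proposed delegation.
// These are toy stand-ins for the Spark API, not the real classes.
case class Column(expr: String)

def col(name: String): Column = Column(name)

class GroupedData {
  // Existing String-based API, kept for compatibility and delegating
  // to the new Column-based overload.
  def pivot(pivotColumn: String, values: Seq[Any]): String =
    pivot(col(pivotColumn), values)

  // New overload: any column expression can be the pivot column.
  def pivot(pivotColumn: Column, values: Seq[Any]): String =
    s"Pivot(${pivotColumn.expr}, values=[${values.mkString(", ")}])"
}

val gd = new GroupedData
val byName = gd.pivot("year", Seq(2012, 2013))
val byColumn = gd.pivot(col("year"), Seq(2012, 2013))
// Both forms produce the same plan in this sketch.
```

The point of the delegation is that the String overload becomes a thin wrapper, so both entry points share one code path.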

How was this patch tested?

I added new tests to DataFramePivotSuite and updated PySpark examples for the pivot() function.

@SparkQA

SparkQA commented Jul 3, 2018

Test build #92541 has finished for PR 21699 at commit 0fdd11f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @since 2.4.0
*/
def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = {
def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
Member

To make diffs smaller, can you move this under the signature def pivot(pivotColumn: String, values: Seq[Any])?

@maropu
Member

maropu commented Jul 3, 2018

def pivot(pivotColumn: String), too?

@maropu
Member

maropu commented Jul 3, 2018

cc: @rxin @gatorsmile

@SparkQA

SparkQA commented Jul 3, 2018

Test build #92561 has finished for PR 21699 at commit d62b7e7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Jul 3, 2018

jenkins, retest this, please

@HyukjinKwon
Member

cc @aray

@rednaxelafx
Contributor

This mostly looks good, but I'd like to ask a few things first:

  1. The new overloaded pivot() that takes a Column only exists as pivot(Column, Seq[Any]). Were you planning to add a new overload for each existing String version, e.g. pivot(Column) and pivot(Column, java.util.List[Any])?
  2. Since you're adding the Column version(s) to address accessing nested columns, it would be nice to highlight that capability in the doc example. (You've already covered it in the test cases, which is good.)

@SparkQA

SparkQA commented Jul 3, 2018

Test build #92568 has finished for PR 21699 at commit d62b7e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Jul 3, 2018

Were you planning to add a new overload for each existing String version, e.g. pivot(Column) and pivot(Column, java.util.List[Any])?

The methods have been added already. @rednaxelafx Please look at these lines:
https://github.com/apache/spark/pull/21699/files#diff-95bb2228c67e3cce4c729e44e2d82422R362
https://github.com/apache/spark/pull/21699/files#diff-95bb2228c67e3cce4c729e44e2d82422R377

Yes you've already included that in test case examples so that's already good

I added this test case: https://github.com/apache/spark/pull/21699/files#diff-50aa7d3b7b7934a7df6f414396e74c3cR271 .

@aray
Contributor

aray commented Jul 3, 2018

Accepting either a Column or a String was actually in my original PR: #7841
@rxin later modified the API to take only a String prior to the release, as part of an API audit: #9929

Considering you can just call withColumn first, I'm not really convinced of the utility of this PR.
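The workaround aray refers to can be sketched with stub types. `Column` and `DataFrame` here are simplified stand-ins for the real Spark classes, just to show the two call shapes side by side:

```scala
// Toy stand-ins for the Spark API, not the real classes.
case class Column(expr: String)
def lower(c: Column): Column = Column(s"lower(${c.expr})")

case class DataFrame(cols: Map[String, Column]) {
  def withColumn(name: String, c: Column): DataFrame =
    DataFrame(cols + (name -> c))
  // String-based pivot: only works on a named column.
  def pivot(name: String): Column = cols(name)
  // Proposed Column-based pivot: takes any expression directly.
  def pivot(c: Column): Column = c
}

val df = DataFrame(Map("training" -> Column("training")))

// Workaround: materialize the expression under a name, then pivot by name...
val viaWithColumn = df.withColumn("tmp", lower(Column("training"))).pivot("tmp")
// ...versus pivoting by the expression directly.
val direct = df.pivot(lower(Column("training")))
```

Both routes end up pivoting on the same expression; the PR's argument is about ergonomics, not capability.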

@SparkQA

SparkQA commented Jul 3, 2018

Test build #92577 has finished for PR 21699 at commit fae4fd2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Jul 4, 2018

Considering you can just make a call to withColumn first I'm not really convinced in the utility of this PR.

The purpose of this PR is to make the pivot API consistent with groupBy and clearer. Our users/clients shouldn't have to google workarounds like withColumn to apply the pivot() function. I believe we should follow the principle of least astonishment (POLA). Some of our clients form Column expressions programmatically as the result of a calculation, and for now, instead of just passing a Column variable to pivot(), they have to modify the whole expression and inject a projection or withColumn. /cc @ssimeonov

@HyukjinKwon
Member

@MaxGekk, I think you should talk with @rxin rather than going by @aray's comment.

@MaxGekk
Member Author

MaxGekk commented Jul 8, 2018

@maryannxue Please, have a look at the PR.

@SparkQA

SparkQA commented Jul 8, 2018

Test build #92719 has finished for PR 21699 at commit f32a85b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maryannxue
Contributor

@MaxGekk Yes, it was caused by my previous PR. The change in my PR was a workaround for an existing problem with struct-type columns in either Aggregate or PivotFirst (I suspect it's Aggregate). The change itself worked as designed, because Pivot SQL support wouldn't allow any function (like "lowercase") in the pivot column. However, it broke your PR because yours aims to allow any expression.
That said, we have two options here:

  1. Give up the PivotFirst approach and fall back to the "else" branch for struct-type pivot columns, i.e., multiple columns in the pivot FOR clause.
  2. Fix the bug in Aggregate or PivotFirst.

I will do a little investigation into option 2 tomorrow and get back to you :)

@maryannxue
Contributor

@MaxGekk Please take a look at #21926. There was a bug in PivotFirst and this PR should fix your test here.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jul 31, 2018

Test build #93824 has finished for PR 21699 at commit 34535a9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jul 31, 2018

Test build #93826 has finished for PR 21699 at commit 34535a9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 31, 2018

Test build #93832 has finished for PR 21699 at commit cf55135.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maryannxue
Contributor

@MaxGekk LGTM, but one more thing to consider:
Since we support a column list in SQL, it would be nice to support and test it in DataFrame pivot too. The only thing we need to enable is making the pivot values Expressions instead of Literals, because Literals do not cover struct-type literals, e.g. struct(1, 2). The Pivot node already holds its pivot values as Seq[Expression], so all that's left to do is in the DataFrame interfaces.
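The Literal-vs-Expression distinction can be sketched with toy stand-ins for Catalyst's Literal and CreateStruct (these are not the real classes):

```scala
// Sketch of why pivot values need Expression rather than Literal:
// a struct-typed value like struct(1, 2) is not a plain literal.
sealed trait Expression
case class Literal(value: Any) extends Expression
case class CreateStruct(children: Seq[Expression]) extends Expression

def lit(v: Any): Expression = Literal(v)
def struct(vs: Any*): Expression = CreateStruct(vs.map(lit))

// A Seq[Expression] can mix scalar and struct-typed pivot values;
// a Seq[Literal] could not represent the struct case.
val pivotValues: Seq[Expression] = Seq(lit("a"), struct(1, 2))
```

Widening the value type to Expression is what lets a DataFrame-side column list mirror the SQL FOR clause.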

@MaxGekk
Member Author

MaxGekk commented Jul 31, 2018

it would be nice to support it and test it in DataFrame pivot too.

@maryannxue Supported and tested. Please, have a look at: https://github.com/MaxGekk/spark-1/blob/5da5a2c94a1e99cc3edd920080470b3d17cfc699/sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala#L312-L342

import org.apache.spark.sql.functions.struct
groupType match {
  case RelationalGroupedDataset.GroupByType =>
    val pivotValues = values.map {
Member

Hm, wait @maryannxue, I think we shouldn't do this, at least not here.

Member

@HyukjinKwon HyukjinKwon Aug 1, 2018

This PR just proposes pivot(Column) as an overloaded version of pivot(String). It's not necessary to fix other things together. Also, it would need another review iteration, I guess; for instance, do arrays, maps, and nested structs work? Let's take out this change for now.

Member Author

@HyukjinKwon Should I revert the last commit and propose it as a separate PR? I think it makes sense to discuss possible API alternatives in the JIRA ticket.

Member

I hope the last commit gets reverted and we proceed orthogonally, if @maryannxue is happy with that too.

@SparkQA

SparkQA commented Aug 1, 2018

Test build #93848 has finished for PR 21699 at commit 5da5a2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maryannxue
Contributor

Thank you for the change, @MaxGekk!
@HyukjinKwon my idea was actually that the overloaded version of pivot would be pivot(column: Column, values: Seq[Column]), so that we can construct different types in "values". The constant check will be done in the Analyzer, so we don't need to worry about it here.
Ultimately we would like to support complex-typed values in pivot(column: Column) as well, but I think we can make this in a different PR.

@MaxGekk
Member Author

MaxGekk commented Aug 1, 2018

pivot(column: Column, values: Seq[Column]), so that we can construct different types in "values".

Actually, I am mostly worried about the pivotColumn. Specifying multiple columns via struct is not intuitive, I believe.

@maryannxue
Contributor

maryannxue commented Aug 1, 2018

Actually, I am mostly worried about the pivotColumn. Specifying multiple columns via struct is not intuitive, I believe.

It depends on whether we'd like to add extra interfaces for multiple columns. I don't have a preference between reusing this interface for multiple pivot columns and adding new ones, and we can always decide later.
But back to this interface: I'd assume it is for more advanced users, and the pivot column, even being a single column, can have complex types, so "literal object" values might be insufficient. Plus, this is a new interface we haven't shipped yet; once we have, we are more likely to end up adding a new one than changing it if we want to make it more sophisticated later on.

@HyukjinKwon
Member

HyukjinKwon commented Aug 2, 2018

If this PR proposes a different API than an overloaded version of pivot(String, Seq[Any]), it's a different issue, though, I guess.

I would prefer to have pivot(Column, Seq[Any]) and reuse it for multiple pivot columns if we should do this since that would make more sense with the existing pivot(String, Seq[Any]) - I guess this is still possible and we can do it later.

For the current status, I would defer the multiple-column support. I am not quite sure about it yet, and I think it's better to leave things as they are until it's actually requested or needed.

}
}

test("SPARK-24722: pivoting nested columns") {
Member

I usually leave a JIRA number on one specific regression test when it's a bug, since that explicitly points out which case was broken... but it's not a big deal.

@MaxGekk
Member Author

MaxGekk commented Aug 3, 2018

@HyukjinKwon I reverted the last changes. Please take a look at the PR again.

@SparkQA

SparkQA commented Aug 3, 2018

Test build #94109 has finished for PR 21699 at commit ca1250b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Aug 3, 2018

Jenkins, retest this, please

@SparkQA

SparkQA commented Aug 3, 2018

Test build #94123 has finished for PR 21699 at commit ca1250b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

* @since 2.4.0
*/
def pivot(pivotColumn: String, values: java.util.List[Any]): RelationalGroupedDataset = {
def pivot(pivotColumn: Column, values: java.util.List[Any]): RelationalGroupedDataset = {
Contributor

@cloud-fan cloud-fan Aug 8, 2018

I think it's a bad idea to use Any in the API. The existing ones we can't remove, but we should not add new ones using Any.

In Spark 3.0 we should audit all the APIs in functions.scala and make them consistent (e.g. only use Column in the API).

Member

If there's a plan to audit them in 3.0.0, I am okay with going ahead with Column, but the thing is, we should deprecate the old ones first.

For the current status, I think the problem is that this is just an overloaded version of pivot and wouldn't necessarily make the difference clear.

I used pivot heavily at my previous company, and I am pretty sure pivot(String, actual values) has been a common pattern and usage so far.

@MaxGekk MaxGekk deleted the pivot-column branch August 17, 2019 13:33