Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Jul 2, 2018

What changes were proposed in this pull request?

In this PR, I propose a column-based API for the pivot() function. It allows any column expression to be used as the pivot column, and it makes pivot() consistent with how groupBy() works.
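The shape of the proposed change can be sketched with stub types. `Column` and `GroupedData` below are simplified stand-ins, not the real Spark classes; the actual change lives in RelationalGroupedDataset:

```scala
// Minimal, self-contained sketch of the proposed delegation.
// These are toy stand-ins for the Spark API, not the real classes.
case class Column(expr: String)

def col(name: String): Column = Column(name)

class GroupedData {
  // Existing String-based API, kept for compatibility and delegating
  // to the new Column-based overload.
  def pivot(pivotColumn: String, values: Seq[Any]): String =
    pivot(col(pivotColumn), values)

  // New overload: any column expression can be the pivot column.
  def pivot(pivotColumn: Column, values: Seq[Any]): String =
    s"Pivot(${pivotColumn.expr}, values=[${values.mkString(", ")}])"
}

val gd = new GroupedData
val byName = gd.pivot("year", Seq(2012, 2013))
val byColumn = gd.pivot(col("year"), Seq(2012, 2013))
// Both forms produce the same plan in this sketch.
```

The point of the delegation is that the String overload becomes a thin wrapper, so both entry points share one code path.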

How was this patch tested?

I added new tests to DataFramePivotSuite and updated PySpark examples for the pivot() function.

@SparkQA

SparkQA commented Jul 3, 2018

Test build #92541 has finished for PR 21699 at commit 0fdd11f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @since 2.4.0
*/
def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = {
def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
Member

To make diffs smaller, can you move this under the signature def pivot(pivotColumn: String, values: Seq[Any])?

@maropu
Member

maropu commented Jul 3, 2018

def pivot(pivotColumn: String), too?

@maropu
Member

maropu commented Jul 3, 2018

cc: @rxin @gatorsmile

@SparkQA

SparkQA commented Jul 3, 2018

Test build #92561 has finished for PR 21699 at commit d62b7e7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Jul 3, 2018

jenkins, retest this, please

@HyukjinKwon
Member

cc @aray

@rednaxelafx
Contributor

This mostly looks good, but I'd like to ask a few things first:

  1. The new overloaded pivot() that takes a Column only exists as pivot(Column, Seq[Any]). Were you planning to add a new overload for each existing String version, e.g. pivot(Column) and pivot(Column, java.util.List[Any])?
  2. Since you're adding the Column version(s) to address accessing nested columns, it would be nice to highlight that capability in the doc example. (You've already covered it in the test cases, which is good.)

@SparkQA

SparkQA commented Jul 3, 2018

Test build #92568 has finished for PR 21699 at commit d62b7e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Jul 3, 2018

Were you planning to add a new overload for each existing String version, e.g. pivot(Column) and pivot(Column, java.util.List[Any])?

The methods have been added already. @rednaxelafx Please look at these lines:
https://github.com/apache/spark/pull/21699/files#diff-95bb2228c67e3cce4c729e44e2d82422R362
https://github.com/apache/spark/pull/21699/files#diff-95bb2228c67e3cce4c729e44e2d82422R377

Yes you've already included that in test case examples so that's already good

I added this test case: https://github.com/apache/spark/pull/21699/files#diff-50aa7d3b7b7934a7df6f414396e74c3cR271 .

@aray
Contributor

aray commented Jul 3, 2018

Accepting either a Column or a String was actually in my original PR: #7841
@rxin later modified the API to take only a String prior to the release, as part of an API audit: #9929

Considering you can just call withColumn first, I'm not really convinced of the utility of this PR.
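The workaround aray refers to can be sketched with stub types. `Column` and `DataFrame` here are simplified stand-ins for the real Spark classes, just to show the two call shapes side by side:

```scala
// Toy stand-ins for the Spark API, not the real classes.
case class Column(expr: String)
def lower(c: Column): Column = Column(s"lower(${c.expr})")

case class DataFrame(cols: Map[String, Column]) {
  def withColumn(name: String, c: Column): DataFrame =
    DataFrame(cols + (name -> c))
  // String-based pivot: only works on a named column.
  def pivot(name: String): Column = cols(name)
  // Proposed Column-based pivot: takes any expression directly.
  def pivot(c: Column): Column = c
}

val df = DataFrame(Map("training" -> Column("training")))

// Workaround: materialize the expression under a name, then pivot by name...
val viaWithColumn = df.withColumn("tmp", lower(Column("training"))).pivot("tmp")
// ...versus pivoting by the expression directly.
val direct = df.pivot(lower(Column("training")))
```

Both routes end up pivoting on the same expression; the PR's argument is about ergonomics, not capability.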

@SparkQA

SparkQA commented Jul 3, 2018

Test build #92577 has finished for PR 21699 at commit fae4fd2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Jul 4, 2018

Considering you can just make a call to withColumn first I'm not really convinced in the utility of this PR.

The purpose of this PR is to make the pivot API consistent with groupBy and clearer. Our users/clients shouldn't have to google workarounds like withColumn to apply the pivot() function. I believe we should follow the principle of least astonishment (POLA). Some of our clients form Column expressions programmatically as the result of a calculation, and for now, instead of just passing a Column variable to pivot(), they have to modify the whole expression and inject a projection or withColumn. /cc @ssimeonov

@HyukjinKwon
Member

@MaxGekk, I think you should talk with @rxin rather than going by @aray's comment.

@MaxGekk
Member Author

MaxGekk commented Jul 8, 2018

@maryannxue Please, have a look at the PR.

@SparkQA

SparkQA commented Jul 8, 2018

Test build #92719 has finished for PR 21699 at commit f32a85b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maryannxue
Contributor

@MaxGekk Yes, it was caused by my previous PR. The change in my PR was a workaround for an existing problem with struct-type columns in either Aggregate or PivotFirst (I suspect it's Aggregate). The change itself worked as designed, because Pivot SQL support wouldn't allow any function (like "lowercase") in the pivot column. However, it broke your PR because yours aims to allow any expression.
That said, we have two options here:

  1. Give up the PivotFirst approach and fall back to the "else" branch for struct-type pivot columns, i.e., multiple columns in the pivot FOR clause.
  2. Fix the bug in Aggregate or PivotFirst.

I will do a little investigation into option 2 tomorrow and get back to you :)

@maryannxue
Contributor

@MaxGekk Please take a look at #21926. There was a bug in PivotFirst and this PR should fix your test here.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jul 31, 2018

Test build #93824 has finished for PR 21699 at commit 34535a9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jul 31, 2018

Test build #93826 has finished for PR 21699 at commit 34535a9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 31, 2018

Test build #93832 has finished for PR 21699 at commit cf55135.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maryannxue
Contributor

@MaxGekk LGTM, but one more thing to consider:
Since we support a column list in SQL, it would be nice to support and test it in DataFrame pivot too. The only thing we need to enable is making the pivot values Expressions instead of Literals, because Literals do not cover struct-type literals, e.g. struct(1, 2). The Pivot node already holds its pivot values as Seq[Expression], so all that's left to do is in the DataFrame interfaces.
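The Literal-vs-Expression distinction can be sketched with toy stand-ins for Catalyst's Literal and CreateStruct (these are not the real classes):

```scala
// Sketch of why pivot values need Expression rather than Literal:
// a struct-typed value like struct(1, 2) is not a plain literal.
sealed trait Expression
case class Literal(value: Any) extends Expression
case class CreateStruct(children: Seq[Expression]) extends Expression

def lit(v: Any): Expression = Literal(v)
def struct(vs: Any*): Expression = CreateStruct(vs.map(lit))

// A Seq[Expression] can mix scalar and struct-typed pivot values;
// a Seq[Literal] could not represent the struct case.
val pivotValues: Seq[Expression] = Seq(lit("a"), struct(1, 2))
```

Widening the value type to Expression is what lets a DataFrame-side column list mirror the SQL FOR clause.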

@MaxGekk
Member Author

MaxGekk commented Jul 31, 2018

it would be nice to support it and test it in DataFrame pivot too.

@maryannxue Supported and tested. Please, have a look at: https://github.com/MaxGekk/spark-1/blob/5da5a2c94a1e99cc3edd920080470b3d17cfc699/sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala#L312-L342

import org.apache.spark.sql.functions.struct
groupType match {
  case RelationalGroupedDataset.GroupByType =>
    val pivotValues = values.map {
Member

Hm, wait @maryannxue, I think we shouldn't do this, at least not here.

Member

@HyukjinKwon HyukjinKwon Aug 1, 2018

This PR just proposes pivot(Column) as an overloaded version of pivot(String). It's not necessary to fix other things together. Also, it would need another review iteration, I guess; for instance, do arrays, maps, and nested structs work? Let's take out this change for now.

Member Author

@HyukjinKwon Should I revert the last commit and propose it as a separate PR? I think it makes sense to discuss possible API alternatives in the JIRA ticket.

Member

I hope the last commit gets reverted and we proceed orthogonally, if @maryannxue is happy with that too.

@SparkQA

SparkQA commented Aug 1, 2018

Test build #93848 has finished for PR 21699 at commit 5da5a2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maryannxue
Contributor

Thank you for the change, @MaxGekk!
@HyukjinKwon my idea was actually that the overloaded version of pivot would be pivot(column: Column, values: Seq[Column]), so that we can construct different types in "values". The constant check will be done in the Analyzer, so we don't need to worry about it here.
Ultimately we would like to support complex-typed values in pivot(column: Column) as well, but I think we can make this in a different PR.

@MaxGekk
Member Author

MaxGekk commented Aug 1, 2018

pivot(column: Column, values: Seq[Column]), so that we can construct different types in "values".

Actually, I am mostly worried about the pivotColumn. Specifying multiple columns via struct is not intuitive, I believe.

@maryannxue
Contributor

maryannxue commented Aug 1, 2018

Actually, I am mostly worried about the pivotColumn. Specifying multiple columns via struct is not intuitive, I believe.

It depends on whether we'd like to add extra interfaces for multiple columns. I don't have a preference between reusing this interface for multiple pivot columns and adding new ones, and we can always decide later.
But back to this interface: I'd assume it is for more advanced users, and the pivot column, even being a single column, can have complex types, so "literal object" values might be insufficient. Plus, this is a new interface we haven't shipped yet; once we have, we are more likely to end up adding a new one than changing it if we want to make it more sophisticated later on.

@HyukjinKwon
Member

HyukjinKwon commented Aug 2, 2018

If this PR proposes a different API than an overloaded version of pivot(String, Seq[Any]), it's a different issue, though, I guess.

I would prefer to have pivot(Column, Seq[Any]) and reuse it for multiple pivot columns if we should do this since that would make more sense with the existing pivot(String, Seq[Any]) - I guess this is still possible and we can do it later.

For the current status, I would defer the multiple-column support. I am not quite sure about it yet, and I think it's better to leave things as they are until it's actually requested or needed.

}
}

test("SPARK-24722: pivoting nested columns") {
Member

I usually leave a JIRA number on one specific regression test when it's a bug, since that explicitly points out which case was broken... but it's not a big deal.

@MaxGekk
Member Author

MaxGekk commented Aug 3, 2018

@HyukjinKwon I reverted the last changes. Please take a look at the PR again.

@SparkQA

SparkQA commented Aug 3, 2018

Test build #94109 has finished for PR 21699 at commit ca1250b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Aug 3, 2018

Jenkins, retest this, please

@SparkQA

SparkQA commented Aug 3, 2018

Test build #94123 has finished for PR 21699 at commit ca1250b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

* @since 2.4.0
*/
def pivot(pivotColumn: String, values: java.util.List[Any]): RelationalGroupedDataset = {
def pivot(pivotColumn: Column, values: java.util.List[Any]): RelationalGroupedDataset = {
Contributor

@cloud-fan cloud-fan Aug 8, 2018

I think it's a bad idea to use Any in the API. The existing ones we can't remove, but we should not add new ones using Any.

In Spark 3.0 we should audit all the APIs in functions.scala and make them consistent (e.g. only use Column in the API).

Member

If there's a plan to audit them in 3.0.0, I am okay with going ahead with Column, but the thing is, we should deprecate the old ones first.

For the current status, I think the problem is that this is just an overloaded version of pivot and wouldn't necessarily make the difference clear.

I used pivot heavily at my previous company, and I am pretty sure pivot(String, actual values) has been a common pattern and usage so far.

@MaxGekk MaxGekk deleted the pivot-column branch August 17, 2019 13:33