[SPARK-25048][SQL] Pivoting by multiple columns in Scala/Java #22030
Conversation
@maryannxue Please have a look at the PR.
   * @since 2.4.0
   */
- def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
+ def pivot(pivotColumn: Column, values: Seq[Column]): RelationalGroupedDataset = {
@HyukjinKwon I think this change is better than what #21699 did.
Hm, wouldn't it be better to allow Seq[Column] for both pivot(String ...) and pivot(Column ...) via Seq[Any], since pivot(String ...)'s signature already allows it?
BTW, we should document this in the param docs and describe the difference clearly. Otherwise the current API change makes the usage potentially quite confusing to me.
I am okay with it if that's what you all think. It should really be clearly documented then if we go ahead with the current approach.
@HyukjinKwon
You can just consider pivot(String, Seq[Any]) a simplified version of pivot(Column, Seq[Column]) for users who don't need multiple pivot columns or a pivot column of complex types. Given that we now have both the full-featured version and the simple version here, I don't think adding another signature is necessary.
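For illustration, the two versions might be used roughly like this (a sketch only; `df` and its columns `year`, `course`, `training`, and `earnings` are hypothetical, and a Spark session with its implicits is assumed to be in scope):

```scala
import org.apache.spark.sql.functions.{lit, lower, struct, sum}

// Simple version: a single named pivot column with plain literal values.
df.groupBy("year")
  .pivot("course", Seq("dotNET", "Java"))
  .agg(sum("earnings"))

// Full version: an arbitrary Column expression with Column values,
// which also covers multiple pivot columns and complex types.
df.groupBy($"year")
  .pivot(struct(lower($"course"), $"training"),
         Seq(struct(lit("dotnet"), lit("Experts"))))
  .agg(sum($"earnings"))
```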
Yea, I didn't mean to add another signature. My only worry is that pivot(String, Seq[Any]) can take actual values as well whereas pivot(Column, Seq[Column]) does not allow actual values, right?
I was thinking we should allow both cases for both APIs. Otherwise it can be confusing, can't it? These differences should really be clarified.
No. Seq[Any] takes literal values (objects); Seq[Column] takes Column expressions.
I mean:
Before:
scala> val df = spark.range(10).selectExpr("struct(id) as a")
df: org.apache.spark.sql.DataFrame = [a: struct<id: bigint>]
scala> df.groupBy().pivot("a", Seq(struct(lit(1)))).count().show()
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Column named_struct(col1, 1)
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
at org.apache.spark.sql.RelationalGroupedDataset$$anonfun$pivot$1.apply(RelationalGroupedDataset.scala:419)
at org.apache.spark.sql.RelationalGroupedDataset$$anonfun$pivot$1.apply(RelationalGroupedDataset.scala:419)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:419)
at org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:338)
... 51 elided
After:
scala> val df = spark.range(10).selectExpr("struct(id) as a")
df: org.apache.spark.sql.DataFrame = [a: struct<id: bigint>]
scala> df.groupBy().pivot("a", Seq(struct(lit(1)))).count().show()
+---+
|[1]|
+---+
| 1|
+---+
I think @MaxGekk's intention was to keep the old signature as it is but somehow used "lit" which takes Column too. Correct me if I'm wrong, @MaxGekk.
So back to the choice between pivot(Column, Seq[Column]) and pivot(Column, Seq[Any]), I think having an explicit Seq[Column] type is less confusing and kind of tells people by itself that we now support complex types in pivot values.
I think #22030 (comment) makes perfect sense. We really don't need to make it complicated.
having an explicit Seq[Column] type is less confusing and kind of tells people by itself that we are now support complex types in pivot values.
My question was whether that's your speculation or actual feedback from users, since the original interface has existed for a few years and I haven't seen any complaints about this so far, as far as I can tell.
It's okay if we clearly document this with some examples. We wouldn't necessarily need to introduce differences between the same overloaded APIs.
My question was that it's from your speculation or actual feedback from users...
This is actual feedback from our users who want to pivot by multiple columns. For now they have to use external systems (even Microsoft Excel does it better) for pivoting by many columns, because Spark doesn't allow it. For example, you cannot express this on the latest release:
trainingSales
.groupBy($"sales.year")
.pivot(struct(lower($"sales.course"), $"training"), Seq(
struct(lit("dotnet"), lit("Experts")),
struct(lit("java"), lit("Dummies")))
).agg(sum($"sales.earnings"))
via def pivot(pivotColumn: String, values: Seq[Any]). I am not speaking about the recently added method def pivot(pivotColumn: Column, values: Seq[Any]), which we are going to make more precise by eliminating the unnecessarily generic type Any.
I think the point is whether users really get confused by Any or not, because I as a user have used this heavily and have been fine with it for a long time. In that case, I thought we'd better keep it consistent with the original one.
Test build #94385 has finished for PR 22030 at commit
.collect()
.map(_.get(0))
.collect {
  case row: GenericRow => struct(row.values.map(lit): _*)
I suspect this will not work for nested struct types or, say, multiple pivot columns with nested types. Could you please add a test like:
test("pivoting column list") {
val expected = ...
val df = trainingSales
.groupBy($"sales.year")
.pivot(struct($"sales", $"training"))
.agg(sum($"sales.earnings"))
checkAnswer(df, expected)
}
And can we also check if it works for other complex nested types, like Array(Struct(...))?
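A hedged sketch of such an additional check, following the shape of the test above (`trainingSales`, `checkAnswer`, and the elided expected rows come from the existing suite; wrapping the key in array(...) is just one way to obtain an Array(Struct(...)) type):

```scala
test("pivoting by an array of structs") {
  val expected = ... // expected rows for the pivoted result
  // array(struct(...)) nests the struct key inside an ArrayType,
  // exercising value matching for Array(Struct(...)) pivot keys.
  val df = trainingSales
    .groupBy($"sales.year")
    .pivot(array(struct(lower($"sales.course"), $"training")))
    .agg(sum($"sales.earnings"))
  checkAnswer(df, expected)
}
```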
.sort(pivotColumn) // ensure that the output columns are in a consistent logical order
.collect()
.map(_.get(0))
.collect {
Use "map"?
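The suggestion matters because `collect` with a partial function silently drops non-matching elements, while `map` applies a total function and fails loudly on an unexpected shape. A plain-Scala sketch of the distinction (independent of Spark):

```scala
val values: Seq[Any] = Seq(1, "two", 3)

// collect skips anything the partial function does not match:
val kept = values.collect { case i: Int => i * 10 } // List(10, 30)

// map would throw a MatchError on the unmatched element "two":
// values.map { case i: Int => i * 10 }
```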
Dataset<Row> df = spark.table("courseSales");
List<Row> actual = df.groupBy("year")
  .pivot(col("course"), Arrays.asList(lit("dotNET"), lit("Java")))
  .agg(sum("earnings")).orderBy("year").collectAsList();
nit: 2 space indentation
   */
  def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = {
-   pivot(Column(pivotColumn), values)
+   pivot(Column(pivotColumn), values.map(lit))
This is going to allow pivot(String, Seq[Any]) to also take Column. Did I misread the code?
Yes, you did. This "old" interface only takes in a single named column (say, "a", but not "a+1") by its name, but we turn it into a Column just to reuse the same implementation.
So we allowed only literals but not generic Columns before, right?
Yes, with Seq[Any] we only allow literal values, not Columns.
This is going to allow pivot(String, Seq[Any]) also take Column
I think using "lit" here is causing the confusion then (perhaps @MaxGekk was not aware of that?). We should keep the current behavior of this signature as it is. Using Column(Literal.create(value)) would do.
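To make the suggested alternative concrete, here is a sketch over Spark's internals (`Column(expr)` and `Literal.create` are Catalyst/package-private APIs usable inside org.apache.spark.sql; their exact signatures may vary across versions):

```scala
// values.map(lit) lets a Column slip through, because lit on a Column
// returns it unchanged -- widening the old Seq[Any]-of-literals contract.
pivot(Column(pivotColumn), values.map(lit))

// Mapping through Literal.create instead accepts only plain values and
// fails fast on a Column, preserving the original behavior:
pivot(Column(pivotColumn), values.map(v => Column(Literal.create(v))))
```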
Hm? I think we should allow it and keep the assumption in #22030 (comment) true.
Please review this PR: #22316
What changes were proposed in this pull request?
In the PR, I propose to change the signature of the pivot() method and make the type of values more specific. The method

def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset

is replaced by:

def pivot(pivotColumn: Column, values: Seq[Column]): RelationalGroupedDataset
This change allows combining literal values into structs and performing pivoting by multiple columns.
How was this patch tested?
I added two tests, with and without specifying the values.
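As a sketch of the two tested modes, with and without an explicit values list (column names are borrowed from the example earlier in the thread; `trainingSales` is assumed to be in scope):

```scala
import org.apache.spark.sql.functions.{lit, lower, struct, sum}

// With explicit values: only the listed struct keys become output columns.
trainingSales
  .groupBy($"sales.year")
  .pivot(struct(lower($"sales.course"), $"training"), Seq(
    struct(lit("dotnet"), lit("Experts")),
    struct(lit("java"), lit("Dummies"))))
  .agg(sum($"sales.earnings"))

// Without values: the distinct values of the pivot expression are
// collected from the data first, then used as output columns.
trainingSales
  .groupBy($"sales.year")
  .pivot(struct(lower($"sales.course"), $"training"))
  .agg(sum($"sales.earnings"))
```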