
Conversation

MaxGekk (Member) commented on Aug 7, 2018:

What changes were proposed in this pull request?

In this PR, I propose to change the signature of the pivot() method and make the type of values more specific. The method

def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset

is replaced by:

def pivot(pivotColumn: Column, values: Seq[Column]): RelationalGroupedDataset

This change allows combining literal values into structs and performing pivoting by multiple columns.
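For example, pivoting by two columns at once becomes expressible by packing them into a struct of pivot columns matched against struct literals (a hypothetical sketch for spark-shell; the dataset and column names are illustrative only):

import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical data with two candidate pivot columns, course and training.
val df = Seq(
  (2012, "dotNET", "Experts", 10000),
  (2012, "Java", "Dummies", 20000)
).toDF("year", "course", "training", "earnings")

// Pack both columns into a struct and pivot on matching struct literals.
df.groupBy($"year")
  .pivot(struct($"course", $"training"), Seq(
    struct(lit("dotNET"), lit("Experts")),
    struct(lit("Java"), lit("Dummies"))))
  .agg(sum($"earnings"))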

How was this patch tested?

I added two tests, with and without an explicit values specification.

MaxGekk (Member, Author) commented on Aug 7, 2018:

@maryannxue Please have a look at the PR.

 * @since 2.4.0
 */
-  def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
+  def pivot(pivotColumn: Column, values: Seq[Column]): RelationalGroupedDataset = {
Member:

@HyukjinKwon I think this change is better than what #21699 did.

Member:

Hm, wouldn't it be better to allow Seq[Column] for both pivot(String ...) and pivot(Column ...) via Seq[Any], since pivot(String ...)'s signature already allows it?

BTW, we should document this in the param description and describe the difference clearly in the documentation. Otherwise, it seems the current API change makes the usage potentially quite confusing to me.

Member:

I am okay if that's what you all think. It should then be clearly documented if we go ahead with the current way.

maryannxue (Contributor) commented on Aug 8, 2018:

@HyukjinKwon
You can just consider pivot(String, Seq[Any]) a simplified version of pivot(Column, Seq[Column]) for users who don't care to use multiple pivot columns or a pivot column of a complex type. Given that we now have the fully functional version and the simple version here, I don't think adding another signature is necessary.
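For reference, the two flavors side by side (a sketch; a courseSales dataset with year, course, and earnings columns is assumed, as in the tests elsewhere in this PR):

import org.apache.spark.sql.functions._

// Simple version: pivot column given by name, values as plain literals.
courseSales.groupBy("year")
  .pivot("course", Seq("dotNET", "Java"))
  .agg(sum("earnings"))

// Fully functional version: pivot column as an expression, values as Column
// literals, which also admits struct values for multi-column pivoting.
courseSales.groupBy("year")
  .pivot(col("course"), Seq(lit("dotNET"), lit("Java")))
  .agg(sum("earnings"))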

Member:

Yea, I didn't mean to add another signature. My only worry is that pivot(String, Seq[Any]) can take actual values as well whereas pivot(Column, Seq[Column]) does not allow actual values, right?

I was thinking we should allow both cases for both APIs. Otherwise, it can be confusing, can't it? These differences should really be clarified.

Member:

No. Seq[Any] takes literal values (objects); Seq[Column] takes Column expressions.

I mean:
Before:

scala> val df = spark.range(10).selectExpr("struct(id) as a")
df: org.apache.spark.sql.DataFrame = [a: struct<id: bigint>]

scala> df.groupBy().pivot("a", Seq(struct(lit(1)))).count().show()
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Column named_struct(col1, 1)
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
  at org.apache.spark.sql.RelationalGroupedDataset$$anonfun$pivot$1.apply(RelationalGroupedDataset.scala:419)
  at org.apache.spark.sql.RelationalGroupedDataset$$anonfun$pivot$1.apply(RelationalGroupedDataset.scala:419)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:296)
  at org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:419)
  at org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:338)
  ... 51 elided

After:

scala> val df = spark.range(10).selectExpr("struct(id) as a")
df: org.apache.spark.sql.DataFrame = [a: struct<id: bigint>]

scala> df.groupBy().pivot("a", Seq(struct(lit(1)))).count().show()
+---+
|[1]|
+---+
|  1|
+---+

Contributor:

I think @MaxGekk's intention was to keep the old signature as it is, but he somehow used "lit", which takes Column too. Correct me if I'm wrong, @MaxGekk.
So, back to the choice between pivot(Column, Seq[Column]) and pivot(Column, Seq[Any]): I think having an explicit Seq[Column] type is less confusing and kind of tells people by itself that we now support complex types in pivot values.

Member:

I think #22030 (comment) makes perfect sense. We really don't need to make it complicated.

> having an explicit Seq[Column] type is less confusing and kind of tells people by itself that we now support complex types in pivot values.

My question was whether this comes from your speculation or from actual feedback from users, since the original interface has existed for a few years and I haven't seen any complaints about it so far, as far as I can tell.

It's okay if we clearly document this with some examples. It wouldn't necessarily create differences between the same overloaded APIs.

MaxGekk (Member, Author):

> My question was whether this comes from your speculation or from actual feedback from users...

This is actual feedback from our users, who want to pivot by multiple columns. For now they have to use external systems (even Microsoft Excel does it better) for pivoting by many columns, because Spark doesn't allow this. For example, you cannot express the following on the latest release:

trainingSales
      .groupBy($"sales.year")
      .pivot(struct(lower($"sales.course"), $"training"), Seq(
        struct(lit("dotnet"), lit("Experts")),
        struct(lit("java"), lit("Dummies")))
      ).agg(sum($"sales.earnings"))

via def pivot(pivotColumn: String, values: Seq[Any]). I am not speaking about the recently added method def pivot(pivotColumn: Column, values: Seq[Any]), which we are going to make more precise by eliminating the unnecessarily generic type Any.

Member:

I think the point is whether users really get confused by Any or not; I, as a user, have used this heavily and have been fine with it for a long time. In that case, I thought we had better keep it consistent with the original one.

SparkQA commented on Aug 7, 2018:

Test build #94385 has finished for PR 22030 at commit 90fc82b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

        .collect()
        .map(_.get(0))
        .collect {
          case row: GenericRow => struct(row.values.map(lit): _*)
Contributor:

I suspect this will not work for nested struct types, or, say, multiple pivot columns with nested types. Could you please add a test like:

  test("pivoting column list") {
    val expected = ...
    val df = trainingSales
      .groupBy($"sales.year")
      .pivot(struct($"sales", $"training"))
      .agg(sum($"sales.earnings"))
     checkAnswer(df, expected)
  }

And can we also check if it works for other complex nested types, like Array(Struct(...))?
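A hypothetical variant for the Array(Struct(...)) case, in the same style as the sketch above:

  test("pivoting a column of array-of-struct type") {
    val expected = ...  // expected rows elided, as above
    val df = trainingSales
      .groupBy($"sales.year")
      .pivot(array(struct($"sales.course", $"training")))
      .agg(sum($"sales.earnings"))
    checkAnswer(df, expected)
  }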

        .sort(pivotColumn) // ensure that the output columns are in a consistent logical order
        .collect()
        .map(_.get(0))
        .collect {
Contributor:

Use "map"?

Dataset<Row> df = spark.table("courseSales");
List<Row> actual = df.groupBy("year")
.pivot(col("course"), Arrays.asList(lit("dotNET"), lit("Java")))
.agg(sum("earnings")).orderBy("year").collectAsList();
Member:

nit: 2 space indentation

 */
  def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = {
-    pivot(Column(pivotColumn), values)
+    pivot(Column(pivotColumn), values.map(lit))
Member:

This is going to allow pivot(String, Seq[Any]) to also take Column. Did I misread the code?

Contributor:

Yes, you did. This "old" interface only takes in a single named column (say, "a", but not "a+1") by its name, but we turn it into a Column just to reuse the same implementation.
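A hypothetical illustration of that distinction (column names made up): the String overload resolves a column by name, while the Column overload accepts arbitrary expressions:

import org.apache.spark.sql.functions._

df.groupBy("year").pivot("course")          // OK: "course" is resolved as a column name
df.groupBy("year").pivot("upper(course)")   // fails: looked up as a column literally named
                                            // "upper(course)", not parsed as an expression
df.groupBy("year").pivot(upper($"course"))  // OK: the Column overload takes any expression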

Member:

So we allowed only literals, but not generic columns, before, right?

Contributor:

Yes, with Seq[Any] we only allow literal values, not Columns.

Contributor:

> This is going to allow pivot(String, Seq[Any]) to also take Column

I think using "lit" here is causing the confusion then (perhaps @MaxGekk was not aware of that?). We should keep the current behavior of this signature as it is. Using Column(Literal.create(value)) would do.
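For reference, the passthrough behavior of lit at issue here (a minimal sketch; functions.lit returns a Column argument unchanged rather than wrapping it):

import org.apache.spark.sql.functions._

val asValue: Any = "dotNET"
lit(asValue)      // wraps the value in a literal Column, as before

val asColumn: Any = col("course")
lit(asColumn)     // returns col("course") unchanged, so values.map(lit)
                  // quietly accepts Columns too, which is the behavior
                  // change debated above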

Member:

Hm? I think we should allow it, so that the assumption in #22030 (comment) stays true.

MaxGekk (Member, Author) commented on Sep 2, 2018:

Please review PR #22316.

MaxGekk closed this on Sep 2, 2018.
MaxGekk deleted the pivoting-by-multiple-columns branch on Aug 17, 2019.