[SPARK-25048][SQL] Pivoting by multiple columns in Scala/Java #22030
Conversation
@maryannxue Please have a look at the PR.
   * @since 2.4.0
   */
- def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset = {
+ def pivot(pivotColumn: Column, values: Seq[Column]): RelationalGroupedDataset = {
@HyukjinKwon I think this change is better than what #21699 did.
Hm, wouldn't it be better to allow Seq[Column] for both pivot(String ...) and pivot(Column ...) via Seq[Any], since pivot(String ...)'s signature already allows it?
BTW, we should document this in the param docs and describe the difference clearly. Otherwise the current API change makes the usage potentially quite confusing to me.
I am okay with it if that's what you all think. It should really be clearly documented then if we go ahead with the current approach.
@HyukjinKwon
You can just consider pivot(String, Seq[Any]) a simplified version of pivot(Column, Seq[Column]) for users who don't need multiple pivot columns or a pivot column of complex types. Given that we now have both the full-featured version and the simple version here, I don't think adding another signature is necessary.
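For illustration, the two versions might be used roughly like this (a sketch only; `df` and its columns `year`, `course`, `training`, and `earnings` are hypothetical, and a Spark session with its implicits is assumed to be in scope):

```scala
import org.apache.spark.sql.functions.{lit, lower, struct, sum}

// Simple version: a single named pivot column with plain literal values.
df.groupBy("year")
  .pivot("course", Seq("dotNET", "Java"))
  .agg(sum("earnings"))

// Full version: an arbitrary Column expression with Column values,
// which also covers multiple pivot columns and complex types.
df.groupBy($"year")
  .pivot(struct(lower($"course"), $"training"),
         Seq(struct(lit("dotnet"), lit("Experts"))))
  .agg(sum($"earnings"))
```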
Yea, I didn't mean to add another signature. My only worry is that pivot(String, Seq[Any]) can take actual values as well whereas pivot(Column, Seq[Column]) does not allow actual values, right?
I was thinking we should allow both cases for both APIs. Otherwise it can be confusing, can't it? These differences should really be clarified.
No. Seq[Any] takes literal values (objects); Seq[Column] takes Column expressions.
I mean:
Before:
scala> val df = spark.range(10).selectExpr("struct(id) as a")
df: org.apache.spark.sql.DataFrame = [a: struct<id: bigint>]
scala> df.groupBy().pivot("a", Seq(struct(lit(1)))).count().show()
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Column named_struct(col1, 1)
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
at org.apache.spark.sql.RelationalGroupedDataset$$anonfun$pivot$1.apply(RelationalGroupedDataset.scala:419)
at org.apache.spark.sql.RelationalGroupedDataset$$anonfun$pivot$1.apply(RelationalGroupedDataset.scala:419)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:419)
at org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:338)
... 51 elided
After:
scala> val df = spark.range(10).selectExpr("struct(id) as a")
df: org.apache.spark.sql.DataFrame = [a: struct<id: bigint>]
scala> df.groupBy().pivot("a", Seq(struct(lit(1)))).count().show()
+---+
|[1]|
+---+
| 1|
+---+
I think @MaxGekk's intention was to keep the old signature as it is but somehow used "lit" which takes Column too. Correct me if I'm wrong, @MaxGekk.
So back to the choice between pivot(Column, Seq[Column]) and pivot(Column, Seq[Any]), I think having an explicit Seq[Column] type is less confusing and kind of tells people by itself that we now support complex types in pivot values.
I think #22030 (comment) makes perfect sense. We really don't need to make it complicated.
having an explicit Seq[Column] type is less confusing and kind of tells people by itself that we are now support complex types in pivot values.
My question was whether that's your speculation or actual feedback from users, since the original interface has existed for a few years and I haven't seen any complaints about this so far, as far as I can tell.
It's okay if we clearly document this with some examples. We wouldn't necessarily need to introduce differences between the same overloaded APIs.
My question was that it's from your speculation or actual feedback from users...
This is actual feedback from our users who want to pivot by multiple columns. For now they have to use external systems (even Microsoft Excel does it better) for pivoting by many columns, because Spark doesn't allow it. For example, you cannot express this on the latest release:
trainingSales
.groupBy($"sales.year")
.pivot(struct(lower($"sales.course"), $"training"), Seq(
struct(lit("dotnet"), lit("Experts")),
struct(lit("java"), lit("Dummies")))
).agg(sum($"sales.earnings"))
via def pivot(pivotColumn: String, values: Seq[Any]). I am not speaking about the recently added method def pivot(pivotColumn: Column, values: Seq[Any]), which we are going to make more precise by eliminating the unnecessarily generic type Any.
I think the point is whether users really get confused by Any or not, because I as a user have used this heavily and have been fine with it for a long time. In that case, I thought we'd better keep it consistent with the original one.
Test build #94385 has finished for PR 22030 at commit
.collect()
.map(_.get(0))
.collect {
  case row: GenericRow => struct(row.values.map(lit): _*)
I suspect this will not work for nested struct types or, say, multiple pivot columns with nested types. Could you please add a test like:
test("pivoting column list") {
val expected = ...
val df = trainingSales
.groupBy($"sales.year")
.pivot(struct($"sales", $"training"))
.agg(sum($"sales.earnings"))
checkAnswer(df, expected)
}
And can we also check if it works for other complex nested types, like Array(Struct(...))?
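A hedged sketch of such an additional check, following the shape of the test above (`trainingSales`, `checkAnswer`, and the elided expected rows come from the existing suite; wrapping the key in array(...) is just one way to obtain an Array(Struct(...)) type):

```scala
test("pivoting by an array of structs") {
  val expected = ... // expected rows for the pivoted result
  // array(struct(...)) nests the struct key inside an ArrayType,
  // exercising value matching for Array(Struct(...)) pivot keys.
  val df = trainingSales
    .groupBy($"sales.year")
    .pivot(array(struct(lower($"sales.course"), $"training")))
    .agg(sum($"sales.earnings"))
  checkAnswer(df, expected)
}
```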
.sort(pivotColumn) // ensure that the output columns are in a consistent logical order
.collect()
.map(_.get(0))
.collect {
Use "map"?
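The suggestion matters because `collect` with a partial function silently drops non-matching elements, while `map` applies a total function and fails loudly on an unexpected shape. A plain-Scala sketch of the distinction (independent of Spark):

```scala
val values: Seq[Any] = Seq(1, "two", 3)

// collect skips anything the partial function does not match:
val kept = values.collect { case i: Int => i * 10 } // List(10, 30)

// map would throw a MatchError on the unmatched element "two":
// values.map { case i: Int => i * 10 }
```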
Dataset<Row> df = spark.table("courseSales");
List<Row> actual = df.groupBy("year")
  .pivot(col("course"), Arrays.asList(lit("dotNET"), lit("Java")))
  .agg(sum("earnings")).orderBy("year").collectAsList();
nit: 2 space indentation
   */
  def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset = {
-   pivot(Column(pivotColumn), values)
+   pivot(Column(pivotColumn), values.map(lit))
This is going to allow pivot(String, Seq[Any]) to also take Column. Did I misread the code?
Yes, you did. This "old" interface only takes in a single named column (say, "a", but not "a+1") by its name, but we turn it into a Column just to reuse the same implementation.
So we allowed only literals but not generic Columns before, right?
Yes, with Seq[Any] we only allow literal values, not Columns.
This is going to allow pivot(String, Seq[Any]) also take Column
I think using "lit" here is causing the confusion then (perhaps @MaxGekk was not aware of that?). We should keep the current behavior of this signature as it is. Using Column(Literal.create(value)) would do.
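To make the suggested alternative concrete, here is a sketch over Spark's internals (`Column(expr)` and `Literal.create` are Catalyst/package-private APIs usable inside org.apache.spark.sql; their exact signatures may vary across versions):

```scala
// values.map(lit) lets a Column slip through, because lit on a Column
// returns it unchanged -- widening the old Seq[Any]-of-literals contract.
pivot(Column(pivotColumn), values.map(lit))

// Mapping through Literal.create instead accepts only plain values and
// fails fast on a Column, preserving the original behavior:
pivot(Column(pivotColumn), values.map(v => Column(Literal.create(v))))
```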
Hm? I think we should allow it and keep the assumption in #22030 (comment) true.
Please review this PR: #22316
What changes were proposed in this pull request?
In the PR, I propose to change the signature of the pivot() method and make the type of values more specific. The method

def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset

is replaced by:

def pivot(pivotColumn: Column, values: Seq[Column]): RelationalGroupedDataset
This change allows combining literal values into structs and performing pivoting by multiple columns.
How was this patch tested?
I added two tests, with and without specifying the values.
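As a sketch of the two tested modes, with and without an explicit values list (column names are borrowed from the example earlier in the thread; `trainingSales` is assumed to be in scope):

```scala
import org.apache.spark.sql.functions.{lit, lower, struct, sum}

// With explicit values: only the listed struct keys become output columns.
trainingSales
  .groupBy($"sales.year")
  .pivot(struct(lower($"sales.course"), $"training"), Seq(
    struct(lit("dotnet"), lit("Experts")),
    struct(lit("java"), lit("Dummies"))))
  .agg(sum($"sales.earnings"))

// Without values: the distinct values of the pivot expression are
// collected from the data first, then used as output columns.
trainingSales
  .groupBy($"sales.year")
  .pivot(struct(lower($"sales.course"), $"training"))
  .agg(sum($"sales.earnings"))
```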