[SPARK-8992] [SQL] Add pivot to dataframe api #7841
Conversation
|
Jenkins, ok to test. |
|
@rxin it looks like Jenkins forgot about building this. Can you help trigger the build again? |
|
Test build #1319 has finished for PR 7841 at commit
|
|
@aray FYI this didn't make it into the 1.5 release (was submitted too close to the feature freeze deadline), but we will try to include it in Spark 1.6. |
Conflicts: sql/core/src/test/scala/org/apache/spark/sql/TestData.scala
|
@rxin, do you want to revisit this now for 1.6? |
courseSales.groupBy($"year").pivot($"course", "dotNET", "Java").agg(sum($"earnings"))
Also, fixed master merge.
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
|
@rxin and @JoshRosen, this is ready for review now. |
|
@aray Thanks a lot for updating this. To help api design, can you take a look at other frameworks and see what their signatures look like? |
|
@rxin here is my summary of other frameworks' APIs. I'm going to use an example dataset from the pandas doc for all the examples (as df).
This API
scala> df.groupBy("A", "B").pivot("C", "small", "large").sum("D").show
+---+---+-----+-----+
| A| B|small|large|
+---+---+-----+-----+
|foo|two| 6| null|
|bar|two| 6| 7|
|foo|one| 1| 4|
|bar|one| 5| 4|
+---+---+-----+-----+
scala> df.groupBy("A", "B").pivot("C", "small", "large").agg(sum("D"), avg("D")).show
+---+---+------------+------------+------------+------------+
| A| B|small sum(D)|small avg(D)|large sum(D)|large avg(D)|
+---+---+------------+------------+------------+------------+
|foo|two| 6| 3.0| null| null|
|bar|two| 6| 6.0| 7| 7.0|
|foo|one| 1| 1.0| 4| 2.0|
|bar|one| 5| 5.0| 4| 4.0|
+---+---+------------+------------+------------+------------+
scala> df.pivot(Seq($"A", $"B"), $"C", Seq("small", "large"), sum($"D")).show
+---+---+-----+-----+
| A| B|small|large|
+---+---+-----+-----+
|foo|two| 6| null|
|bar|two| 6| 7|
|foo|one| 1| 4|
|bar|one| 5| 4|
+---+---+-----+-----+
We require a list of values for the pivot column because we need to know the output columns of the operator ahead of time. Pandas and reshape2 do not require this, but the comparable SQL operators do. We also allow multiple aggregations, which not all implementations allow.
pandas
The comparable method for pandas is pivot_table. Example:
>>> pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=np.sum)
        small  large
foo one     1      4
    two     6    NaN
bar one     5      4
    two     6      7
Pandas also allows multiple aggregations:
>>> pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=[np.sum, np.average])
            sum        average
C         large small    large small
A   B
bar one       4     5        4     5
    two       7     6        7     6
foo one       4     1        2     1
    two     NaN     6      NaN     3
reshape2 (R)
The comparable method for reshape2 is dcast:
> dcast(df, A + B ~ C, sum)
Using D as value column: use value.var to override.
A B large small
1 bar one 4 5
2 bar two 7 6
3 foo one 4 1
4 foo two 0 6
Note that by default cast fills with the value from applying fun.aggregate to a 0-length vector.
MS SQL Server
SELECT *
FROM df
pivot (sum(D) for C in ([small], [large])) p
http://sqlfiddle.com/#!3/cf887/3/0
Oracle 11g
SELECT *
FROM df
pivot (sum(D) for C in ('small', 'large')) p
http://sqlfiddle.com/#!4/29bc5/3/0
Oracle also allows multiple aggregations, with similar output to this API:
SELECT *
FROM df
pivot (sum(D) as sum, avg(D) as avg for C in ('small', 'large')) p
http://sqlfiddle.com/#!4/29bc5/5/0
Let me know if I can do anything else to help this along. Also would you mind adding me to the jenkins whitelist so I can test it? |
|
ok to test |
|
Test build #44249 has finished for PR 7841 at commit
|
|
I like your 2nd interface more (group by and then pivot), since it is easier to get that working for both Java and Scala. We can implement a simpler interface for Python/R that's closer to existing frameworks. How hard would it be to not require the values? |
|
@rxin, not requiring the values would necessitate a separate query for the distinct values of the column before the pivot query. It looks like at least some DataFrame operations (e.g., drop) would need the result, so even if we made Pivot.output lazy we would be running an unexpected job. If a user really doesn't want to specify the values, they can do the query explicitly:
df.groupBy("A", "B").pivot("C", df.select("C").distinct.collect.map(_.getString(0)): _*).sum("D")
Needing to know the output columns of an operator for analysis/planning is probably why the other SQL implementations also require the values (technically Oracle supports omitting them, but only in XML mode, where you essentially just get one column). |
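For illustration, here is a minimal sketch of that explicit-values workaround wrapped in a helper. The helper name and the cap on distinct values are my own assumptions, not part of this PR, and the pivot signature follows the varargs form used elsewhere in this thread.
import org.apache.spark.sql.{DataFrame, GroupedData}
// Hypothetical helper: collect the distinct pivot values first, then pivot on them.
// The cap guards against accidentally creating a huge number of output columns.
def pivotOnDistinct(df: DataFrame, groupCols: Seq[String], pivotCol: String,
                    maxValues: Int = 1000): GroupedData = {
  val values = df.select(pivotCol).distinct().limit(maxValues + 1).collect().map(_.get(0).toString)
  require(values.length <= maxValues, s"Too many distinct values in column $pivotCol")
  df.groupBy(groupCols.head, groupCols.tail: _*).pivot(pivotCol, values: _*)
}
// Usage with the example dataset from this discussion:
// pivotOnDistinct(df, Seq("A", "B"), "C").sum("D").show()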
Merge branch 'master' of https://github.com/apache/spark into sql-pivot Conflicts: sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
|
Test build #44643 has finished for PR 7841 at commit
|
|
@aray sorry was away for spark summit - back now and will get to this today. |
Merge branch 'master' of https://github.com/apache/spark into sql-pivot Conflicts: sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
|
Test build #45316 has finished for PR 7841 at commit
|
|
Test build #45366 has finished for PR 7841 at commit
|
|
@aray I talked to a few more people about this. Most like the 2nd API more (groupBy.pivot.agg). I think it'd also be better to remove the requirement to specify values, e.g. just take in a column without the values. So it looks like:
courseSales.groupBy($"year").pivot($"course").agg(sum($"earnings"))
Can you update the pull request? Thanks. |
|
BTW we can also later add a variant that allows users to specify values directly, in order to avoid materializing the intermediate data. |
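For readers following along, a quick sketch of the two call forms under discussion (dataset and column names follow the courseSales example used in this thread):
// assumes import sqlContext.implicits._ and org.apache.spark.sql.functions.sum
// Values inferred: Spark must first collect the distinct course values, i.e. an extra job.
courseSales.groupBy($"year").pivot($"course").agg(sum($"earnings"))
// Values supplied up front: no extra pass, and the output columns are known at analysis time.
courseSales.groupBy($"year").pivot($"course", "dotNET", "Java").agg(sum($"earnings"))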
…ot provided. Add unit tests for this scenario.
|
@rxin Updated, the values are now optional. |
can we remove this?
sure
Why?
|
@aray This is very cool! Here are a few things I'd like to discuss.
|
|
Test build #45645 has finished for PR 7841 at commit
|
- Use Literals for the pivot column values instead of strings.
- Change the separator when using multiple aggregates to `_` instead of a space.
- Some additional unit testing.
|
@yhuai RE your questions (3 was already addressed above):
The argument for not requiring values is, I think, convenience and also similarity to the other non-SQL tools mentioned above. The negative is performance, but since we give them the option to specify, I don't think that is a problem.
I initially used strings as the type since that is the common usage scenario. But I agree that using Literals is the better solution and will avoid casts, which could hurt performance. For convenience I kept the second method (changed accordingly). I really appreciate the review. Let me know if I can do anything else to help! |
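To illustrate the separator change noted in the commit above, the earlier multi-aggregate example's headers would now use `_` instead of a space (my rendering of the intended naming, not quoted from the PR):
scala> df.groupBy("A", "B").pivot("C", "small", "large").agg(sum("D"), avg("D")).show
+---+---+------------+------------+------------+------------+
|  A|  B|small_sum(D)|small_avg(D)|large_sum(D)|large_avg(D)|
+---+---+------------+------------+------------+------------+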
Seems we still need to check the number of children and make sure we have a single child?
It should now work fine with aggregate functions that have multiple children, as long as they ignore updates when all values are null. For example, Corr should work since it only updates its aggregation buffer when both of its arguments are non-null.
oh, yes. You are right.
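As a hedged illustration of the multi-child case above (the column E and the exact call are made up for this sketch; corr is the two-argument aggregate discussed):
// assumes import org.apache.spark.sql.functions.{corr, sum}
// For rows outside a given pivot bucket both inputs are null, so Corr skips the update
// and each bucket's correlation is computed only from its own rows.
df.groupBy("A").pivot("C", "small", "large").agg(corr("D", "E"), sum("D")).show()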
|
Test build #45659 has finished for PR 7841 at commit
|
…o prevent unintended OOM errors.
|
@yhuai I think this addresses everything we discussed, let me know if I missed anything or if there is anything else I can do. Again, thanks for the code review. |
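Side note for later readers: the commit above limits how many distinct values the value-less pivot will collect; in released Spark versions this cap is exposed as the spark.sql.pivotMaxValues setting (I'm inferring the connection from the commit message, so treat the exact knob as an assumption). A minimal usage sketch:
// Raise the cap before pivoting on a high-cardinality column, or pass the values explicitly.
sqlContext.setConf("spark.sql.pivotMaxValues", "50000")
courseSales.groupBy($"year").pivot($"course").agg(sum($"earnings"))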
|
LGTM pending jenkins. |
|
Test build #45673 has finished for PR 7841 at commit
|
|
Thanks! Merging to master and branch 1.6. |
This adds a pivot method to the DataFrame API.
Following the lead of cube and rollup, this adds a Pivot operator that is translated into an Aggregate by the analyzer.
Currently the syntax is like:
~~courseSales.pivot(Seq($"year"), $"course", Seq("dotNET", "Java"), sum($"earnings"))~~
~~Would we be interested in the following syntax also/alternatively? and~~
courseSales.groupBy($"year").pivot($"course", "dotNET", "Java").agg(sum($"earnings"))
//or
courseSales.groupBy($"year").pivot($"course").agg(sum($"earnings"))
Later we can add it to `SQLParser`, but as Hive doesn't support it we can't add it there, right?
~~Also what would be the suggested Java friendly method signature for this?~~
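To make the "translated into an Aggregate" remark concrete, here is a hedged sketch, in plain DataFrame operations, of what the rewrite amounts to conceptually (my own illustration, not the PR's actual analyzer rule):
// assumes import org.apache.spark.sql.functions.{sum, when} and sqlContext.implicits._
// One conditional aggregate per pivot value, under the same grouping; when() yields null
// for non-matching rows and sum ignores nulls, which mirrors the pivot semantics.
courseSales.groupBy($"year").agg(
  sum(when($"course" === "dotNET", $"earnings")).as("dotNET"),
  sum(when($"course" === "Java", $"earnings")).as("Java")
)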
Author: Andrew Ray <[email protected]>
Closes #7841 from aray/sql-pivot.
(cherry picked from commit b8ff688)
Signed-off-by: Yin Huai <[email protected]>
|
@aray do you want to submit a pull request for python api too? |
|
@rxin sure I'll put together a PR for the python API tonight |
|
@aray this pull request was highlighted in http://www.slideshare.net/databricks/deep-dive-into-catalyst-apache-spark-20s-optimizer |
|
thank you |