-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-29427][SQL] Add API to convert RelationalGroupedDataset to KeyValueGroupedDataset #26509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
Outdated
Show resolved
Hide resolved
|
Test build #113733 has finished for PR 26509 at commit
|
|
retest this please. |
|
Test build #113744 has finished for PR 26509 at commit
|
|
Test build #113748 has finished for PR 26509 at commit
|
| @Stable | ||
| class RelationalGroupedDataset protected[sql]( | ||
| private[sql] val df: DataFrame, | ||
| class RelationalGroupedDataset[T] protected[sql]( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a stable API, is it OK to add type parameter? @srowen @dongjoon-hyun
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the goal seems to extend Dataset[Row] to Dataset[T] in 3.0.0, I'm not sure about this approach. Not only this change (the generic class), line 51 also changes from df: DataFrame to ds: Dataset[T]. If there is some 3rd party classes extending this with override, this variable name and type change will break those classes.
@viirya . Do we need this change for your original goal, Add API ...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, it seems that MiMa doesn't complain about this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another option is to add new API as .as[(K, T)]. It's not ideal as T should be known when we create the RelationalGroupedDataset, but we can avoid changing the stable API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's a binary-incompatible change because of type erasure, so MiMa doesn't flag it. However it will indeed probably not be source-compatible. We should only do it if there's a pretty necessary reason in 3.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun > Do we need this change for your original goal, Add API ...?
To change from df: DataFrame to ds: Dataset[T] is because we need the typed information. For DataFrame, we have no idea about original object type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.as[(K, T)] sounds good so we don't need changing the stable API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would break source compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just add as[K, T] and do not add typed parameter to RelationalGroupedDataset class, does it still break source compatibility?
| val additionalCols = aliasedGrps.filter(g => !df.logicalPlan.outputSet.contains(g.toAttribute)) | ||
| val qe = Dataset.ofRows( | ||
| df.sparkSession, | ||
| Project(df.logicalPlan.output ++ additionalCols, df.logicalPlan)).queryExecution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems inefficient. Can we make KeyValueGroupedDataset.groupingAttributes a Seq[Expression]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it also break break source compatibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
groupingAttributes is private in KeyValueGroupedDataset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KeyValueGroupedDataset does not produce grouping attributes. These grouping attributes still come from the given queryExecution.
If we change groupingAttributes to Seq[NamedExpression], we still need add this Project to produce these grouping attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need the grouping attributes? AFAIK it's used to specify the required distribution, which doesn't have to be attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, we use groupingAttributes to construct MapGroups in flatMapGroups. These should be attributes so the UnresolvedDeserializer can be resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make MapGroupsExec follow aggregate, to calculate grouping columns ahead and put it in the buffer row, then run key/value deserializer on buffer row.
But I admit that it's a big change, we can do it in the future.
|
This is going to break source compatibility in major ways. Doesn't make sense to do it this way. |
|
Test build #113856 has finished for PR 26509 at commit
|
|
retest this please. |
|
Test build #113860 has finished for PR 26509 at commit
|
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
Outdated
Show resolved
Hide resolved
|
Test build #114122 has finished for PR 26509 at commit
|
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
Outdated
Show resolved
Hide resolved
|
Test build #114137 has finished for PR 26509 at commit
|
|
retest this please |
|
Test build #114144 has finished for PR 26509 at commit
|
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Merged to master.
Thank you all!
|
Thanks a lot all for solving my issue, especially @viirya 🙌 |
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too. Sorry for late response.
What changes were proposed in this pull request?
This PR proposes to add
asAPI to RelationalGroupedDataset. It creates KeyValueGroupedDataset instance using given grouping expressions, instead of a typed function in groupByKey API. Because it can leverage existing columns, it can use existing data partition, if any, when doing operations like cogroup.Why are the changes needed?
Currently if users want to do cogroup on DataFrames, there is no good way to do except for KeyValueGroupedDataset.
Does this PR introduce any user-facing change?
Yes, this adds a new
asAPI to RelationalGroupedDataset. Users can use it to create KeyValueGroupedDataset and do cogroup.How was this patch tested?
Unit tests.