[SPARK-15780][SQL] Support mapValues on KeyValueGroupedDataset #13526
Conversation
Test build #60054 has finished for PR 13526 at commit
 * data. The grouping key is unchanged by this.
 *
 * {{{
 * // Create values grouped by key from a Dataset[(K, V)]
the example is wrong here. can we use a proper java 8 example?
oh.. it should be groupByKey, not groupBy. woops
i will also comment that it's Scala, i guess,
and put a proper Java example in the Java API
my comment was targeting the wrong method. I was referring to the one below, which should come with a Java 8 example.
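For reference, a sketch of what the corrected Scala example might look like (assuming a SparkSession named `spark`; the exact wording of the final Scaladoc is not reproduced here):

```scala
import spark.implicits._

// Create values grouped by key from a Dataset[(K, V)]:
// groupByKey (not groupBy) returns a KeyValueGroupedDataset,
// and mapValues transforms the values while leaving the key unchanged.
val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
val grouped = ds.groupByKey(_._1).mapValues(_._2)
```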
cc @cloud-fan too
I doubt if this feature is really useful? I think users can easily call … On the other hand, this implementation is kind of inefficient; every time we call … If we do want to add this feature, we should add optimizer rules for this case. But it's not trivial, and may not be worth it for such a rare case.
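A minimal sketch of the alternative being alluded to, assuming a spark-shell style session with a SparkSession `spark` in scope: transforming the value before grouping versus using the mapValues proposed here.

```scala
import spark.implicits._

val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

// Transform the value first, then group: no extra operator after grouping.
val before = ds.map { case (k, v) => (k, v * 10) }
  .groupByKey(_._1)
  .mapGroups((k, vs) => (k, vs.map(_._2).sum))

// Group first, then transform the values with the proposed mapValues.
val after = ds.groupByKey(_._1)
  .mapValues(_._2 * 10)
  .mapGroups((k, vs) => (k, vs.sum))
```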
Test build #60082 has finished for PR 13526 at commit
Test build #60080 has finished for PR 13526 at commit
see this conversation:

mapGroups is not a very interesting API, since without support for secondary sort (and hence no fold operations) pushing all the values into the reducer never really makes sense. so the interesting APIs are reduceGroups (when it's fixed to be efficient and not use mapGroups) and agg.

i am curious to know why appending a column is inefficient, especially when it never materializes? i am open to different designs.

about this being a rare case: i would argue the opposite. i expect to see a lot of key-value datasets (…)
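To make that argument concrete, a hedged sketch of the kind of key-value pipeline being described (names and data are illustrative), where mapValues feeds reduceGroups rather than mapGroups:

```scala
import spark.implicits._

// A Dataset[(K, V)]-shaped input, the case argued to be common.
val events = Seq(("user1", 3L), ("user1", 5L), ("user2", 7L)).toDS()

// Group by the key, drop it from the value side, then reduce per key.
val totals = events
  .groupByKey(_._1)
  .mapValues(_._2)
  .reduceGroups(_ + _)   // Dataset[(String, Long)]
```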
can you explain a bit what is inefficient and would need an optimizer rule?
OK, now I agree this is a useful API. For performance, I would expect that … I'll take a closer look tomorrow, and let's discuss the best way to do it.
ok i will study the physical plans for both and try to understand why one would be slower

it seems the AppendColumns are not collapsed
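A small sketch of how one might inspect this in a spark-shell session (assuming `spark` and its implicits); `explain(true)` prints the analyzed, optimized, and physical plans, which is where the stacked AppendColumns operators would show up:

```scala
import spark.implicits._

val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

// With mapValues: one AppendColumns for the grouping key and another for the
// re-mapped value, which do not get collapsed into a single operator.
ds.groupByKey(_._1).mapValues(_._2).reduceGroups(_ + _).explain(true)

// Without mapValues: only the key-producing AppendColumns appears.
ds.groupByKey(_._1).reduceGroups((a, b) => (a._1, a._2 + b._2)).explain(true)
```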
A possible approach may be to just keep the function given by …
the tricky part with that is that (ds: Dataset[(K, V)]).groupByKey(_._1).mapValues(_._2) should return a …
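In other words, the result should keep the key type and only re-encode the value side. A small sketch of the typing involved (assuming the API as it was eventually merged):

```scala
import org.apache.spark.sql.KeyValueGroupedDataset
import spark.implicits._

val ds = Seq(("a", 1), ("b", 2)).toDS()

// groupByKey(_._1) yields KeyValueGroupedDataset[String, (String, Int)];
// mapValues(_._2) then narrows the value type while the key stays String.
val kv: KeyValueGroupedDataset[String, Int] = ds.groupByKey(_._1).mapValues(_._2)
```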
could we "rewind"/undo the append for the key and change it to a map that creates new data values and key? so remove one append and replace it with another operation?
    child)
}

def apply[T : Encoder, U : Encoder](
Here you use T : Encoder, i.e. with spaces before and after : while...
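For readers who don't recognize the nit: both spellings denote the same Scala context bound; the comment is only about using one spacing style consistently. A trivial illustration (the method names are made up):

```scala
import org.apache.spark.sql.Encoder

// Same meaning, different spacing around the colon of the context bound.
def identityA[T : Encoder](value: T): T = value
def identityB[T: Encoder](value: T): T = value
```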
Test build #62388 has finished for PR 13526 at commit
cc @cloud-fan This looks good to me -- but I don't remember why we didn't merge it earlier.
it lacks an optimizer rule to collapse AppendColumns
retest this please
Test build #67264 has finished for PR 13526 at commit
Test build #67268 has finished for PR 13526 at commit
Alright, merging in master. Thanks. @koertkuipers would you be able to add the optimizer rule?

@rxin i can give it a try (the optimizer rule). looking at it, currently what happens under the hood with … how do we intend to optimize chaining two AppendColumns with functions …? optimized, it would look something like this under the hood: … this would require some serious refactoring; AppendColumnsExec can not facilitate such a flow currently, i think
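The inline plans in this comment did not survive formatting, so here is a deliberately simplified, self-contained toy model of the idea being discussed (it does not use Spark's real AppendColumns or Catalyst classes, and all names are hypothetical): two stacked append-columns steps that both read the original row can be fused into one step that evaluates both functions in a single pass.

```scala
// Toy model only: hypothetical names, not Spark internals.
sealed trait ToyPlan
case class Source(rows: Seq[Any]) extends ToyPlan
// Appends columns computed by `func` from the original (deserialized) row.
case class AppendCols(func: Any => Seq[Any], child: ToyPlan) extends ToyPlan

// The "optimizer rule": two stacked AppendCols both take the original row as
// input, so they can be collapsed into one AppendCols that evaluates both
// functions per row instead of deserializing the row twice.
def collapse(plan: ToyPlan): ToyPlan = plan match {
  case AppendCols(outer, AppendCols(inner, child)) =>
    collapse(AppendCols(row => inner(row) ++ outer(row), child))
  case other => other
}
```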
To optimize …
@cloud-fan i can try to optimize …
That's a good point, let's focus on …
@cloud-fan that makes sense to me, but it's definitely not a quick win to create that optimization.
if they chain like that then i think i know how to do the optimization. but do they? look for example at dataset.groupByKey(...).mapValues(...):

Dataset[T].groupByKey[K] uses function T => K and creates KeyValueGroupedDataset[K, T]. KeyValueGroupedDataset[K, T].mapValues[W] uses function T => W and creates KeyValueGroupedDataset[K, W].

so i have T => K and then T => W
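Spelling the observation out as plain Scala: both functions take the original T, so at the function level they can be fused into a single T => (K, W); whether Catalyst can be refactored to exploit that is the open question above. A standalone sketch (names are mine):

```scala
// Fuse the key function from groupByKey with the value function from mapValues.
def fuse[T, K, W](keyFunc: T => K, valueFunc: T => W): T => (K, W) =
  t => (keyFunc(t), valueFunc(t))

// For a Dataset[(String, Int)] grouped by _._1 with values mapped to _._2:
val fused: ((String, Int)) => (String, Int) =
  fuse[(String, Int), String, Int](_._1, _._2)
```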
## What changes were proposed in this pull request?

Add mapValues to KeyValueGroupedDataset

## How was this patch tested?

New test in DatasetSuite for groupBy function, mapValues, flatMap

Author: Koert Kuipers <[email protected]>

Closes apache#13526 from koertkuipers/feat-keyvaluegroupeddataset-mapvalues.
What changes were proposed in this pull request?
Add mapValues to KeyValueGroupedDataset
How was this patch tested?
New test in DatasetSuite for groupBy function, mapValues, flatMap
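A hedged sketch of the kind of check such a test performs (the actual test name and assertion helpers in DatasetSuite are not reproduced here; this is spark-shell style, assuming `spark` is in scope):

```scala
import spark.implicits._

// Exercises a grouping function, mapValues, and a flatMap over the groups.
val ds = Seq(("a", 10), ("a", 20), ("b", 1)).toDS()

val result = ds
  .groupByKey(_._1)
  .mapValues(_._2)
  .flatMapGroups((key, values) => Iterator((key, values.sum)))
  .collect()
  .toSet

assert(result == Set(("a", 30), ("b", 1)))
```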