
Conversation

@viirya (Member) commented Nov 13, 2019

What changes were proposed in this pull request?

This PR proposes to add an as API to RelationalGroupedDataset. It creates a KeyValueGroupedDataset instance using the given grouping expressions, instead of a typed function as in the groupByKey API. Because it leverages existing columns, it can reuse the existing data partitioning, if any, when doing operations like cogroup.

Why are the changes needed?

Currently, if users want to do cogroup on DataFrames, there is no good way to do it except through KeyValueGroupedDataset, which has two problems:

  1. KeyValueGroupedDataset ignores the existing data partitioning, if any.
  2. groupByKey calls a typed function to create additional key columns. You cannot reuse existing columns, even if you just need to group by them.
// df1 and df2 are both partitioned and sorted.
val df1 = Seq((1, 2, 3), (2, 3, 4)).toDF("a", "b", "c")
  .repartition($"a").sortWithinPartitions("a")
val df2 = Seq((1, 2, 4), (2, 3, 5)).toDF("a", "b", "c")
  .repartition($"a").sortWithinPartitions("a")
// This groupBy.as.cogroup won't unnecessarily repartition the data.
val df3 = df1.groupBy("a").as[Int]
  .cogroup(df2.groupBy("a").as[Int]) { case (key, data1, data2) =>
    data1.zip(data2).map { p =>
      p._1.getInt(2) + p._2.getInt(2)
    }
  }
== Physical Plan ==
*(5) SerializeFromObject [input[0, int, false] AS value#11247]
+- CoGroup org.apache.spark.sql.DataFrameSuite$$Lambda$4922/1206709281@6eec1b6f, a#11209: int, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [a#11209], [a#11225], [a#11209, b#11210, c#11211], [a#11225, b#11226, c#11227], obj#11246: int
   :- *(2) Sort [a#11209 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(a#11209, 5), false, [id=#10218]
   :     +- *(1) Project [_1#11202 AS a#11209, _2#11203 AS b#11210, _3#11204 AS c#11211]
   :        +- *(1) LocalTableScan [_1#11202, _2#11203, _3#11204]
   +- *(4) Sort [a#11225 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(a#11225, 5), false, [id=#10223]
         +- *(3) Project [_1#11218 AS a#11225, _2#11219 AS b#11226, _3#11220 AS c#11227]
            +- *(3) LocalTableScan [_1#11218, _2#11219, _3#11220]
// The current approach creates an additional AppendColumns node and repartitions the data again.
val df4 = df1.groupByKey(r => r.getInt(0)).cogroup(df2.groupByKey(r => r.getInt(0))) {
  case (key, data1, data2) =>
    data1.zip(data2).map { p =>
      p._1.getInt(2) + p._2.getInt(2)
    }
}
== Physical Plan ==
*(7) SerializeFromObject [input[0, int, false] AS value#11257]
+- CoGroup org.apache.spark.sql.DataFrameSuite$$Lambda$4933/1381027007@37171997, value#11252: int, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [value#11252], [value#11254], [a#11209, b#11210, c#11211], [a#11225, b#11226, c#11227], obj#11256: int
   :- *(3) Sort [value#11252 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(value#11252, 5), true, [id=#10302]
   :     +- AppendColumns org.apache.spark.sql.DataFrameSuite$$Lambda$4930/1952919534@7ce07f47, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [input[0, int, false] AS value#11252]
   :        +- *(2) Sort [a#11209 ASC NULLS FIRST], false, 0
   :           +- Exchange hashpartitioning(a#11209, 5), false, [id=#10297]
   :              +- *(1) Project [_1#11202 AS a#11209, _2#11203 AS b#11210, _3#11204 AS c#11211]
   :                 +- *(1) LocalTableScan [_1#11202, _2#11203, _3#11204]
   +- *(6) Sort [value#11254 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(value#11254, 5), true, [id=#10312]
         +- AppendColumns org.apache.spark.sql.DataFrameSuite$$Lambda$4932/1526528849@1f0e0c1f, createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [input[0, int, false] AS value#11254]
            +- *(5) Sort [a#11225 ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(a#11225, 5), false, [id=#10307]
                  +- *(4) Project [_1#11218 AS a#11225, _2#11219 AS b#11226, _3#11220 AS c#11227]
                     +- *(4) LocalTableScan [_1#11218, _2#11219, _3#11220]

Does this PR introduce any user-facing change?

Yes, this adds a new as API to RelationalGroupedDataset. Users can use it to create a KeyValueGroupedDataset and do cogroup.
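
For illustration, a minimal usage sketch of the new API (hedged: this assumes the two-type-parameter as[K, T] shape discussed later in this thread; the snippets above use a single-parameter as[K] from an earlier iteration of the patch):

import org.apache.spark.sql.SparkSession

object GroupByAsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("groupBy-as").getOrCreate()
    import spark.implicits._

    // Same shape as the df1 example above: partitioned and sorted by "a".
    val df1 = Seq((1, 2, 3), (2, 3, 4)).toDF("a", "b", "c")
      .repartition($"a").sortWithinPartitions("a")

    // Key the grouped data by the existing column "a"; a later cogroup can then
    // reuse df1's partitioning instead of appending a new key column.
    val grouped = df1.groupBy($"a").as[Int, (Int, Int, Int)]

    spark.stop()
  }
}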

How was this patch tested?

Unit tests.

@viirya (Member, Author) commented Nov 13, 2019

cc @cloud-fan @HyukjinKwon @hagerf

@SparkQA commented Nov 14, 2019

Test build #113733 has finished for PR 26509 at commit 61b1947.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class RelationalGroupedDataset[T] protected[sql](

@viirya (Member, Author) commented Nov 14, 2019

retest this please.

@SparkQA commented Nov 14, 2019

Test build #113744 has finished for PR 26509 at commit 61b1947.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class RelationalGroupedDataset[T] protected[sql](

@SparkQA commented Nov 14, 2019

Test build #113748 has finished for PR 26509 at commit c5f2e26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review thread on the diff:

  @Stable
 -class RelationalGroupedDataset protected[sql](
 -    private[sql] val df: DataFrame,
 +class RelationalGroupedDataset[T] protected[sql](
Contributor:

This is a stable API; is it OK to add a type parameter? @srowen @dongjoon-hyun

@dongjoon-hyun (Member) Nov 14, 2019:

Although the goal seems to be extending Dataset[Row] to Dataset[T] in 3.0.0, I'm not sure about this approach. Besides making the class generic, line 51 also changes df: DataFrame to ds: Dataset[T]. If there are third-party classes extending this with an override, the variable-name and type change will break them.

@viirya, do we need this change for your original goal, Add API ...?

Member:

BTW, it seems that MiMa doesn't complain about this change?

Contributor:

Another option is to add the new API as .as[(K, T)]. It's not ideal, since T would ideally already be known when we create the RelationalGroupedDataset, but it avoids changing the stable API.

Member:

I don't think it's a binary-incompatible change, because of type erasure, so MiMa doesn't flag it. However, it will indeed probably not be source-compatible. We should only do it if there's a pretty necessary reason in 3.0.

Member (Author):

@dongjoon-hyun > Do we need this change for your original goal, Add API ...?

The change from df: DataFrame to ds: Dataset[T] is because we need the type information. For a DataFrame, we have no idea about the original object type.

Member (Author):

.as[(K, T)] sounds good, so we don't need to change the stable API.

Contributor:

This would break source compatibility.

Member (Author):

If we just add as[K, T] and do not add a type parameter to the RelationalGroupedDataset class, does it still break source compatibility?
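
To see why the method-level form is gentler on source compatibility, here is a small self-contained sketch with stub classes (all names are hypothetical stand-ins, not Spark's):

// Stub stand-ins, only to show the compatibility point.
trait Encoder[A]
class KeyValueGroupedDataset[K, V]

class Grouped { // stands in for RelationalGroupedDataset, still non-generic
  // The new method is generic, but the class signature itself is unchanged.
  def as[K: Encoder, T: Encoder]: KeyValueGroupedDataset[K, T] =
    new KeyValueGroupedDataset[K, T]
}

object CompatCheck {
  // Pre-existing user code that names the type keeps compiling; a class-level
  // parameter (class Grouped[T]) would have broken this declaration.
  val g: Grouped = new Grouped
}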

Review thread on the diff:

val additionalCols = aliasedGrps.filter(g => !df.logicalPlan.outputSet.contains(g.toAttribute))
val qe = Dataset.ofRows(
  df.sparkSession,
  Project(df.logicalPlan.output ++ additionalCols, df.logicalPlan)).queryExecution
Contributor:

This seems inefficient. Can we make KeyValueGroupedDataset.groupingAttributes a Seq[Expression]?

Member (Author):

Will it also break source compatibility?

Contributor:

groupingAttributes is private in KeyValueGroupedDataset.

Member (Author):

KeyValueGroupedDataset does not produce grouping attributes; they still come from the given queryExecution.

If we change groupingAttributes to Seq[NamedExpression], we still need to add this Project to produce the grouping attributes.
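
To make the two cases that filter distinguishes concrete, a hedged user-level sketch (it assumes the as[K, T] form of the new API; the comments paraphrase the diff above):

import org.apache.spark.sql.SparkSession

object GroupingProjectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("grouping").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 2, 3), (2, 3, 4)).toDF("a", "b", "c")

    // "a" is already an output attribute of df's plan, so the filter drops it
    // and no extra column is appended.
    val byColumn = df.groupBy($"a").as[Int, (Int, Int, Int)]

    // "a % 2" is a derived expression: it is not in df's outputSet, so it is
    // kept, and the Project materializes it as an attribute that can then
    // serve as the grouping attribute.
    val byParity = df.groupBy(($"a" % 2).as("parity")).as[Int, (Int, Int, Int)]

    spark.stop()
  }
}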

Contributor:

Why do we need the grouping attributes? AFAIK they're used to specify the required distribution, which doesn't have to be attributes.

@viirya (Member, Author) Nov 18, 2019:

For example, we use groupingAttributes to construct MapGroups in flatMapGroups. These should be attributes so the UnresolvedDeserializer can be resolved.
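
As a user-level illustration of that path (standard Dataset API, nothing specific to this PR; the comment paraphrases the resolution step described above):

import org.apache.spark.sql.SparkSession

object MapGroupsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("map-groups").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 2, 3), (2, 3, 4)).toDF("a", "b", "c")

    // The Int key handed to the function is deserialized from the grouping
    // attributes of the plan; that deserializer starts out as an
    // UnresolvedDeserializer and is resolved against those attributes.
    val sizes = df.groupByKey(r => r.getInt(0))
      .flatMapGroups((key, rows) => Iterator((key, rows.size)))

    sizes.show()
    spark.stop()
  }
}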

Contributor:

We could make MapGroupsExec follow aggregate: compute the grouping columns ahead of time, put them in a buffer row, and then run the key/value deserializers on that buffer row.

But I admit it's a big change; we can do it in the future.

@rxin (Contributor) commented Nov 14, 2019

This is going to break source compatibility in major ways. Doesn't make sense to do it this way.

@SparkQA commented Nov 15, 2019

Test build #113856 has finished for PR 26509 at commit 576558d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class RelationalGroupedDataset protected[sql](

@viirya (Member, Author) commented Nov 15, 2019

retest this please.

@SparkQA commented Nov 15, 2019

Test build #113860 has finished for PR 26509 at commit 576558d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class RelationalGroupedDataset protected[sql](

@SparkQA commented Nov 20, 2019

Test build #114122 has finished for PR 26509 at commit 04aa387.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 20, 2019

Test build #114137 has finished for PR 26509 at commit 5b0923c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:
retest this please

@SparkQA commented Nov 20, 2019

Test build #114144 has finished for PR 26509 at commit 5b0923c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment:

+1, LGTM. Merged to master.
Thank you all!

@hagerf commented Nov 22, 2019

Thanks a lot all for solving my issue, especially @viirya 🙌

@HyukjinKwon (Member) left a comment:

LGTM too. Sorry for the late response.

@viirya viirya deleted the SPARK-29427-2 branch December 27, 2023 18:23