[SPARK-11451][SQL] Support single distinct count on multiple columns. #9409
Conversation
Can you add some test cases for this?
We need to check if the no-structs allowed in grouping keys policy will create a problem for us when we try to use this in a multiple distinct setting.
Force-pushed from 51c46bb to 9a959e2
ok to test
test this please
Test build #45237 has started for PR 9409 at commit
test this please
This needs to be child.eval(input).asInstanceOf[InternalRow].
Could you also add a test in ConditionalExpressionSuite; I think that would have caught this bug.
I'll fix this and add a test tomorrow morning.
So it turns out this is actually correct. The eval method of UnaryExpression will call eval on the child expression and pass the result to the nullSafeEval method if it is not null.
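For context, a simplified sketch of that behaviour (a model of the pattern, not the exact Catalyst source):
// Simplified model of the UnaryExpression pattern described above: eval
// evaluates the child first and only calls nullSafeEval when the child's
// value is non-null, so nullSafeEval never sees a null input.
trait Expr {
  def eval(input: Any): Any
}

abstract class UnaryExpr(child: Expr) extends Expr {
  override def eval(input: Any): Any = {
    val value = child.eval(input)
    if (value == null) null else nullSafeEval(value)
  }
  // Subclasses implement only the non-null case.
  protected def nullSafeEval(value: Any): Any
}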
oh, you are right. I did not notice that.
This is great. Only minor comments.
add to whitelist
ok to test
test this please
test this please
Test build #2004 has started for PR 9409 at commit
@yhuai if we combine this with the distinct rewriting rule, it will add a struct to the groupBy clause of the first aggregate. This is currently not allowed in the new UDAF path, so it'll fall back to the old path. For example:
val data2 = Seq[(Integer, Integer, Integer)](
(1, 10, -10),
(null, -60, 60),
(1, 30, -30),
(1, 30, 30),
(2, 1, 1),
(null, -10, 10),
(2, -1, null),
(2, 1, 1),
(2, null, 1),
(null, 100, -10),
(3, null, 3),
(null, null, null),
(3, null, null)).toDF("key", "value1", "value2")
data2.registerTempTable("agg2")
val q = sql(
"""
|SELECT
| key,
| count(distinct value1),
| count(distinct value2),
| count(distinct value1, value2)
|FROM agg2
|GROUP BY key
""".stripMargin)
This will create the following physical plan:
== Physical Plan ==
TungstenAggregate(key=[key#3], functions=[(count(if ((gid#44 = 1)) attributereference#45 else null),mode=Final,isDistinct=false),(count(if ((gid#44 = 3)) attributereference#47 else null),mode=Final,isDistinct=false),(count(if ((gid#44 = 2)) dropanynull#46 else null),mode=Final,isDistinct=false)], output=[key#3,_c1#32L,_c2#33L,_c3#34L])
TungstenExchange(Shuffle without coordinator) hashpartitioning(key#3,200), None
TungstenAggregate(key=[key#3], functions=[(count(if ((gid#44 = 1)) attributereference#45 else null),mode=Partial,isDistinct=false),(count(if ((gid#44 = 3)) attributereference#47 else null),mode=Partial,isDistinct=false),(count(if ((gid#44 = 2)) dropanynull#46 else null),mode=Partial,isDistinct=false)], output=[key#3,count#49L,count#53L,count#51L])
Aggregate false, [key#3,attributereference#45,dropanynull#46,attributereference#47,gid#44], [key#3,attributereference#45,dropanynull#46,attributereference#47,gid#44]
ConvertToSafe
TungstenExchange(Shuffle without coordinator) hashpartitioning(key#3,attributereference#45,dropanynull#46,attributereference#47,gid#44,200), None
ConvertToUnsafe
Aggregate true, [key#3,attributereference#45,dropanynull#46,attributereference#47,gid#44], [key#3,attributereference#45,dropanynull#46,attributereference#47,gid#44]
!Expand [List(key#3, value1#4, null, null, 1),List(key#3, null, dropanynull(struct(value1#4,value2#5)), null, 2),List(key#3, null, null, value2#5, 3)], [key#3,attributereference#45,dropanynull#46,attributereference#47,gid#44]
LocalTableScan [key#3,value1#4,value2#5], [[1,10,-10],[null,-60,60],[1,30,-30],[1,30,30],[2,1,1],[null,-10,10],[2,-1,null],[2,1,1],[2,null,1],[null,100,-10],[3,null,3],[null,null,null],[3,null,null]]
Is it possible to add support for fixed-width structs as group-by expressions in the new aggregation path?
Quick follow-up.
Allowing structs does not seem to create a problem. I disabled this line locally: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Utils.scala#L36. And now it uses the TungstenAggregate.
oh, yes. Based on https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L240-L266, we can compare two struct values. It looks like only array and map types are not handled there. So, I think we can visit all data types of a struct and, if it does not contain an array or a map, use the new agg code path. Can you update Utils.scala? I am also thinking that if an array or a map appears in the grouping expressions, we should throw an analysis error and say it is not allowed right now.
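A rough sketch of that check (the helper name is mine, not something that exists in Utils.scala):
import org.apache.spark.sql.types._

// Recursively walk a data type: allow structs whose fields are all
// supported, but reject anything containing an array or a map.
def isGroupable(dataType: DataType): Boolean = dataType match {
  case _: ArrayType | _: MapType => false
  case StructType(fields) => fields.forall(f => isGroupable(f.dataType))
  case _ => true
}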
Added (proper) StructType checking.
Do you want me to also start throwing AnalysisErrors?
I can make the change to throw an analysis error in my PR.
Force-pushed from 8538e11 to ae26526
test this please
test this please
retest this please
hmm... seems jenkins did not pick up this pr.
Hope https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2005/ can finish without any problem.
This one is currently running: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2005/consoleFull
oh, right, I pasted the wrong link.
Seems like I have broken something. I'll need to rebase anyway.
Test build #2005 has finished for PR 9409 at commit
…all test for multiple column count distinct.
Looks like the failed test exposed a problem in our rewriter? The differing results are all from regular count.
…without analysis errors).
…pressions and attributes didn't align.
Force-pushed from 5c46cec to 4e53aab
test this please
retest this please
Jenkins does not like me...
@yhuai can you get Jenkins to test this? The bug exposed by this patch affected the regular aggregation path: as soon as we used more than one regular aggregate, the chance existed that an attribute and its source expression got misaligned. This has been fixed, and I have also added a test for this situation. If we choose not to add this to the 1.6 branch, then we have to create a separate PR containing only the bugfix and get that one in.
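For reference, an illustrative query of the shape that could trigger the misalignment, reusing the agg2 table registered earlier in this thread:
// Several regular aggregates alongside a multi-column distinct count.
val q2 = sql(
  """
    |SELECT
    |  key,
    |  count(value1),
    |  sum(value2),
    |  count(DISTINCT value1, value2)
    |FROM agg2
    |GROUP BY key
  """.stripMargin)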
test this please
ok to test
add to whitelist
Not sure why Jenkins did not get triggered after you updated the PR. Let's try to get this into branch 1.6 since it is needed to remove the old agg path.
Test build #2012 has finished for PR 9409 at commit
LGTM. Merging to master and branch 1.6.
This PR adds support for multiple columns in a single count distinct aggregate to the new aggregation path. cc yhuai
Author: Herman van Hovell <[email protected]>
Closes #9409 from hvanhovell/SPARK-11451.
(cherry picked from commit 30c8ba7)
Signed-off-by: Yin Huai <[email protected]>
In #9409 we enabled multi-column counting. The approach taken in that PR introduces a bit of overhead by first creating a row only to check if all of the columns are non-null. This PR fixes that technical debt: Count now takes multiple columns as its input. In order to make this work, I have also added support for multiple columns in the single distinct code path. cc yhuai
Author: Herman van Hovell <[email protected]>
Closes #10015 from hvanhovell/SPARK-12024.
(cherry picked from commit 3d28081)
Signed-off-by: Yin Huai <[email protected]>
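As a conceptual illustration of the overhead mentioned there (a sketch of the idea only, not Spark's actual implementation; all names are mine):
// Before SPARK-12024: wrap the columns in a row, then drop the row if any
// field is null. After: check the columns directly, no intermediate row.
case class Pair(a: Any, b: Any)

def countViaStruct(rows: Seq[(Any, Any)]): Int =
  rows.map { case (a, b) => Pair(a, b) }
    .count(p => p.a != null && p.b != null)

def countMultiColumn(rows: Seq[(Any, Any)]): Int =
  rows.count { case (a, b) => a != null && b != null }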
This PR adds support for multiple columns in a single count distinct aggregate to the new aggregation path.
cc @yhuai
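For completeness, a minimal usage example of what this enables, reusing the agg2 table and column names from the discussion above:
// One count over several distinct columns; rows where the (value1, value2)
// pair contains a null column are not counted.
val counts = sql("SELECT key, count(DISTINCT value1, value2) FROM agg2 GROUP BY key")
counts.show()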