[SPARK-18969][SQL] Support grouping by nondeterministic expressions #16404

cloud-fan · 2016-12-26T10:44:10Z

What changes were proposed in this pull request?

Currently nondeterministic expressions are allowed in Aggregate(see the comment), but the PullOutNondeterministic analyzer rule failed to handle Aggregate, this PR fixes it.

close #16379

There is still one remaining issue: SELECT a + rand() FROM t GROUP BY a + rand() is not allowed, because the 2 rand() are different(we generate random seed as the default seed for rand()). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue.

How was this patch tested?

a new test suite

cloud-fan · 2016-12-26T10:44:28Z

cc @rxin @gatorsmile

SparkQA · 2016-12-26T13:04:27Z

Test build #70597 has finished for PR 16404 at commit 67f032b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait Nondeterministic extends LeafExpression
abstract class RDG extends LeafExpression with ExpectsInputTypes with Nondeterministic
case class Rand(seedExpr: Expression) extends RDG
case class Randn(seedExpr: Expression) extends RDG

SparkQA · 2016-12-26T15:09:03Z

Test build #70603 has finished for PR 16404 at commit 2ba6469.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait Nondeterministic extends LeafExpression
abstract class RDG extends LeafExpression with ExpectsInputTypes with Nondeterministic
case class Rand(seedExpr: Expression) extends RDG
case class Randn(seedExpr: Expression) extends RDG

SparkQA · 2016-12-27T03:48:05Z

Test build #70617 has finished for PR 16404 at commit f145188.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-12-27T06:50:41Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+
+    private def getNondeterToAttr(exprs: Seq[Expression]): Map[Expression, NamedExpression] = {
+      exprs.filterNot(_.deterministic).flatMap { expr =>
+        val leafNondeterministic = expr.collect { case n: Nondeterministic => n }


Not all the non-deterministic expressions extend Nondeterministic. This might not cover all the cases.

this is same with the previous behavior

according to the variable name: leafNondeterministic, it seems reasonable to collect Nondeterministic here.

This rule is running once. Thus, it should be safe; otherwise, it might generate many useless Project when some expressions are not extending Nondeterministic but its deterministics is false.

Maybe added a nagative test case for this check

sql(s"CREATE TEMPORARY FUNCTION statefulUDF AS '${classOf[StatefulUDF].getName}'") val df = Seq((1, 1)).toDF("a", "b") df.createOrReplaceTempView("data") sql("select a, statefulUDF(), sum(b) from data group by a, 2").show()

is it same with the existing test? select a, rand(0), sum(b) from data group by a, 2

https://github.com/apache/spark/pull/16404/files#diff-a0f2e45a5da747e9ec483f3557aa1b8bR138

statefulUDF() is a stateful/non-deterministic UDF which does not exend Nondeterministic, but its deterministic is equal to false

seems we need to add a new API, here we wanna get non-deterministic leaf nodes, and trait Nondeterministic is not suitable.

this problem was already there, let's send a new PR to fix it.

gatorsmile · 2016-12-27T07:30:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

      // todo: It's hard to write a general rule to pull out nondeterministic expressions
      // from LogicalPlan, currently we only do it for UnaryNode which has same output
      // schema with its child.
      case p: UnaryNode if p.output == p.child.output && p.expressions.exists(!_.deterministic) =>


Just a question. What is the reason why we need to have such a condition p.output == p.child.output?

to narrow down the scope of the affected operators, but ideally we should use a white-list

Thanks! I see

cloud-fan · 2016-12-30T14:02:08Z

retest this please

SparkQA · 2016-12-30T16:20:32Z

Test build #70753 has finished for PR 16404 at commit f145188.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-30T16:20:47Z

Test build #70754 has finished for PR 16404 at commit f145188.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-12-30T18:31:25Z

LGTM cc @rxin

gatorsmile · 2017-01-03T03:31:00Z

Found a bug filed in a JIRA https://issues.apache.org/jira/browse/SPARK-19035. This PR does not resolves it.

gatorsmile · 2017-01-03T20:25:49Z

sql("select a + rand() from testData2 group by a, a + rand()").explain(true)

After we merging this PR, I am afraid we might hitting a common misunderstanding. Users might assume a + rand() in the aggregate and group by are referring to the same expression. However, the non-deterministic expressions rand() in aggregate and group by actually return different results. See the plan

(CAST(a AS DOUBLE) + rand(-8319334304660323190)): double
Aggregate [a#23, (cast(a#23 as double) + _nondeterministic#249)], [(cast(a#23 as double) + rand(-8319334304660323190)) AS (CAST(a AS DOUBLE) + rand(-8319334304660323190))#248]
+- Project [a#23, b#24, rand(-993354445377164024) AS _nondeterministic#249]
   +- SubqueryAlias testdata2, `testData2`
      +- SerializeFromObject [assertnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true], top level Product input object), - root class: "org.apache.spark.sql.test.SQLTestData.TestData2").a AS a#23, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true], top level Product input object), - root class: "org.apache.spark.sql.test.SQLTestData.TestData2").b AS b#24]
         +- ExternalRDD [obj#22]

cloud-fan · 2017-01-04T04:22:47Z

How do other databases handle this case? Do they forbid using non-deterministic expressions in GROUP BY, or give a better error message?

gatorsmile · 2017-01-04T06:20:23Z

DB2 has such a limit. See the error message SQL -583: http://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00583n.html

The routine (function or method) or expression is defined as non-deterministic or as having external action. This is not supported in the context in which it is used. The contexts in which these are not valid are:

in an expression of a GROUP BY clause

It documents the same workaround:

Remove the non-deterministic or external action routine or expression from the GROUP BY clause. If grouping is desired on a column of the result that is based on a non-deterministic or external action routine or expression use a nested table expression or a common table expression to first provide a result table with the expression as a column of the result.

gatorsmile · 2017-01-04T06:54:52Z

Oracle allows it. It sounds like they treat (username || dbms_random.string('a', 10)) in aggregate and group-by as the same expression.

SQL> select (username || dbms_random.string('a', 10)) from all_users group by (username || dbms_random.string('a', 10));

(USERNAME||DBMS_RANDOM.STRING('A',10))
--------------------------------------------------------------------------------
APEX_040000cklbMYhekl
FLOWS_FILESVmTbIIeiUs
CTXSYSPmgqeRFPry
SYSTEMxQLrzXxHth
XDBRRTfatsLlU
SYSoLDWRKMvlZ
XS$NULLXAaOykZCDH
APEX_PUBLIC_USERvcLswvpbcw
ANONYMOUSgupWiktQKh
OUTLNjLdKOTZoFI
MDSYSxEOhwTwQqa

(USERNAME||DBMS_RANDOM.STRING('A',10))
--------------------------------------------------------------------------------
HRkovpxQztYU

12 rows selected.

If I change the order, I got the error:

SQL> select (dbms_random.string('a', 10) || username) from all_users group by (username || dbms_random.string('a', 10))
  2  ;
select (dbms_random.string('a', 10) || username) from all_users group by (username || dbms_random.string('a', 10))
                                       *
ERROR at line 1:
ORA-00979: not a GROUP BY expression

gatorsmile · 2017-01-04T07:29:45Z

MySQL 5.7 treats them differently...

mysql> select c1, concat(rand(), c1) from t1 group by c1;
+------+----------------------+
| c1   | concat(rand(), c1)   |
+------+----------------------+
|    1 | 0.084388771172974981 |
|    3 | 0.116890648488784823 |
+------+----------------------+
2 rows in set (0.00 sec)

mysql> select c1, concat(rand(), c1) from t1 group by c1, concat(rand(), c1);
+------+----------------------+
| c1   | concat(rand(), c1)   |
+------+----------------------+
|    1 | 0.16241911441313021  |
|    1 | 0.461423657332941551 |
|    3 | 0.81986097415896223  |
+------+----------------------+
3 rows in set (0.00 sec)

MySQL 5.5 is even more crazy... It allows the following query:

mysql> select c1, rand() from t1 group by rand();
+------+---------------------+
| c1   | rand()              |
+------+---------------------+
|    2 | 0.10147774974021852 |
|    1 | 0.15563200614939632 |
|    1 | 0.22556077058013602 |
+------+---------------------+
3 rows in set (0.00 sec)

tejasapatil · 2017-01-05T16:50:04Z

Presto allows it and uses the same value of rand() computed during projection for doing aggregation. Internally, every aggregation column is hashed and hash values are appended to the intermediate row. These hashes are used for local aggregation and final aggregation after a shuffle.

cloud-fan · 2017-01-12T03:28:44Z

postgres handles rand in aggregate specially

cloud=# select random(), random();
      random       |      random       
-------------------+-------------------
 0.627067061141133 | 0.324854184873402
(1 row)

cloud=# select random(), random() from (select 1) t group by random();
      random       |      random       
-------------------+-------------------
 0.459220345597714 | 0.459220345597714
(1 row)

cloud=# select a + random(), random() from (select 1 as a) t group by a + random();
     ?column?     |      random       
------------------+-------------------
 1.38031229563057 | 0.274804535787553
(1 row)

Hive has the same behavior.
I think we can follow it, to use a fixed seed for rand in Aggregate's grouping expressions

cloud-fan · 2017-01-12T03:34:08Z

how about we fix this in follow-up PR? Looks like the fix is not trivial.

cloud-fan · 2017-01-12T03:34:47Z

retest this please

gatorsmile · 2017-01-12T05:46:39Z

It sounds like different RDBMS have different behaviors. Have we decided which way we should follow?

SparkQA · 2017-01-12T06:00:56Z

Test build #71244 has finished for PR 16404 at commit f145188.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-01-12T07:44:26Z

Since Hive and postgres have same behavior on this, I'd like to follow them.

gatorsmile · 2017-01-12T08:23:55Z

Agree. Oracle behaves the same.

gatorsmile · 2017-01-12T08:24:20Z

I am fine to merge this at first, as long as we can fix it before the release.

cloud-fan · 2017-01-12T08:41:46Z

also cc @rxin to take a look

rxin · 2017-01-12T09:32:08Z

LGTM on the behavior

rxin · 2017-01-12T09:35:21Z

Make sure you update the pull request and jira ticket description before you merge.

cloud-fan · 2017-01-12T12:21:28Z

updated. Merging to master/2.1/2.0!

## What changes were proposed in this pull request? Currently nondeterministic expressions are allowed in `Aggregate`(see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`, this PR fixes it. close #16379 There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` are different(we generate random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue. ## How was this patch tested? a new test suite Author: Wenchen Fan <[email protected]> Closes #16404 from cloud-fan/groupby. (cherry picked from commit 871d266) Signed-off-by: Wenchen Fan <[email protected]>

## What changes were proposed in this pull request? Currently nondeterministic expressions are allowed in `Aggregate`(see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`, this PR fixes it. close apache#16379 There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` are different(we generate random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue. ## How was this patch tested? a new test suite Author: Wenchen Fan <[email protected]> Closes apache#16404 from cloud-fan/groupby.

cloud-fan force-pushed the groupby branch from 67f032b to 2ba6469 Compare December 26, 2016 13:33

Support grouping by nondeterministic expressions

f145188

cloud-fan force-pushed the groupby branch from 2ba6469 to f145188 Compare December 27, 2016 01:30

gatorsmile reviewed Dec 27, 2016

View reviewed changes

cloud-fan closed this Jan 4, 2017

cloud-fan reopened this Jan 4, 2017

asfgit closed this in 871d266 Jan 12, 2017

WangGuangxin mentioned this pull request Feb 16, 2022

[SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend #35460

Closed

[SPARK-18969][SQL] Support grouping by nondeterministic expressions #16404

[SPARK-18969][SQL] Support grouping by nondeterministic expressions #16404

Uh oh!

Conversation

cloud-fan commented Dec 26, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

cloud-fan commented Dec 26, 2016

Uh oh!

SparkQA commented Dec 26, 2016

Uh oh!

SparkQA commented Dec 26, 2016

Uh oh!

SparkQA commented Dec 27, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cloud-fan commented Dec 30, 2016

Uh oh!

SparkQA commented Dec 30, 2016

Uh oh!

SparkQA commented Dec 30, 2016

Uh oh!

gatorsmile commented Dec 30, 2016

Uh oh!

gatorsmile commented Jan 3, 2017

Uh oh!

gatorsmile commented Jan 3, 2017

Uh oh!

cloud-fan commented Jan 4, 2017

Uh oh!

gatorsmile commented Jan 4, 2017

Uh oh!

gatorsmile commented Jan 4, 2017

Uh oh!

gatorsmile commented Jan 4, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tejasapatil commented Jan 5, 2017

Uh oh!

cloud-fan commented Jan 12, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cloud-fan commented Jan 12, 2017

Uh oh!

cloud-fan commented Jan 12, 2017

Uh oh!

gatorsmile commented Jan 12, 2017

Uh oh!

SparkQA commented Jan 12, 2017

Uh oh!

cloud-fan commented Jan 12, 2017

Uh oh!

gatorsmile commented Jan 12, 2017

Uh oh!

gatorsmile commented Jan 12, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cloud-fan commented Dec 26, 2016 •

edited

Loading

gatorsmile commented Jan 4, 2017 •

edited

Loading

cloud-fan commented Jan 12, 2017 •

edited

Loading

gatorsmile commented Jan 12, 2017 •

edited

Loading

cloud-fan commented Jan 12, 2017 •

edited

Loading