Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Dec 26, 2016

What changes were proposed in this pull request?

Currently nondeterministic expressions are allowed in Aggregate(see the comment), but the PullOutNondeterministic analyzer rule failed to handle Aggregate, this PR fixes it.

close #16379

There is still one remaining issue: SELECT a + rand() FROM t GROUP BY a + rand() is not allowed, because the 2 rand() are different(we generate random seed as the default seed for rand()). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue.

How was this patch tested?

a new test suite

@cloud-fan
Copy link
Contributor Author

cc @rxin @gatorsmile

@SparkQA
Copy link

SparkQA commented Dec 26, 2016

Test build #70597 has finished for PR 16404 at commit 67f032b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait Nondeterministic extends LeafExpression
  • abstract class RDG extends LeafExpression with ExpectsInputTypes with Nondeterministic
  • case class Rand(seedExpr: Expression) extends RDG
  • case class Randn(seedExpr: Expression) extends RDG

@SparkQA
Copy link

SparkQA commented Dec 26, 2016

Test build #70603 has finished for PR 16404 at commit 2ba6469.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait Nondeterministic extends LeafExpression
  • abstract class RDG extends LeafExpression with ExpectsInputTypes with Nondeterministic
  • case class Rand(seedExpr: Expression) extends RDG
  • case class Randn(seedExpr: Expression) extends RDG

@SparkQA
Copy link

SparkQA commented Dec 27, 2016

Test build #70617 has finished for PR 16404 at commit f145188.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


private def getNondeterToAttr(exprs: Seq[Expression]): Map[Expression, NamedExpression] = {
exprs.filterNot(_.deterministic).flatMap { expr =>
val leafNondeterministic = expr.collect { case n: Nondeterministic => n }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all the non-deterministic expressions extend Nondeterministic. This might not cover all the cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. this is same with the previous behavior
  2. according to the variable name: leafNondeterministic, it seems reasonable to collect Nondeterministic here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rule is running once. Thus, it should be safe; otherwise, it might generate many useless Project when some expressions are not extending Nondeterministic but its deterministics is false.

Maybe added a nagative test case for this check

sql(s"CREATE TEMPORARY FUNCTION statefulUDF AS '${classOf[StatefulUDF].getName}'")
val df = Seq((1, 1)).toDF("a", "b")
df.createOrReplaceTempView("data")
sql("select a, statefulUDF(), sum(b) from data group by a, 2").show()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it same with the existing test? select a, rand(0), sum(b) from data group by a, 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

statefulUDF() is a stateful/non-deterministic UDF which does not exend Nondeterministic, but its deterministic is equal to false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we need to add a new API, here we wanna get non-deterministic leaf nodes, and trait Nondeterministic is not suitable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this problem was already there, let's send a new PR to fix it.

// todo: It's hard to write a general rule to pull out nondeterministic expressions
// from LogicalPlan, currently we only do it for UnaryNode which has same output
// schema with its child.
case p: UnaryNode if p.output == p.child.output && p.expressions.exists(!_.deterministic) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question. What is the reason why we need to have such a condition p.output == p.child.output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to narrow down the scope of the affected operators, but ideally we should use a white-list

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I see

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Dec 30, 2016

Test build #70753 has finished for PR 16404 at commit f145188.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 30, 2016

Test build #70754 has finished for PR 16404 at commit f145188.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM cc @rxin

@gatorsmile
Copy link
Member

Found a bug filed in a JIRA https://issues.apache.org/jira/browse/SPARK-19035. This PR does not resolves it.

@gatorsmile
Copy link
Member

sql("select a + rand() from testData2 group by a, a + rand()").explain(true)

After we merging this PR, I am afraid we might hitting a common misunderstanding. Users might assume a + rand() in the aggregate and group by are referring to the same expression. However, the non-deterministic expressions rand() in aggregate and group by actually return different results. See the plan

(CAST(a AS DOUBLE) + rand(-8319334304660323190)): double
Aggregate [a#23, (cast(a#23 as double) + _nondeterministic#249)], [(cast(a#23 as double) + rand(-8319334304660323190)) AS (CAST(a AS DOUBLE) + rand(-8319334304660323190))#248]
+- Project [a#23, b#24, rand(-993354445377164024) AS _nondeterministic#249]
   +- SubqueryAlias testdata2, `testData2`
      +- SerializeFromObject [assertnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true], top level Product input object), - root class: "org.apache.spark.sql.test.SQLTestData.TestData2").a AS a#23, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true], top level Product input object), - root class: "org.apache.spark.sql.test.SQLTestData.TestData2").b AS b#24]
         +- ExternalRDD [obj#22]

@cloud-fan
Copy link
Contributor Author

How do other databases handle this case? Do they forbid using non-deterministic expressions in GROUP BY, or give a better error message?

@cloud-fan cloud-fan closed this Jan 4, 2017
@cloud-fan cloud-fan reopened this Jan 4, 2017
@gatorsmile
Copy link
Member

DB2 has such a limit. See the error message SQL -583: http://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00583n.html

The routine (function or method) or expression is defined as non-deterministic or as having external action. This is not supported in the context in which it is used. The contexts in which these are not valid are:

in an expression of a GROUP BY clause

It documents the same workaround:

Remove the non-deterministic or external action routine or expression from the GROUP BY clause. If grouping is desired on a column of the result that is based on a non-deterministic or external action routine or expression use a nested table expression or a common table expression to first provide a result table with the expression as a column of the result.

@gatorsmile
Copy link
Member

Oracle allows it. It sounds like they treat (username || dbms_random.string('a', 10)) in aggregate and group-by as the same expression.

SQL> select (username || dbms_random.string('a', 10)) from all_users group by (username || dbms_random.string('a', 10));

(USERNAME||DBMS_RANDOM.STRING('A',10))
--------------------------------------------------------------------------------
APEX_040000cklbMYhekl
FLOWS_FILESVmTbIIeiUs
CTXSYSPmgqeRFPry
SYSTEMxQLrzXxHth
XDBRRTfatsLlU
SYSoLDWRKMvlZ
XS$NULLXAaOykZCDH
APEX_PUBLIC_USERvcLswvpbcw
ANONYMOUSgupWiktQKh
OUTLNjLdKOTZoFI
MDSYSxEOhwTwQqa

(USERNAME||DBMS_RANDOM.STRING('A',10))
--------------------------------------------------------------------------------
HRkovpxQztYU

12 rows selected.

If I change the order, I got the error:

SQL> select (dbms_random.string('a', 10) || username) from all_users group by (username || dbms_random.string('a', 10))
  2  ;
select (dbms_random.string('a', 10) || username) from all_users group by (username || dbms_random.string('a', 10))
                                       *
ERROR at line 1:
ORA-00979: not a GROUP BY expression

@gatorsmile
Copy link
Member

gatorsmile commented Jan 4, 2017

MySQL 5.7 treats them differently...

mysql> select c1, concat(rand(), c1) from t1 group by c1;
+------+----------------------+
| c1   | concat(rand(), c1)   |
+------+----------------------+
|    1 | 0.084388771172974981 |
|    3 | 0.116890648488784823 |
+------+----------------------+
2 rows in set (0.00 sec)

mysql> select c1, concat(rand(), c1) from t1 group by c1, concat(rand(), c1);
+------+----------------------+
| c1   | concat(rand(), c1)   |
+------+----------------------+
|    1 | 0.16241911441313021  |
|    1 | 0.461423657332941551 |
|    3 | 0.81986097415896223  |
+------+----------------------+
3 rows in set (0.00 sec)

MySQL 5.5 is even more crazy... It allows the following query:

mysql> select c1, rand() from t1 group by rand();
+------+---------------------+
| c1   | rand()              |
+------+---------------------+
|    2 | 0.10147774974021852 |
|    1 | 0.15563200614939632 |
|    1 | 0.22556077058013602 |
+------+---------------------+
3 rows in set (0.00 sec)

@tejasapatil
Copy link
Contributor

Presto allows it and uses the same value of rand() computed during projection for doing aggregation. Internally, every aggregation column is hashed and hash values are appended to the intermediate row. These hashes are used for local aggregation and final aggregation after a shuffle.

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Jan 12, 2017

postgres handles rand in aggregate specially

cloud=# select random(), random();
      random       |      random       
-------------------+-------------------
 0.627067061141133 | 0.324854184873402
(1 row)

cloud=# select random(), random() from (select 1) t group by random();
      random       |      random       
-------------------+-------------------
 0.459220345597714 | 0.459220345597714
(1 row)

cloud=# select a + random(), random() from (select 1 as a) t group by a + random();
     ?column?     |      random       
------------------+-------------------
 1.38031229563057 | 0.274804535787553
(1 row)

Hive has the same behavior.
I think we can follow it, to use a fixed seed for rand in Aggregate's grouping expressions

@cloud-fan
Copy link
Contributor Author

how about we fix this in follow-up PR? Looks like the fix is not trivial.

@cloud-fan
Copy link
Contributor Author

retest this please

@gatorsmile
Copy link
Member

It sounds like different RDBMS have different behaviors. Have we decided which way we should follow?

@SparkQA
Copy link

SparkQA commented Jan 12, 2017

Test build #71244 has finished for PR 16404 at commit f145188.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

Since Hive and postgres have same behavior on this, I'd like to follow them.

@gatorsmile
Copy link
Member

Agree. Oracle behaves the same.

@gatorsmile
Copy link
Member

gatorsmile commented Jan 12, 2017

I am fine to merge this at first, as long as we can fix it before the release.

@cloud-fan
Copy link
Contributor Author

also cc @rxin to take a look

@rxin
Copy link
Contributor

rxin commented Jan 12, 2017

LGTM on the behavior

@rxin
Copy link
Contributor

rxin commented Jan 12, 2017

Make sure you update the pull request and jira ticket description before you merge.

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Jan 12, 2017

updated. Merging to master/2.1/2.0!

@asfgit asfgit closed this in 871d266 Jan 12, 2017
asfgit pushed a commit that referenced this pull request Jan 12, 2017
## What changes were proposed in this pull request?

Currently nondeterministic expressions are allowed in `Aggregate`(see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`, this PR fixes it.

close #16379

There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` are different(we generate random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue.

## How was this patch tested?

a new test suite

Author: Wenchen Fan <[email protected]>

Closes #16404 from cloud-fan/groupby.

(cherry picked from commit 871d266)
Signed-off-by: Wenchen Fan <[email protected]>
asfgit pushed a commit that referenced this pull request Jan 12, 2017
## What changes were proposed in this pull request?

Currently nondeterministic expressions are allowed in `Aggregate`(see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`, this PR fixes it.

close #16379

There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` are different(we generate random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue.

## How was this patch tested?

a new test suite

Author: Wenchen Fan <[email protected]>

Closes #16404 from cloud-fan/groupby.

(cherry picked from commit 871d266)
Signed-off-by: Wenchen Fan <[email protected]>
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

Currently nondeterministic expressions are allowed in `Aggregate`(see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`, this PR fixes it.

close apache#16379

There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` are different(we generate random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue.

## How was this patch tested?

a new test suite

Author: Wenchen Fan <[email protected]>

Closes apache#16404 from cloud-fan/groupby.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?

Currently nondeterministic expressions are allowed in `Aggregate`(see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`, this PR fixes it.

close apache#16379

There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` are different(we generate random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue.

## How was this patch tested?

a new test suite

Author: Wenchen Fan <[email protected]>

Closes apache#16404 from cloud-fan/groupby.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants