-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-12054] [SQL] Consider nullability of expression in codegen #10333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @liancheng |
|
Test build #47843 has finished for PR 10333 at commit
|
|
Test build #47835 has finished for PR 10333 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this branch necessary? (not suggesting you change it) but does the nullable path collapse correctly if left and right are non nullable? What I mean is:
if eval1.isNull and eval2.isNull is always just false, do we get the same behavior as this special casing from the compiler optimizations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's not necessary (in terms of performance). Compiler can do all these, but not sure how far Janino had achieved on constant folding.
We don't need to do this for every expression, but since UnaryExpression/BinaryExpression/TernaryExpression are used by many, this changes may worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to Janino the JIT might also do more constant folding etc, which makes it hard to tell unfortunately.
|
Test build #2222 has finished for PR 10333 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably shouldn't show join keys multiple times in the result set. For LEFT/RIGHT JOIN USING queries, both PostgreSQL and MySQL show join keys only once. ScalaDoc of this overloaded DataFrame.join method also has similar description:
/**
* Equi-join with another [[DataFrame]] using the given columns.
*
* Different from other join functions, the join columns will only appear once in the output,
* i.e. similar to SQL's `JOIN USING` syntax.
...
*/The following example comes from PostgreSQL docs (section 7.2.1.1):
CREATE TABLE t1 (num INT, name TEXT);
INSERT INTO t1 VALUES (1, 'a');
INSERT INTO t1 VALUES (2, 'b');
INSERT INTO t1 VALUES (3, 'c');
CREATE TABLE t2 (num INT, value TEXT);
INSERT INTO T2 VALUES (1, 'xxx');
INSERT INTO t2 VALUES (3, 'yyy');
INSERT INTO t2 VALUES (5, 'zzz');
SELECT * FROM t1 LEFT JOIN t2 USING (num);PostgreSQL results in:
num | name | value
-----+------+-------
1 | a | xxx
2 | b |
3 | c | yyy
(3 rows)
and MySQL results in:
+------+------+-------+
| num | name | value |
+------+------+-------+
| 1 | a | xxx |
| 2 | b | NULL |
| 3 | c | yyy |
+------+------+-------+
3 rows in set (0.01 sec)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But there does exist bugs other than the nullability issue in the original DataFrame.join method. For example it doesn't handle full outer join correctly. Using the same example tables mentioned above, the following spark-shell snippet
import sqlContext._
val t1 = table("t1")
val t2 = table("t2")
t1.join(t2, Seq("num"), "fullouter").show()produces wrong query result:
+----+----+-----+
| num|name|value|
+----+----+-----+
| 1| a| xxx|
| 2| b| null|
| 3| c| yyy|
|null|null| zzz|
+----+----+-----+
Here's the result from PostgreSQL:
postgres=# SELECT * FROM t1 FULL JOIN t2 USING (num);
num | name | value
-----+------+-------
1 | a | xxx
2 | b |
3 | c | yyy
5 | | zzz
(4 rows)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will create seperate JIRA for it (and fix it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bug will be fixed by #10353
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
|
Test build #47932 has finished for PR 10333 at commit
|
|
LGTM |
|
Test build #2227 has finished for PR 10333 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we assuming if a BinaryExpression is not nullable, its children are also not nullable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should forbid non-nullable BinaryExpression to call nullSafeCodeGen as it doesn't make sense(passing a f that supposed to only apply to not-null children, but actually it isn't.), and they should take care of null children themselves, i.e. override genCode directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can add an assert: assert(nullable || (children.forall(!_.nullable)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even left or right is nullable, the new code is still correct, if the old code is correct.
|
Test build #47986 has finished for PR 10333 at commit
|
|
Test build #47987 has finished for PR 10333 at commit
|
|
Test build #47993 has finished for PR 10333 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if a TernaryExpression is nullable, currently we will always generate 3 nested if branches. But we still have chance to remove some if branches if some children are non-nullable, how about doing this optimization based on children's nullability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but it have too much combinations.
|
I'm merging this into master. There could be still some bugs about nullability, we could fix them later. |
|
Do we have any performance numbers on this? |
|
@rxin Just ran a simple query as After this commit, the runtime went to 46.8s from 49.2s, about 5% improvement. |
This could simplify the generated code for expressions that is not nullable.
This PR fix lots of bugs about nullability.