Conversation

@davies
Contributor

@davies davies commented Dec 16, 2015

This could simplify the generated code for expressions that are not nullable.

This PR also fixes a number of bugs related to nullability.
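
As a rough sketch of the idea (illustrative only, not the actual Catalyst code; names here are made up): when a child is statically non-nullable, the generated Java can skip the isNull bookkeeping and the branch entirely, instead of relying on the compiler to fold a constant-false check:

object NullSafeCodegenSketch {
  // Emits the body of a unary expression's generated Java; `f` builds the
  // actual computation from the child's value variable.
  def genUnary(childCode: String, childIsNull: String, childValue: String,
               result: String, childNullable: Boolean)(f: String => String): String = {
    if (childNullable) {
      s"""$childCode
         |boolean ${result}IsNull = $childIsNull;
         |if (!${result}IsNull) {
         |  $result = ${f(childValue)};
         |}""".stripMargin
    } else {
      // Non-nullable child: no null check is emitted at all.
      s"""$childCode
         |$result = ${f(childValue)};""".stripMargin
    }
  }
}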

@davies
Contributor Author

davies commented Dec 16, 2015

cc @liancheng

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47843 has finished for PR 10333 at commit 88b2107.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47835 has finished for PR 10333 at commit fd4c945.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Is this branch necessary? (Not suggesting you change it.) Does the nullable path collapse correctly if left and right are non-nullable? What I mean is:

If eval1.isNull and eval2.isNull are always just false, do we get the same behavior from compiler optimizations as from this special casing?
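
To make the question concrete (a hypothetical rendering, not the actual template output): with both children non-nullable, the generic path would still emit a guard whose condition is a compile-time constant, while the special case simply omits it:

// Hypothetical shapes of the generated code being compared.
val genericPath =
  """boolean isNull = false || false;   // both children known non-nullable
    |if (!isNull) { value = left + right; }""".stripMargin

val specialCasedPath =
  "value = left + right;                // branch removed at generation time"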

Contributor Author

I think it's not necessary (in terms of performance). The compiler can do all of this, but I'm not sure how far Janino goes with constant folding.

We don't need to do this for every expression, but since UnaryExpression/BinaryExpression/TernaryExpression are used by many expressions, this change may be worth it.

Contributor

In addition to Janino, the JIT might also do more constant folding etc., which unfortunately makes it hard to tell.

@SparkQA

SparkQA commented Dec 16, 2015

Test build #2222 has finished for PR 10333 at commit e418358.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

We probably shouldn't show join keys multiple times in the result set. For LEFT/RIGHT JOIN USING queries, both PostgreSQL and MySQL show join keys only once. The ScalaDoc of this overloaded DataFrame.join method has a similar description:

  /**
   * Equi-join with another [[DataFrame]] using the given columns.
   *
   * Different from other join functions, the join columns will only appear once in the output,
   * i.e. similar to SQL's `JOIN USING` syntax.
   ...
   */

The following example comes from PostgreSQL docs (section 7.2.1.1):

CREATE TABLE t1 (num INT, name TEXT);
INSERT INTO t1 VALUES (1, 'a');
INSERT INTO t1 VALUES (2, 'b');
INSERT INTO t1 VALUES (3, 'c');

CREATE TABLE t2 (num INT, value TEXT);
INSERT INTO t2 VALUES (1, 'xxx');
INSERT INTO t2 VALUES (3, 'yyy');
INSERT INTO t2 VALUES (5, 'zzz');

SELECT * FROM t1 LEFT JOIN t2 USING (num);

PostgreSQL results in:

 num | name | value
-----+------+-------
   1 | a    | xxx
   2 | b    |
   3 | c    | yyy
(3 rows)

and MySQL results in:

+------+------+-------+
| num  | name | value |
+------+------+-------+
|    1 | a    | xxx   |
|    2 | b    | NULL  |
|    3 | c    | yyy   |
+------+------+-------+
3 rows in set (0.01 sec)
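
For comparison, the DataFrame form of the same left join (a spark-shell sketch, assuming t1 and t2 are registered as temporary tables) should likewise show num only once:

import sqlContext._
val t1 = table("t1")
val t2 = table("t2")
t1.join(t2, Seq("num"), "left_outer").show()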

Contributor

But there do exist bugs other than the nullability issue in the original DataFrame.join method. For example, it doesn't handle full outer joins correctly. Using the same example tables mentioned above, the following spark-shell snippet

import sqlContext._
val t1 = table("t1")
val t2 = table("t2")
t1.join(t2, Seq("num"), "fullouter").show()

produces a wrong query result:

+----+----+-----+
| num|name|value|
+----+----+-----+
|   1|   a|  xxx|
|   2|   b| null|
|   3|   c|  yyy|
|null|null|  zzz|
+----+----+-----+

Here's the result from PostgreSQL:

postgres=# SELECT * FROM t1 FULL JOIN t2 USING (num);

 num | name | value
-----+------+-------
   1 | a    | xxx
   2 | b    |
   3 | c    | yyy
   5 |      | zzz
(4 rows)
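
One way to express the USING semantics by hand for the full outer case (just a sketch to illustrate the expected behavior, not the eventual fix in #10353) is to join on an explicit condition and coalesce the key columns from both sides:

import org.apache.spark.sql.functions.coalesce

t1.join(t2, t1("num") === t2("num"), "fullouter")
  .select(coalesce(t1("num"), t2("num")).as("num"), t1("name"), t2("value"))
  .show()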

Contributor Author

Will create a separate JIRA for it (and fix it).

Contributor Author

This bug will be fixed by #10353

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
	sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
@SparkQA

SparkQA commented Dec 17, 2015

Test build #47932 has finished for PR 10333 at commit 765f735.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Contributor

nongli commented Dec 17, 2015

LGTM

@SparkQA

SparkQA commented Dec 18, 2015

Test build #2227 has finished for PR 10333 at commit 9adad17.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Are we assuming that if a BinaryExpression is not nullable, its children are also not nullable?

Contributor

I think we should forbid non-nullable BinaryExpressions from calling nullSafeCodeGen, as it doesn't make sense (it passes an f that is supposed to apply only to non-null children, but that isn't guaranteed here). They should take care of null children themselves, i.e. override genCode directly.

Contributor

Maybe we can add an assert: assert(nullable || (children.forall(!_.nullable)))
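
A sketch of where such an invariant could sit (hypothetical names, not the actual Expression hierarchy):

abstract class ExprSketch {
  def nullable: Boolean
  def children: Seq[ExprSketch]

  // Suggested invariant: a non-nullable expression that relies on the
  // null-safe codegen path must not have nullable children.
  protected def checkNullSafeCodeGen(): Unit =
    assert(nullable || children.forall(!_.nullable),
      "non-nullable expression with nullable children should override genCode directly")
}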

Contributor Author

Even if left or right is nullable, the new code is still correct, as long as the old code was correct.

@SparkQA

SparkQA commented Dec 18, 2015

Test build #47986 has finished for PR 10333 at commit 9f7d763.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • abstract class ImperativeAggregate extends AggregateFunction with CodegenFallback
      • case class UnresolvedWindowExpression(
      • case class WindowExpression(
      • case class Lead(input: Expression, offset: Expression, default: Expression)
      • case class Lag(input: Expression, offset: Expression, default: Expression)
      • abstract class AggregateWindowFunction extends DeclarativeAggregate with WindowFunction
      • abstract class RowNumberLike extends AggregateWindowFunction
      • trait SizeBasedWindowFunction extends AggregateWindowFunction
      • case class RowNumber() extends RowNumberLike
      • case class CumeDist() extends RowNumberLike with SizeBasedWindowFunction
      • case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindowFunction
      • abstract class RankLike extends AggregateWindowFunction
      • case class Rank(children: Seq[Expression]) extends RankLike
      • case class DenseRank(children: Seq[Expression]) extends RankLike
      • case class PercentRank(children: Seq[Expression]) extends RankLike with SizeBasedWindowFunction

@SparkQA

SparkQA commented Dec 18, 2015

Test build #47987 has finished for PR 10333 at commit 3b1e42f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 18, 2015

Test build #47993 has finished for PR 10333 at commit 3cc4cdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

If a TernaryExpression is nullable, we currently always generate 3 nested if branches. But we still have a chance to remove some of those branches when some children are non-nullable; how about doing this optimization based on the children's nullability?
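
A rough sketch of that idea (illustrative only, not Catalyst's actual generator): build the combined null check only from the children that can actually be null, so non-nullable children contribute no branch:

// Hypothetical helper: returns the guard condition for the generated code.
def nullCheckCondition(isNullVars: Seq[String], nullableFlags: Seq[Boolean]): String = {
  val guards = isNullVars.zip(nullableFlags).collect { case (v, true) => v }
  if (guards.isEmpty) "false" else guards.mkString(" || ")
}

// e.g. only the second child is nullable:
// nullCheckCondition(Seq("isNull1", "isNull2", "isNull3"), Seq(false, true, false))
// => "isNull2"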

Contributor Author

We could, but there are too many combinations.

@davies
Contributor Author

davies commented Dec 18, 2015

I'm merging this into master. There could still be some nullability bugs; we can fix them later.

@asfgit asfgit closed this in 4af647c Dec 18, 2015
@rxin
Contributor

rxin commented Dec 18, 2015

Do we have any performance numbers on this?

@davies
Contributor Author

davies commented Dec 18, 2015

@rxin I just ran a simple query:

sqlContext.range(1<<30).groupBy().sum().collect()

After this commit, the runtime went from 49.2s to 46.8s, about a 5% improvement.

