Conversation

@davies
Contributor

@davies davies commented Dec 16, 2015

This could simplify the generated code for expressions that are not nullable.

This PR also fixes a number of bugs related to nullability.
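
As a rough sketch of the idea (illustrative only, not the actual Catalyst code; names here are made up): when a child is statically non-nullable, the generated Java can skip the isNull bookkeeping and the branch entirely, instead of relying on the compiler to fold a constant-false check:

object NullSafeCodegenSketch {
  // Emits the body of a unary expression's generated Java; `f` builds the
  // actual computation from the child's value variable.
  def genUnary(childCode: String, childIsNull: String, childValue: String,
               result: String, childNullable: Boolean)(f: String => String): String = {
    if (childNullable) {
      s"""$childCode
         |boolean ${result}IsNull = $childIsNull;
         |if (!${result}IsNull) {
         |  $result = ${f(childValue)};
         |}""".stripMargin
    } else {
      // Non-nullable child: no null check is emitted at all.
      s"""$childCode
         |$result = ${f(childValue)};""".stripMargin
    }
  }
}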

@davies
Contributor Author

davies commented Dec 16, 2015

cc @liancheng

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47843 has finished for PR 10333 at commit 88b2107.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47835 has finished for PR 10333 at commit fd4c945.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Is this branch necessary? (Not suggesting you change it.) Does the nullable path collapse correctly if left and right are non-nullable? What I mean is:

If eval1.isNull and eval2.isNull are always just false, do we get the same behavior from compiler optimizations as from this special casing?
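
To make the question concrete (a hypothetical rendering, not the actual template output): with both children non-nullable, the generic path would still emit a guard whose condition is a compile-time constant, while the special case simply omits it:

// Hypothetical shapes of the generated code being compared.
val genericPath =
  """boolean isNull = false || false;   // both children known non-nullable
    |if (!isNull) { value = left + right; }""".stripMargin

val specialCasedPath =
  "value = left + right;                // branch removed at generation time"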

Contributor Author

I think it's not necessary (in terms of performance). The compiler can do all of this, but I'm not sure how far Janino goes with constant folding.

We don't need to do this for every expression, but since UnaryExpression/BinaryExpression/TernaryExpression are used by many expressions, this change may be worth it.

Contributor

In addition to Janino, the JIT might also do more constant folding etc., which unfortunately makes it hard to tell.

@SparkQA

SparkQA commented Dec 16, 2015

Test build #2222 has finished for PR 10333 at commit e418358.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

We probably shouldn't show join keys multiple times in the result set. For LEFT/RIGHT JOIN USING queries, both PostgreSQL and MySQL show join keys only once. The ScalaDoc of this overloaded DataFrame.join method has a similar description:

  /**
   * Equi-join with another [[DataFrame]] using the given columns.
   *
   * Different from other join functions, the join columns will only appear once in the output,
   * i.e. similar to SQL's `JOIN USING` syntax.
   ...
   */

The following example comes from PostgreSQL docs (section 7.2.1.1):

CREATE TABLE t1 (num INT, name TEXT);
INSERT INTO t1 VALUES (1, 'a');
INSERT INTO t1 VALUES (2, 'b');
INSERT INTO t1 VALUES (3, 'c');

CREATE TABLE t2 (num INT, value TEXT);
INSERT INTO t2 VALUES (1, 'xxx');
INSERT INTO t2 VALUES (3, 'yyy');
INSERT INTO t2 VALUES (5, 'zzz');

SELECT * FROM t1 LEFT JOIN t2 USING (num);

PostgreSQL results in:

 num | name | value
-----+------+-------
   1 | a    | xxx
   2 | b    |
   3 | c    | yyy
(3 rows)

and MySQL results in:

+------+------+-------+
| num  | name | value |
+------+------+-------+
|    1 | a    | xxx   |
|    2 | b    | NULL  |
|    3 | c    | yyy   |
+------+------+-------+
3 rows in set (0.01 sec)
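
For comparison, the DataFrame form of the same left join (a spark-shell sketch, assuming t1 and t2 are registered as temporary tables) should likewise show num only once:

import sqlContext._
val t1 = table("t1")
val t2 = table("t2")
t1.join(t2, Seq("num"), "left_outer").show()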

Contributor

But there do exist bugs other than the nullability issue in the original DataFrame.join method. For example, it doesn't handle full outer joins correctly. Using the same example tables mentioned above, the following spark-shell snippet

import sqlContext._
val t1 = table("t1")
val t2 = table("t2")
t1.join(t2, Seq("num"), "fullouter").show()

produces a wrong query result:

+----+----+-----+
| num|name|value|
+----+----+-----+
|   1|   a|  xxx|
|   2|   b| null|
|   3|   c|  yyy|
|null|null|  zzz|
+----+----+-----+

Here's the result from PostgreSQL:

postgres=# SELECT * FROM t1 FULL JOIN t2 USING (num);

 num | name | value
-----+------+-------
   1 | a    | xxx
   2 | b    |
   3 | c    | yyy
   5 |      | zzz
(4 rows)
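
One way to express the USING semantics by hand for the full outer case (just a sketch to illustrate the expected behavior, not the eventual fix in #10353) is to join on an explicit condition and coalesce the key columns from both sides:

import org.apache.spark.sql.functions.coalesce

t1.join(t2, t1("num") === t2("num"), "fullouter")
  .select(coalesce(t1("num"), t2("num")).as("num"), t1("name"), t2("value"))
  .show()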

Contributor Author

Will create a separate JIRA for it (and fix it).

Contributor Author

This bug will be fixed by #10353

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
	sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala
@SparkQA

SparkQA commented Dec 17, 2015

Test build #47932 has finished for PR 10333 at commit 765f735.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Contributor

nongli commented Dec 17, 2015

LGTM

@SparkQA

SparkQA commented Dec 18, 2015

Test build #2227 has finished for PR 10333 at commit 9adad17.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Are we assuming that if a BinaryExpression is not nullable, its children are also not nullable?

Contributor

I think we should forbid non-nullable BinaryExpressions from calling nullSafeCodeGen, as it doesn't make sense (it passes an f that is supposed to apply only to non-null children, but that isn't guaranteed here). They should take care of null children themselves, i.e. override genCode directly.

Contributor

Maybe we can add an assert: assert(nullable || (children.forall(!_.nullable)))
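
A sketch of where such an invariant could sit (hypothetical names, not the actual Expression hierarchy):

abstract class ExprSketch {
  def nullable: Boolean
  def children: Seq[ExprSketch]

  // Suggested invariant: a non-nullable expression that relies on the
  // null-safe codegen path must not have nullable children.
  protected def checkNullSafeCodeGen(): Unit =
    assert(nullable || children.forall(!_.nullable),
      "non-nullable expression with nullable children should override genCode directly")
}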

Contributor Author

Even if left or right is nullable, the new code is still correct, as long as the old code was correct.

@SparkQA

SparkQA commented Dec 18, 2015

Test build #47986 has finished for PR 10333 at commit 9f7d763.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • abstract class ImperativeAggregate extends AggregateFunction with CodegenFallback
      • case class UnresolvedWindowExpression(
      • case class WindowExpression(
      • case class Lead(input: Expression, offset: Expression, default: Expression)
      • case class Lag(input: Expression, offset: Expression, default: Expression)
      • abstract class AggregateWindowFunction extends DeclarativeAggregate with WindowFunction
      • abstract class RowNumberLike extends AggregateWindowFunction
      • trait SizeBasedWindowFunction extends AggregateWindowFunction
      • case class RowNumber() extends RowNumberLike
      • case class CumeDist() extends RowNumberLike with SizeBasedWindowFunction
      • case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindowFunction
      • abstract class RankLike extends AggregateWindowFunction
      • case class Rank(children: Seq[Expression]) extends RankLike
      • case class DenseRank(children: Seq[Expression]) extends RankLike
      • case class PercentRank(children: Seq[Expression]) extends RankLike with SizeBasedWindowFunction

@SparkQA

SparkQA commented Dec 18, 2015

Test build #47987 has finished for PR 10333 at commit 3b1e42f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 18, 2015

Test build #47993 has finished for PR 10333 at commit 3cc4cdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

If a TernaryExpression is nullable, we currently always generate 3 nested if branches. But we still have a chance to remove some of those branches when some children are non-nullable; how about doing this optimization based on the children's nullability?
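
A rough sketch of that idea (illustrative only, not Catalyst's actual generator): build the combined null check only from the children that can actually be null, so non-nullable children contribute no branch:

// Hypothetical helper: returns the guard condition for the generated code.
def nullCheckCondition(isNullVars: Seq[String], nullableFlags: Seq[Boolean]): String = {
  val guards = isNullVars.zip(nullableFlags).collect { case (v, true) => v }
  if (guards.isEmpty) "false" else guards.mkString(" || ")
}

// e.g. only the second child is nullable:
// nullCheckCondition(Seq("isNull1", "isNull2", "isNull3"), Seq(false, true, false))
// => "isNull2"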

Contributor Author

We could, but there are too many combinations.

@davies
Contributor Author

davies commented Dec 18, 2015

I'm merging this into master. There could still be some nullability bugs; we can fix them later.

@asfgit asfgit closed this in 4af647c Dec 18, 2015
@rxin
Contributor

rxin commented Dec 18, 2015

Do we have any performance numbers on this?

@davies
Contributor Author

davies commented Dec 18, 2015

@rxin I just ran a simple query:

sqlContext.range(1<<30).groupBy().sum().collect()

After this commit, the runtime went from 49.2s to 46.8s, about a 5% improvement.

