[SPARK-15140][SPARK-15441][SQL][WIP] support null object in encoder #13322

cloud-fan · 2016-05-26T07:11:53Z

What changes were proposed in this pull request?

Currently we can't encode a top-level null object into an internal row: it throws an NPE in 1.6 and returns an incorrect result in 2.0.

The root cause is that Spark SQL doesn't allow a row to be null; only its columns can be null. Objects are different: an object itself can be null, and its fields can also be null.

This was not a problem before, as we assumed the input object is never null. However, for outer join, we do need the semantics of a null object.

This PR tries to resolve this fundamental problem by adding a hidden column when serializing an object to a row, to indicate whether the object is null.
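
To illustrate the idea, here is a minimal sketch in plain Scala. It does not use Spark's encoder API; `HiddenColumnSketch`, `Row`, `serialize`, and `deserialize` are hypothetical names, and the convention that the hidden column is non-null for a present object and null for a null object is an assumption made for this example.

```scala
// Hypothetical model, not Spark's encoder API:
// a row is a flat sequence of nullable columns and can never be null itself.
final case class ClassData(a: String, b: Int)

object HiddenColumnSketch {
  type Row = Vector[Any]

  // Column 0 is the hidden marker (assumed convention):
  // non-null when the object exists, null when the object itself is null.
  def serialize(obj: ClassData): Row =
    if (obj == null) Vector(null, null, null)
    else Vector(true, obj.a, obj.b)

  // The marker is checked first, so a null object is distinguishable
  // from a non-null object whose fields happen to be null.
  def deserialize(row: Row): ClassData =
    if (row(0) == null) null
    else ClassData(row(1).asInstanceOf[String], row(2).asInstanceOf[Int])
}
```

With this layout, `serialize(null)` and `serialize(ClassData(null, 0))` produce different rows, so both round-trip back to the original value.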

How was this patch tested?

Existing tests and a new test in DatasetSuite.

TODO: add more comments, code cleanup

SparkQA · 2016-05-26T08:21:54Z

Test build #59355 has finished for PR 13322 at commit d7369dd.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-05-26T23:30:35Z

@cloud-fan Maybe it's not that easy to propagate the special column all the way down. Could we just use this trick to fix the outer join issue?

cloud-fan · 2016-05-27T07:56:06Z

@davies, to add this trick to outer join, we still need to improve our encoder framework to support it. So in this PR I added the infrastructure to the encoder framework but only use it for outer join.

cloud-fan · 2016-05-27T07:57:16Z

cc @yhuai @marmbrus

SparkQA · 2016-05-27T09:10:23Z

Test build #59476 has finished for PR 13322 at commit 6b7a9f0.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-05-27T16:50:10Z

sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

+    val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDS().as("right")
+    val joined = left.joinWith(right, $"left.b" === $"right.b", "left")
+    joined.explain(true)
+    joined.show()


Should we check the result?

zhzhan · 2016-05-30T07:43:32Z

My understanding is that this newly added hidden column is mainly for serializing and deserializing objects to and from rows. How would you leverage it to solve the outer join case, where the null object is actually added during query execution?

SparkQA · 2016-05-31T20:05:53Z

Test build #59667 has finished for PR 13322 at commit d639122.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-05-31T20:07:54Z

@zhzhan, for outer join, if it's a typed join, then both join sides will include this hidden column, as their columns are produced by serializing a custom object. If we need to return null for one join side, we null out all columns of that side, including the hidden column. Then, when we deserialize it, we know that this join side is null and we can build a null object for it.
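
A rough sketch of that flow in plain Scala, without Spark: `OuterJoinSketch`, `leftJoinOnB`, and the three-column row layout are hypothetical and only model the idea that nulling out every column of the unmatched side, hidden column included, lets deserialization return a null object.

```scala
// Hypothetical model of a typed left outer join; not Spark's implementation.
final case class ClassData(a: String, b: Int)

object OuterJoinSketch {
  type Row = Vector[Any]
  private val width = 3 // hidden marker + columns a and b

  // Column 0 is the hidden marker; it is non-null for every serialized object.
  def serialize(obj: ClassData): Row = Vector(true, obj.a, obj.b)

  // A null marker means this side of the join was nulled out.
  def deserialize(row: Row): ClassData =
    if (row(0) == null) null
    else ClassData(row(1).asInstanceOf[String], row(2).asInstanceOf[Int])

  def leftJoinOnB(left: Seq[ClassData], right: Seq[ClassData]): Seq[(ClassData, ClassData)] = {
    val rightRows = right.map(serialize)
    val joinedRows: Seq[Row] = left.map(serialize).flatMap { l =>
      val matches = rightRows.filter(r => r(2) == l(2))
      // No match: pad the right side with nulls, including its hidden column.
      if (matches.isEmpty) Seq(l ++ Vector.fill[Any](width)(null))
      else matches.map(r => l ++ r)
    }
    // Split each joined row back into its two sides and deserialize each one.
    joinedRows.map(row => (deserialize(row.take(width)), deserialize(row.drop(width))))
  }
}
```

For example, with an assumed left side of `Seq(ClassData("a", 1), ClassData("b", 2))` and the right side from the test above, the unmatched left element is paired with `null` rather than with an object whose fields are all null.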

cloud-fan · 2016-05-31T23:03:03Z

closing in favor of #13425

support null object in outer join

6b7a9f0

cloud-fan force-pushed the outer-join branch from d7369dd to 6b7a9f0 on May 27, 2016

cloud-fan changed the title ~~[SPARK-15140][SPARK-15441][SQL][WIP] support null object in encoder~~ [SPARK-15441][SQL] support null object in outer join May 27, 2016

davies reviewed May 27, 2016

improve

bd76b18

cloud-fan added 2 commits May 31, 2016 10:11

Merge remote-tracking branch 'origin/master' into outer-join

4b3e9af

update

d639122

cloud-fan changed the title ~~[SPARK-15441][SQL] support null object in outer join~~ [SPARK-15140][SPARK-15441][SQL][WIP] support null object in encoder May 31, 2016

cloud-fan closed this May 31, 2016
