
Conversation

@cloud-fan (Contributor) commented May 26, 2016

What changes were proposed in this pull request?

Currently we can't encode a top-level null object into an internal row: it throws an NPE in 1.6 and returns an incorrect result in 2.0.

The root cause: Spark SQL doesn't allow a row to be null; only its columns can be null. This is different from objects, where the object itself can be null in addition to its fields.

This was not a problem before, as we assumed the input object is never null. However, for outer joins we do need the semantics of a null object.

This PR tries to resolve this fundamental problem by adding a hidden column when serializing an object to a row, to indicate whether the object is null.
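The hidden-column idea can be sketched in plain Scala. This is a hypothetical illustration, not Spark's actual encoder machinery: the `serialize`/`deserialize` helpers and the `Array[Any]` "row" are stand-ins for Spark's internal `InternalRow` and encoder expressions.

```scala
// Hypothetical sketch of the hidden-column idea; Spark's real encoders
// operate on InternalRow via generated expressions, not Array[Any].
case class Person(name: String, age: Int)

// The extra leading boolean plays the role of the hidden column:
// true means the top-level object itself is null.
def serialize(p: Person): Array[Any] =
  if (p == null) Array(true, null, null)
  else Array(false, p.name, p.age)

def deserialize(row: Array[Any]): Person =
  if (row(0) == true) null
  else Person(row(1).asInstanceOf[String], row(2).asInstanceOf[Int])
```

With this extra flag, a null object round-trips through a row without losing the distinction between "the object is null" and "all of its fields are null".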

How was this patch tested?

Existing tests and a new test in DatasetSuite.

TODO: add more comments, code cleanup

@SparkQA commented May 26, 2016

Test build #59355 has finished for PR 13322 at commit d7369dd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor) commented May 26, 2016

@cloud-fan Maybe it's not that easy to propagate the special column all the way down; could we just use this trick to fix the outer join issue?

@cloud-fan cloud-fan changed the title [SPARK-15140][SPARK-15441][SQL][WIP] support null object in encoder [SPARK-15441][SQL] support null object in outer join May 27, 2016
@cloud-fan (Contributor, Author) commented:

@davies, to add this trick to outer join, we still need to improve our encoder framework to support it. So in this PR I added the infrastructure to the encoder framework but only use it for outer join.

@cloud-fan (Contributor, Author) commented:

cc @yhuai @marmbrus

@SparkQA commented May 27, 2016

Test build #59476 has finished for PR 13322 at commit 6b7a9f0.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDS().as("right")
val joined = left.joinWith(right, $"left.b" === $"right.b", "left")
joined.explain(true)
joined.show()
A reviewer (Contributor) commented on this diff:

Should we check the result?

@zhzhan (Contributor) commented May 30, 2016

My understanding is that this newly added hidden column is mainly for serializing/deserializing objects to/from rows. How would you leverage it to solve the outer join case, where the null object is actually introduced during query execution?

@cloud-fan cloud-fan changed the title [SPARK-15441][SQL] support null object in outer join [SPARK-15140][SPARK-15441][SQL][WIP] support null object in encoder May 31, 2016
@SparkQA commented May 31, 2016

Test build #59667 has finished for PR 13322 at commit d639122.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author) commented:

@zhzhan, for outer join, if it's a typed join, both join sides will include this hidden column, as their columns are produced by serializing custom objects. If we need to return null for one join side, we null out all columns of that side, including the hidden column. Then when we deserialize it, we know that this join side is null and can build a null object for it.
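The outer-join path described above can be sketched in plain Scala. Again, the names here (`deserialize`, `nullOutSide`) and the `Array[Any]` row are hypothetical stand-ins; the real work happens in Spark's join operators and generated deserializer expressions.

```scala
// Hypothetical sketch: row layout is (hiddenFlag, name, age).
case class Person(name: String, age: Int)

// A nulled-out hidden column is treated the same as an explicit "true"
// flag: both mean the top-level object is null. Nulling out happens
// when an outer join finds no match for this side.
def deserialize(row: Array[Any]): Person =
  if (row(0) == null || row(0) == true) null
  else Person(row(1).asInstanceOf[String], row(2).asInstanceOf[Int])

// What the join does for a missing side: null out every column,
// including the hidden one.
def nullOutSide(width: Int): Array[Any] = Array.fill[Any](width)(null)
```

So the join itself doesn't need to know about the hidden column: by blindly nulling out the whole side, it also nulls the flag, and the deserializer reconstructs a null object.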

@cloud-fan (Contributor, Author) commented:

Closing in favor of #13425.

@cloud-fan cloud-fan closed this May 31, 2016