-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-15140][SPARK-15441][SQL][WIP] support null object in encoder #13322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #59355 has finished for PR 13322 at commit
|
|
@cloud-fan Maybe it's not that easy to propogate the special column all the way down, we could just use this trick to fix the outer join issue? |
|
@davies , to add this trick to outer-join, we still need to improve our encoder framework to support it. So in this PR I added the infrastructure to encoder framework but only use it for outer join. |
|
Test build #59476 has finished for PR 13322 at commit
|
| val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDS().as("right") | ||
| val joined = left.joinWith(right, $"left.b" === $"right.b", "left") | ||
| joined.explain(true) | ||
| joined.show() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check the result?
|
My understanding is that this new added hidden column is mainly for serdes object to/from row. How would you leverage it to solve the the out join case where the null object is actually added during query execution? |
|
Test build #59667 has finished for PR 13322 at commit
|
|
@zhzhan , for outer join, if it's a typed join, then both of the join side will include this hidden column, as their columns are produced by serializing custom object. If we need to return null for one join side, we will null out all columns of that side, including the hidden column. Then when we deserialize it, we can know that this join side is null and we can build a null object for it. |
|
closing in favor of #13425 |
What changes were proposed in this pull request?
Currently we can't encoder top level null object into internal row, it throws NPE in 1.6 and returns incorrect result in 2.0.
The root cause is: Spark SQL doesn't allow row to be null, only its columns can be null. This is different from objects, object itself can be null, and its fields can also be null.
This is not a problem before, as we assume the input object is never null. However, for outer join, we do need the semantic of null object.
This PR tries to resolve this fundamental problem by adding a hidden column when serialize object to row, to indicate if the object is null or not.
How was this patch tested?
existing test and new test in
DatasetSuiteTODO: add more comments, code cleanup