-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-14139][SQL] RowEncoder should preserve schema nullability #12364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…test RowEncoderSuite:encode/decode:Product
|
Test build #55729 has finished for PR 12364 at commit
|
|
Test build #55730 has finished for PR 12364 at commit
|
| val encoder = RowEncoder(schema) | ||
| assert(encoder.serializer.length == 1) | ||
| assert(encoder.serializer.head.dataType == IntegerType) | ||
| assert(encoder.serializer.head.nullable == false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we throw an exception if there is a null in the data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point, actually we will, we should add the runtime null check like we did for product encoder
|
Test build #55831 has finished for PR 12364 at commit
|
|
Test build #55828 has finished for PR 12364 at commit
|
|
Is the TODO in the pr description still valid? |
|
yea, it's still valid, but I'd like to do it in follow-ups, as this PR is taken over from other people, I don't want to enlarge the scope too much. |
|
OK please label it as a todo for future pr; otherwise it is difficult to tell if it is meant to be done as the current one. |
|
Test build #56083 has finished for PR 12364 at commit
|
|
Test build #56165 has finished for PR 12364 at commit
|
| val inputObject = BoundReference(0, ObjectType(cls), nullable = true) | ||
| // We use an If expression to wrap extractorsFor result of StructType | ||
| val serializer = serializerFor(inputObject, schema).asInstanceOf[If].falseValue | ||
| val inputObject = BoundReference(0, ObjectType(cls), nullable = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reason of this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The input object should never be null, we also use this assumption in ExpressionEncoder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually ExpressionEncoder allows null input now. But I agree that for RowEncoder this assumption is reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably add a comment here. It's not super intuitive.
|
Test build #57607 has finished for PR 12364 at commit
|
|
Test build #57633 has finished for PR 12364 at commit
|
|
Test build #57716 has finished for PR 12364 at commit
|
|
cc @liancheng |
| // We use an If expression to wrap extractorsFor result of StructType | ||
| val serializer = serializerFor(inputObject, schema).asInstanceOf[If].falseValue | ||
| val inputObject = BoundReference(0, ObjectType(cls), nullable = false) | ||
| val serializer = serializerFor(inputObject, schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is also because we don't allow null input Rows now, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, there is no if anymore for the top row object.
|
Mostly LGTM except a few minor comments. |
|
Test build #57879 has finished for PR 12364 at commit
|
|
Test build #57894 has finished for PR 12364 at commit
|
|
LGTM, merging to master and branch-2.0. |
## What changes were proposed in this pull request? The problem is: In `RowEncoder`, we use `Invoke` to get the field of an external row, which lose the nullability information. This PR creates a `GetExternalRowField` expression, so that we can preserve the nullability info. TODO: simplify the null handling logic in `RowEncoder`, to remove so many if branches, in follow-up PR. ## How was this patch tested? new tests in `RowEncoderSuite` Note that, This PR takes over #11980, with a little simplification, so all credits should go to koertkuipers Author: Wenchen Fan <[email protected]> Author: Koert Kuipers <[email protected]> Closes #12364 from cloud-fan/nullable. (cherry picked from commit 55cc1c9) Signed-off-by: Cheng Lian <[email protected]>
…Encoder ## What changes were proposed in this pull request? SPARK-15241: We now support java decimal and catalyst decimal in external row, it makes sense to also support scala decimal. SPARK-15242: This is a long-standing bug, and is exposed after #12364, which eliminate the `If` expression if the field is not nullable: ``` val fieldValue = serializerFor( GetExternalRowField(inputObject, i, externalDataTypeForInput(f.dataType)), f.dataType) if (f.nullable) { If( Invoke(inputObject, "isNullAt", BooleanType, Literal(i) :: Nil), Literal.create(null, f.dataType), fieldValue) } else { fieldValue } ``` Previously, we always use `DecimalType.SYSTEM_DEFAULT` as the output type of converted decimal field, which is wrong as it doesn't match the real decimal type. However, it works well because we always put converted field into `If` expression to do the null check, and `If` use its `trueValue`'s data type as its output type. Now if we have a not nullable decimal field, then the converted field's output type will be `DecimalType.SYSTEM_DEFAULT`, and we will write wrong data into unsafe row. The fix is simple, just use the given decimal type as the output type of converted decimal field. These 2 issues was found at #13008 ## How was this patch tested? new tests in RowEncoderSuite Author: Wenchen Fan <[email protected]> Closes #13019 from cloud-fan/encoder-decimal. (cherry picked from commit d8935db) Signed-off-by: Davies Liu <[email protected]>
What changes were proposed in this pull request?
The problem is: In
RowEncoder, we useInvoketo get the field of an external row, which lose the nullability information. This PR creates aGetExternalRowFieldexpression, so that we can preserve the nullability info.TODO: simplify the null handling logic in
RowEncoder, to remove so many if branches, in follow-up PR.How was this patch tested?
new tests in
RowEncoderSuiteNote that, This PR takes over #11980, with a little simplification, so all credits should go to @koertkuipers