[SPARK-12439][SQL] Fix toCatalystArray and MapObjects #10391
Test build #48052 has finished for PR 10391 at commit
Good catch! One minor comment: can we write the test in
+1 to moving the test
do we have to add these classes?
I think a simple test could be encodeDecodeTest(Seq(Some(1), None), "seq of option")
We should also test the map case.
Besides, encodeDecodeTest(Seq(Some(1), None), "seq of option") may not work, because the value converted back will be Seq(Some(1), null).
I think I can remove these classes. I will update later.
Test build #48191 has finished for PR 10391 at commit
Test build #48197 has finished for PR 10391 at commit
retest this please.
Can we use the helper function encodeDecodeTest? It simplifies the test code and gives a detailed error message when it fails. Thanks!
For Seq(Some(1), None), the result converted back is Seq(Some(1), null), so encodeDecodeTest reports a failure. Any suggestions?
That sounds like a bug.
I am not sure about this. ScalaReflectionRelationSuite.scala#L128, where this test originally comes from, suggests that Seq(Some(1), None) is expected to be converted back as Seq(Some(1), null).
The original implementation in ScalaReflection only had access to the schema as a DataType when it was going to produce external Row objects. This schema is erased in that it can't differentiate Option from null. This is not true for encoders since we know the full type the user is expecting from the provided TypeTag.
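To make the erasure point concrete, here is a minimal sketch in plain Scala. It does not use Spark: UserType, DataType, ArrayType, Field, and schemaFor are toy stand-ins defined only for this illustration, under the assumption that a schema records just an element type plus a nullability flag.

```scala
object ErasedSchemaSketch {
  // Toy description of the Scala type a user asks for (what a TypeTag would carry).
  sealed trait UserType
  case object IntT extends UserType                             // scala.Int, never null
  case object BoxedIntT extends UserType                        // java.lang.Integer, may be null
  final case class OptionT(element: UserType) extends UserType  // Option[element]
  final case class SeqT(element: UserType) extends UserType     // Seq[element]

  // Toy schema: only a data type and a nullability flag survive.
  sealed trait DataType
  case object IntegerType extends DataType
  final case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType
  final case class Field(dataType: DataType, nullable: Boolean)

  // Hypothetical mapping from the full user type to the erased schema.
  def schemaFor(t: UserType): Field = t match {
    case IntT       => Field(IntegerType, nullable = false)
    case BoxedIntT  => Field(IntegerType, nullable = true)
    case OptionT(e) => Field(schemaFor(e).dataType, nullable = true)
    case SeqT(e) =>
      val elem = schemaFor(e)
      Field(ArrayType(elem.dataType, elem.nullable), nullable = true)
  }

  def main(args: Array[String]): Unit = {
    val withOption = schemaFor(SeqT(OptionT(IntT)))
    val withBoxed  = schemaFor(SeqT(BoxedIntT))
    // Both user types collapse to the same schema, so the schema alone
    // cannot say whether a null element should come back as None or as null.
    println(withOption == withBoxed) // true
  }
}
```

In this toy model, only code that still holds the UserType value can decide to rebuild None, which is the role the comment above assigns to the TypeTag.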
OK. I am investigating this now.
Test build #48198 has finished for PR 10391 at commit
retest this please.
Test build #48204 has finished for PR 10391 at commit
Waiting for #10443 to get merged too.
Test build #48226 has finished for PR 10391 at commit
retest this please.
Test build #48232 has finished for PR 10391 at commit
retest this please.
Test build #48239 has finished for PR 10391 at commit
retest this please.
Test build #48245 has finished for PR 10391 at commit
Test build #48246 has finished for PR 10391 at commit
@cloud-fan @marmbrus Are you OK with the latest updates?
@cloud-fan Can you also review the updates? Thanks.
I'm not sure exactly what is going on here, but this looks like a hack. Ideally this node should not need to know what type its child is to operate correctly.
Let me explain.
When we pass in an array containing None, it is encoded as null internally. When we decode it back, WrapOption is called to reconstruct it.
MapObjects assigns null to an output element if the corresponding input element is null, so it never actually goes into WrapOption to reconstruct a None. To do that, we need to call lambdaFunction even when the element is null.
But we can't simply ignore loopVar.isNull and call every kind of lambdaFunction. I tried that before, but for some lambdaFunctions a null input value causes problematic results.
In the end, I can only check whether lambdaFunction is a WrapOption to make the decision here. Do you have any suggestion other than a hack like this? Thanks.
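The difference between skipping the function on null input and always calling it can be sketched in plain Scala. This is only a toy model of the behaviour described above, not Spark's MapObjects or WrapOption: wrapOption, mapSkippingNulls, and mapAlwaysCalling are hypothetical names, and the internal representation is assumed to be a sequence of nullable java.lang.Integer values.

```scala
object MapObjectsNullSketch {
  // Hypothetical stand-in for WrapOption: turns a nullable value into an Option.
  def wrapOption(x: Integer): Option[Int] = Option(x).map(_.intValue)

  // Mimics the described behaviour: a null input element becomes a null
  // output element, and the per-element function is never called for it.
  def mapSkippingNulls(internal: Seq[Integer]): Seq[Option[Int]] =
    internal.map(x => if (x == null) null else wrapOption(x))

  // Calls the per-element function for every element, null or not.
  def mapAlwaysCalling(internal: Seq[Integer]): Seq[Option[Int]] =
    internal.map(wrapOption)

  def main(args: Array[String]): Unit = {
    // Seq(Some(1), None) after encoding: None is stored as null.
    val internal: Seq[Integer] = Seq(Integer.valueOf(1), null)
    println(mapSkippingNulls(internal)) // List(Some(1), null)
    println(mapAlwaysCalling(internal)) // List(Some(1), None)
  }
}
```

Always calling the function works here only because wrapOption tolerates null. As the comment notes, other per-element functions may not, which is why simply ignoring the null check was not an option.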
@marmbrus I have refactored this part and removed the WrapOption check. Please take a look and see if it is better now. Thanks.
…side CreateStruct.
@cloud-fan Please review this update when you have time. Thanks.
Test build #48407 has finished for PR 10391 at commit
Same as with extractExpressions: we should also strip the top-most If for constructExpression.
Unlike extractExpressions, constructExpression hasn't been wrapped with an If here. There are two constructorFor methods, and the one wrapped with an If is the second one.
overall LGTM
@marmbrus @cloud-fan If there is no problem with this patch, can you merge #10443 first so I can rebase this code? Thanks.
retest this please.
Test build #48538 has finished for PR 10391 at commit
@marmbrus I've rebased this onto the latest updates. Please take a look and see if it is OK now. Thanks.
ping @marmbrus, please take a look at the updates. Thanks.
Thanks, merging to master.
JIRA: https://issues.apache.org/jira/browse/SPARK-12439
In toCatalystArray, we should look at the data type returned by dataTypeFor, instead of silentSchemaFor, to determine whether the element is a native type. An obvious problem arises when the element is an Option[Int]: silentSchemaFor returns Int, so we wrongly recognize the element as a native type.
There is another problem when using Option as an array element. When we encode data like Seq(Some(1), Some(2), None) with an encoder, we later use MapObjects to construct an array for it. But MapObjects doesn't check whether the return value of lambdaFunction is null. As a result, the decoded data for Seq(Some(1), Some(2), None) would be Seq(1, 2, -1) instead of Seq(1, 2, null).
Author: Liang-Chi Hsieh
Closes apache#10391 from viirya/fix-catalystarray.
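The second problem can be illustrated with a small plain-Scala sketch. It is a toy model, not Spark code: Evaluated, unwrap, convertIgnoringFlag, and convertCheckingFlag are hypothetical names. It assumes that evaluating a per-element function over a primitive yields a value together with a separate null flag, and that -1 is the placeholder value carried when the flag is set (the -1 is taken from the description above).

```scala
object IsNullFlagSketch {
  // Assumed shape of an evaluated primitive result: a value plus a null flag.
  final case class Evaluated(isNull: Boolean, value: Int)

  // Hypothetical per-element function: unwraps an Option[Int].
  // For None the flag is set and the value is only a placeholder.
  def unwrap(o: Option[Int]): Evaluated = o match {
    case Some(v) => Evaluated(isNull = false, value = v)
    case None    => Evaluated(isNull = true, value = -1)
  }

  // Ignores the null flag, so the placeholder leaks into the result.
  def convertIgnoringFlag(in: Seq[Option[Int]]): Seq[Any] =
    in.map(o => unwrap(o).value)

  // Checks the null flag and emits null instead of the placeholder.
  def convertCheckingFlag(in: Seq[Option[Int]]): Seq[Any] =
    in.map { o =>
      val r = unwrap(o)
      val out: Any = if (r.isNull) null else r.value
      out
    }

  def main(args: Array[String]): Unit = {
    val data = Seq(Some(1), Some(2), None)
    println(convertIgnoringFlag(data)) // List(1, 2, -1)
    println(convertCheckingFlag(data)) // List(1, 2, null)
  }
}
```

Under these assumptions, the fix amounts to consulting the flag before writing the element, which matches the expected Seq(1, 2, null).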