Conversation

@viirya (Member) commented Dec 19, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-12439

In toCatalystArray, we should look at the data type returned by dataTypeFor, instead of silentSchemaFor, to determine whether the element is a native type. An obvious problem arises when the element type is Option[Int]: silentSchemaFor returns Int, so we wrongly recognize the element as a native type.
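
For illustration, here is a hypothetical Scala REPL snippet against a Spark 1.6-era build (schemaFor is assumed here as the public counterpart of the internal silentSchemaFor; this is not code from the PR):

    import org.apache.spark.sql.catalyst.ScalaReflection

    // The Option wrapper is erased into plain nullability, so an
    // Option[Int] element looks like a native Int to toCatalystArray:
    ScalaReflection.schemaFor[Option[Int]]
    // => Schema(IntegerType, nullable = true)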

There is another problem when using Option as an array element. When we encode data like Seq(Some(1), Some(2), None) with an encoder, we later use MapObjects to construct an array for it. But in MapObjects, we don't check whether the return value of lambdaFunction is null. That causes a bug where the decoded data for Seq(Some(1), Some(2), None) is Seq(1, 2, -1) instead of Seq(1, 2, null).
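
A loose, Spark-free model of that failure mode (unwrap, written, and readBack are names made up for this sketch; the real code paths are MapObjects and the generated row readers):

    object EncodeSketch extends App {
      val input: Seq[Option[Int]] = Seq(Some(1), Some(2), None)

      // The per-element lambda unwraps the Option, yielding null for None.
      val unwrap: Option[Int] => Integer = {
        case Some(v) => Integer.valueOf(v)
        case None    => null
      }

      // If the consumer never checks the lambda's return value for null and
      // later reads each slot as a primitive int, the null surfaces as a
      // default value such as -1 rather than a proper null entry.
      val written: Seq[Integer] = input.map(unwrap)
      val readBack: Seq[Int] = written.map(v => if (v == null) -1 else v.intValue)

      println(readBack) // List(1, 2, -1), matching the bug described above
    }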

@SparkQA commented Dec 19, 2015

Test build #48052 has finished for PR 10391 at commit 788a7c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Dec 21, 2015

cc @cloud-fan @marmbrus @davies

@cloud-fan (Contributor)

Good catch! One minor comment: can we write the test in ExpressionEncoderSuite?

@marmbrus (Contributor)

+1 to moving the test

Contributor

Do we have to add these classes? I think a simple test could be encodeDecodeTest(Seq(Some(1), None), "seq of option").

Member Author

We should also test the map case.

Member Author

Besides, encodeDecodeTest(Seq(Some(1), None), ...) may not work, because the converted-back result will be Seq(Some(1), null).

Member Author

I think I can remove these classes. I will update later.

@SparkQA commented Dec 22, 2015

Test build #48191 has finished for PR 10391 at commit 180048c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 22, 2015

Test build #48197 has finished for PR 10391 at commit a726fd8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Dec 22, 2015

retest this please.

Contributor

Can we use the helper function encodeDecodeTest? It can simplify the test code and give a detailed error message when it fails, thanks!

Member Author

Because the converted-back result for Seq(Some(1), None) is Seq(Some(1), null), encodeDecodeTest will report a failure. Any suggestions?

Contributor

That sounds like a bug.

Member Author

I am not sure about this. As you can see at ScalaReflectionRelationSuite.scala#L128, where this test originally comes from, it looks like getting Seq(Some(1), null) back for Seq(Some(1), None) is the expected behavior.

Contributor

The original implementation in ScalaReflection only had access to the schema as a DataType when it was going to produce external Row objects. That schema is erased, in the sense that it can't differentiate Option from null. This is not true for encoders, since we know the full type the user is expecting from the provided TypeTag.
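
A small illustration of that difference (a hypothetical snippet, assuming Spark 1.6's public ScalaReflection.schemaFor; not code from this PR):

    import scala.reflect.runtime.universe._
    import org.apache.spark.sql.catalyst.ScalaReflection

    // The TypeTag keeps the full type, Option included:
    typeOf[Seq[Option[Int]]]
    // => Seq[Option[Int]]

    // The derived DataType cannot tell Option[Int] apart from a nullable Int:
    ScalaReflection.schemaFor[Seq[Option[Int]]]
    // => Schema(ArrayType(IntegerType, containsNull = true), nullable = true)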

Member Author

OK. I am investigating it now.

@SparkQA commented Dec 22, 2015

Test build #48198 has finished for PR 10391 at commit a726fd8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Dec 22, 2015

retest this please.

@SparkQA commented Dec 22, 2015

Test build #48204 has finished for PR 10391 at commit a726fd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

This part of the fix is from PR #10401. Without it, the following fix to MapObjects will cause a null exception. Can we merge #10401 first, and then I will rebase here?

Member Author

Waiting for #10443 to get merged too.

@SparkQA commented Dec 23, 2015

Test build #48226 has finished for PR 10391 at commit 7389e15.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Dec 23, 2015

retest this please.

1 similar comment
@viirya (Member, Author) commented Dec 23, 2015

retest this please.

@SparkQA commented Dec 23, 2015

Test build #48232 has finished for PR 10391 at commit 7389e15.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Dec 23, 2015

retest this please.

@SparkQA commented Dec 23, 2015

Test build #48239 has finished for PR 10391 at commit 7389e15.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Dec 23, 2015

retest this please.

@SparkQA commented Dec 23, 2015

Test build #48245 has finished for PR 10391 at commit f5b9d50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 23, 2015

Test build #48246 has finished for PR 10391 at commit f5b9d50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Dec 24, 2015

@cloud-fan @marmbrus Are you OK with the latest updates?

@viirya (Member, Author) commented Dec 28, 2015

@cloud-fan Can you also review the updates? Thanks.

Contributor

I'm not sure exactly what is going on here, but this looks like a hack. Ideally this node should not need to know what type its child is to operate correctly.

Member Author

Let me explain it.

When we pass in an array containing None, the None is encoded as null internally. When we decode it back, WrapOption is called to reconstruct it.

The logic of MapObjects is to set an output element to null when the corresponding input element is null, so it never actually goes into WrapOption to reconstruct a None. In order to do that, we need to call lambdaFunction even when the element is null.

But we can't simply ignore loopVar.isNull and call every kind of lambdaFunction. I tried that before, but for some lambdaFunctions a null input value causes problematic results.

In the end, I can only check whether lambdaFunction is WrapOption to make the decision here. Do you have any suggestion other than a hack like this? Thanks.
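
To make the trade-off concrete, here is a Spark-free sketch of the decode path (wrapOption stands in for the WrapOption expression; both the names and the structure are simplifications of the generated code):

    object DecodeSketch extends App {
      // Internally, None was encoded as a null element.
      val encoded: Seq[Integer] = Seq(1, 2, null)

      // WrapOption's job: turn null back into None, anything else into Some.
      val wrapOption: Integer => Option[Int] = v => Option(v).map(_.intValue)

      // Current MapObjects behavior: skip the lambda for null inputs,
      // so a null element is never reconstructed into None.
      val skipLambdaOnNull = encoded.map(v => if (v == null) null else wrapOption(v))

      // Always calling the lambda reconstructs None, but is only safe for
      // lambdas (like WrapOption) that tolerate a null input.
      val alwaysCallLambda = encoded.map(wrapOption)

      println(skipLambdaOnNull) // List(Some(1), Some(2), null)
      println(alwaysCallLambda) // List(Some(1), Some(2), None)
    }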

Member Author

@marmbrus I have refactored this part and removed the WrapOption check. Please take a look and see if it is better now. Thanks.

Member Author

#10443 will fix this by using true as the default value for propagateNull. As that PR is not merged yet, I set propagateNull to true here so the tests pass for now. Once #10443 is merged, I can remove these two lines.

@viirya (Member, Author) commented Dec 29, 2015

@cloud-fan Please review this update if you have time. Thanks.

@SparkQA commented Dec 29, 2015

Test build #48407 has finished for PR 10391 at commit 7728124.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Same as with extractExpressions, we should also strip the top-most If for constructExpression.

Member Author

Unlike extractExpressions, constructExpression hasn't been wrapped with an If here, because there are two constructorFor methods and only the second one is wrapped with an If.

@cloud-fan (Contributor)

overall LGTM

@viirya (Member, Author) commented Dec 30, 2015

@marmbrus @cloud-fan If this patch looks fine, can you merge #10443 first so I can rebase some code here? Thanks.

@viirya (Member, Author) commented Dec 31, 2015

retest this please.

@SparkQA commented Dec 31, 2015

Test build #48538 has finished for PR 10391 at commit e541e04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Dec 31, 2015

@marmbrus I've rebased this with the updates. Please take a look and see if it is OK for you now. Thanks.

@viirya (Member, Author) commented Jan 3, 2016

ping @marmbrus @rxin

@viirya (Member, Author) commented Jan 4, 2016

ping @marmbrus please take a look at the updates, thanks.

@marmbrus (Contributor) commented Jan 5, 2016

Thanks, merging to master.

@asfgit closed this in d202ad2 on Jan 5, 2016
marmbrus pushed a commit to marmbrus/spark that referenced this pull request Jan 7, 2016
@viirya deleted the fix-catalystarray branch on December 27, 2023