
Conversation

@mgaido91
Contributor

@mgaido91 mgaido91 commented Dec 3, 2018

What changes were proposed in this pull request?

When we encode a Decimal coming from an external source, we don't check for overflow. That check is useful not only to enforce that the value can be represented in the specified range: it also rescales the underlying data to the declared precision/scale. Since our code generation assumes that a Decimal has exactly the same precision and scale as its data type, failing to enforce this can lead to corrupted output/results when there are subsequent transformations.
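
For illustration, here is a minimal sketch of the behavior in question, exercising org.apache.spark.sql.types.Decimal directly (assuming the toPrecision signature as of Spark 2.4; this is a sketch, not the patch itself):

    import org.apache.spark.sql.types.Decimal

    object DecimalEnforceSketch {
      def main(args: Array[String]): Unit = {
        // Rescaling: toPrecision adjusts the underlying value to the declared
        // scale, so codegen can safely assume data and data type agree.
        val d = Decimal(BigDecimal("123.456789"))     // precision 9, scale 6
        println(d.toPrecision(10, 2))                 // 123.46, now matching (10, 2)

        // Overflow: a value that cannot fit the target precision/scale yields
        // null rather than a Decimal whose data contradicts its type.
        println(Decimal(BigDecimal("12345.6")).toPrecision(4, 2))  // null
      }
    }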

How was this patch tested?

added UT

@SparkQA

SparkQA commented Dec 3, 2018

Test build #99621 has finished for PR 23210 at commit 91d3e1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkAnswer(ds.select("x"), Seq(Row(1), Row(2)))
}

test("SPARK-26233: serializer should enforce decimal precision and scale") {
Member

Can we have a test case in RowEncoderSuite, too?

Contributor Author

Well, everything is possible, but it is not easy, actually, because the issue happens in codegen, not when we retrieve the output. So if we just encode and decode, everything is fine. The problem appears if there is any transformation in codegen in between, because there the underlying decimal is used directly (assuming it has the same precision and scale as its data type, which without the current change is not always true). I tried checking the precision and scale of the serialized object, but it is not really feasible, as they are converted when the value is read back (please see UnsafeRow)... So I'd avoid this, actually.
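
To illustrate the distinction, a hedged Dataset-level sketch (not the test added by this PR, and assuming the implicit scala.math.BigDecimal encoder available in recent Spark versions):

    import org.apache.spark.sql.SparkSession

    object RoundTripVsTransform {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[1]")
          .appName("sketch").getOrCreate()
        import spark.implicits._

        val ds = Seq(BigDecimal("1.23456789")).toDS()

        // Pure round trip: serialize, then deserialize. This passes even if the
        // stored Decimal's precision/scale disagrees with its DecimalType.
        ds.collect()

        // With a transformation in between, generated code operates on the
        // underlying Decimal assuming it matches the data type; this is where
        // a mismatch would surface as corrupted results.
        ds.map(_ * 2).collect()

        spark.stop()
      }
    }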

@dongjoon-hyun
Member

LGTM except one minor comment about a new test case.

Ping, @gatorsmile and @cloud-fan. Could you review this PR? It fixes a correctness issue.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.
It seems there are no other comments, so I'll merge this to master.

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 4, 2018

Thank you, @mgaido91. This needs to land on branch-2.4/branch-2.3/branch-2.2, but it fails on branch-2.4 due to conflicts in the test case file. Could you make separate backport PRs for each branch?

@asfgit asfgit closed this in 556d83e Dec 4, 2018
@cloud-fan
Contributor

a late LGTM

mgaido91 added a commit to mgaido91/spark that referenced this pull request Dec 5, 2018
When we encode a Decimal coming from an external source, we don't check for overflow. That check is useful not only to enforce that the value can be represented in the specified range: it also rescales the underlying data to the declared precision/scale. Since our code generation assumes that a Decimal has exactly the same precision and scale as its data type, failing to enforce this can lead to corrupted output/results when there are subsequent transformations.

added UT

Closes apache#23210 from mgaido91/SPARK-26233.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
mgaido91 added a commit to mgaido91/spark that referenced this pull request Dec 5, 2018
mgaido91 added a commit to mgaido91/spark that referenced this pull request Dec 5, 2018
@mgaido91
Contributor Author

mgaido91 commented Dec 5, 2018

Thanks @cloud-fan @dongjoon-hyun, I created the PRs for the backports.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019