[SPARK-24957][SQL][BACKPORT-2.2] Average with decimal followed by aggregation returns wrong result #21949

mgaido91 · 2018-08-01T19:27:05Z

What changes were proposed in this pull request?

When we compute an average, the result is obtained by dividing the sum of the values by their count. When the result is a DecimalType, the way we cast and manage the precision and scale is not really optimized, and it is not consistent with what we normally do.

In particular, a problem can occur when the Divide operator returns a value whose precision and scale differ from the ones declared as the output type of Divide. In the case reported in the JIRA, for instance, the result of Divide is a Decimal(38, 36), while the output data type of Divide is Decimal(38, 22). This is not an issue when the Divide is followed by a CheckOverflow or a Cast to the right data type, as these operations return a decimal with the declared precision and scale. Although the Average operator does have a Cast, it may be bypassed if the declared type of the Divide result is the same as the type it is cast to, hence the issue reported in the JIRA may arise.
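The bypass described above can be illustrated with a small model in plain Python. This is only a sketch and not Spark code: `scale_of`, `to_scale` and `cast` are hypothetical helpers, Python's `decimal.Decimal` stands in for Spark's Decimal, and the sample value is invented for illustration rather than taken from the JIRA.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

# Wide working context so that rounding never fails for lack of digits.
WIDE = Context(prec=100)

def scale_of(value: Decimal) -> int:
    """Number of digits stored after the decimal point."""
    return max(0, -value.as_tuple().exponent)

def to_scale(value: Decimal, scale: int) -> Decimal:
    """Round the value so that it carries exactly `scale` fractional digits."""
    unit = Decimal((0, (1,), -scale))  # 1E-scale
    return value.quantize(unit, rounding=ROUND_HALF_UP, context=WIDE)

def cast(value: Decimal, declared_scale: int, target_scale: int) -> Decimal:
    """Hypothetical model of a cast that is skipped when the declared type
    already equals the target type, whatever the value really contains."""
    if declared_scale == target_scale:
        return value  # no-op: the stored scale is never checked
    return to_scale(value, target_scale)

# A division result that really carries 36 fractional digits,
# although its declared output type says scale 22.
raw = Decimal("1." + "6" * 35 + "7")

bypassed = cast(raw, declared_scale=22, target_scale=22)
enforced = to_scale(raw, 22)

print(scale_of(bypassed))  # stored scale after the skipped cast
print(scale_of(enforced))  # stored scale after explicit rounding
```

In this model the skipped cast leaves a value with 36 fractional digits under a type that declares 22, which is the kind of mismatch a later aggregation could misread.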

The PR proposes to use the normal rules and handling of the arithmetic operators for the Decimal data type, so that we both reuse the existing code (keeping a single logic for operations between decimals) and fix this problem, since the result is always guarded by CheckOverflow.
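As a rough illustration of why an unconditional guard removes the problem, the following plain-Python sketch models a CheckOverflow-like step. The function `check_overflow` is hypothetical, and returning `None` on overflow is an assumption about the null-on-overflow behaviour, not a statement of Spark's exact implementation.

```python
from decimal import Decimal, Context, ROUND_HALF_UP
from typing import Optional

# Wide working context so that rounding never fails for lack of digits.
WIDE = Context(prec=100)

def check_overflow(value: Decimal, precision: int, scale: int) -> Optional[Decimal]:
    """Hypothetical guard: always round to the declared scale, then
    return None if the result needs more digits than `precision`."""
    unit = Decimal((0, (1,), -scale))  # 1E-scale
    rounded = value.quantize(unit, rounding=ROUND_HALF_UP, context=WIDE)
    if len(rounded.as_tuple().digits) > precision:
        return None  # assumption: overflow is modelled as a null result
    return rounded

# Invented division result with 36 fractional digits.
divide_result = Decimal("1." + "6" * 35 + "7")

guarded = check_overflow(divide_result, precision=38, scale=22)
print(guarded)                      # value rounded to the declared scale
print(guarded.as_tuple().exponent)  # exponent matches the declared scale

too_big = check_overflow(Decimal("12345.678"), precision=5, scale=2)
print(too_big)                      # None: 12345.68 needs 7 digits
```

Because the guard in this sketch runs regardless of the declared input type, there is no same-type shortcut through which a value with the wrong scale could pass.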

How was this patch tested?

Added a unit test.


holdensmagicalunicorn · 2018-08-01T19:27:07Z

@mgaido91, thanks! I am a bot who has found some folks who might be able to help with the review: @rxin, @marmbrus and @gatorsmile

mgaido91 · 2018-08-01T19:27:23Z

cc @cloud-fan

gatorsmile

LGTM pending Jenkins.

SparkQA · 2018-08-01T22:09:11Z

Test build #93900 has finished for PR 21949 at commit 1f817a0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…regation returns wrong result

Author: Marco Gaido <[email protected]>

Closes #21949 from mgaido91/SPARK-24957_2.2.

gatorsmile · 2018-08-01T23:00:18Z

Thanks! Merged to 2.2.

Could you close this PR?

mgaido91 · 2018-08-02T08:45:14Z

Thanks, closed.


[SPARK-24957][SQL] Average with decimal followed by aggregation returns wrong result

1f817a0

gatorsmile reviewed Aug 1, 2018


mgaido91 closed this Aug 2, 2018
