-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-24957][SQL][BACKPORT-2.2] Average with decimal followed by aggregation returns wrong result #21949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@mgaido91, thanks! I am a bot who has found some folks who might be able to help with the review:@rxin, @marmbrus and @gatorsmile |
|
cc @cloud-fan |
gatorsmile
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM pending Jenkins.
|
Test build #93900 has finished for PR 21949 at commit
|
…regation returns wrong result ## What changes were proposed in this pull request? When we do an average, the result is computed dividing the sum of the values by their count. In the case the result is a DecimalType, the way we are casting/managing the precision and scale is not really optimized and it is not coherent with what we do normally. In particular, a problem can happen when the Divide operand returns a result which contains a precision and scale different by the ones which are expected as output of the Divide operand. In the case reported in the JIRA, for instance, the result of the Divide operand is a Decimal(38, 36), while the output data type for Divide is 38, 22. This is not an issue when the Divide is followed by a CheckOverflow or a Cast to the right data type, as these operations return a decimal with the defined precision and scale. Despite in the Average operator we do have a Cast, this may be bypassed if the result of Divide is the same type which it is casted to, hence the issue reported in the JIRA may arise. The PR proposes to use the normal rules/handling of the arithmetic operators with Decimal data type, so we both reuse the existing code (having a single logic for operations between decimals) and we fix this problem as the result is always guarded by CheckOverflow. ## How was this patch tested? added UT Author: Marco Gaido <[email protected]> Closes #21949 from mgaido91/SPARK-24957_2.2.
|
Thanks! Merged to 2.2. Could you close this PR? |
|
Thanks, closed. |
…regation returns wrong result ## What changes were proposed in this pull request? When we do an average, the result is computed dividing the sum of the values by their count. In the case the result is a DecimalType, the way we are casting/managing the precision and scale is not really optimized and it is not coherent with what we do normally. In particular, a problem can happen when the Divide operand returns a result which contains a precision and scale different by the ones which are expected as output of the Divide operand. In the case reported in the JIRA, for instance, the result of the Divide operand is a Decimal(38, 36), while the output data type for Divide is 38, 22. This is not an issue when the Divide is followed by a CheckOverflow or a Cast to the right data type, as these operations return a decimal with the defined precision and scale. Despite in the Average operator we do have a Cast, this may be bypassed if the result of Divide is the same type which it is casted to, hence the issue reported in the JIRA may arise. The PR proposes to use the normal rules/handling of the arithmetic operators with Decimal data type, so we both reuse the existing code (having a single logic for operations between decimals) and we fix this problem as the result is always guarded by CheckOverflow. ## How was this patch tested? added UT Author: Marco Gaido <[email protected]> Closes apache#21949 from mgaido91/SPARK-24957_2.2.
…regation returns wrong result ## What changes were proposed in this pull request? When we do an average, the result is computed dividing the sum of the values by their count. In the case the result is a DecimalType, the way we are casting/managing the precision and scale is not really optimized and it is not coherent with what we do normally. In particular, a problem can happen when the Divide operand returns a result which contains a precision and scale different by the ones which are expected as output of the Divide operand. In the case reported in the JIRA, for instance, the result of the Divide operand is a Decimal(38, 36), while the output data type for Divide is 38, 22. This is not an issue when the Divide is followed by a CheckOverflow or a Cast to the right data type, as these operations return a decimal with the defined precision and scale. Despite in the Average operator we do have a Cast, this may be bypassed if the result of Divide is the same type which it is casted to, hence the issue reported in the JIRA may arise. The PR proposes to use the normal rules/handling of the arithmetic operators with Decimal data type, so we both reuse the existing code (having a single logic for operations between decimals) and we fix this problem as the result is always guarded by CheckOverflow. ## How was this patch tested? added UT Author: Marco Gaido <[email protected]> Closes apache#21949 from mgaido91/SPARK-24957_2.2.
What changes were proposed in this pull request?
When we do an average, the result is computed dividing the sum of the values by their count. In the case the result is a DecimalType, the way we are casting/managing the precision and scale is not really optimized and it is not coherent with what we do normally.
In particular, a problem can happen when the Divide operand returns a result which contains a precision and scale different by the ones which are expected as output of the Divide operand. In the case reported in the JIRA, for instance, the result of the Divide operand is a Decimal(38, 36), while the output data type for Divide is 38, 22. This is not an issue when the Divide is followed by a CheckOverflow or a Cast to the right data type, as these operations return a decimal with the defined precision and scale. Despite in the Average operator we do have a Cast, this may be bypassed if the result of Divide is the same type which it is casted to, hence the issue reported in the JIRA may arise.
The PR proposes to use the normal rules/handling of the arithmetic operators with Decimal data type, so we both reuse the existing code (having a single logic for operations between decimals) and we fix this problem as the result is always guarded by CheckOverflow.
How was this patch tested?
added UT