[SPARK-24957][SQL] Average with decimal followed by aggregation returns wrong result #21910
Conversation
|
@mgaido91, thanks! I am a bot who has found some folks who might be able to help with the review: @rxin, @marmbrus and @gatorsmile |
|
Test build #93747 has finished for PR 21910 at commit
|
Cast(Cast(sum, dt) / Cast(count, DecimalType.bounded(DecimalType.MAX_PRECISION, 0)),
case _: DecimalType =>
  Cast(
    DecimalPrecision.decimalAndDecimal.lift(sum / Cast(count, DecimalType.LongDecimal)).get,
nit: can we just call apply instead of lift(...).get?
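For context on this nit, here is a minimal sketch of why `lift(...).get` and a direct `apply` are interchangeable on inputs where the function is defined. It assumes `decimalAndDecimal` is a Scala `PartialFunction`, which the `.lift` call in the diff suggests; `halveEven` is a hypothetical stand-in and not Spark code.

```scala
import scala.util.Try

object LiftVsApply {
  // Hypothetical partial function, defined only for even inputs.
  val halveEven: PartialFunction[Int, Int] = {
    case n if n % 2 == 0 => n / 2
  }

  def main(args: Array[String]): Unit = {
    // On a defined input both forms return the same value.
    println(halveEven.lift(4).get) // 2
    println(halveEven(4))          // 2

    // lift wraps the result in an Option, so an undefined input gives None.
    println(halveEven.lift(3))     // None

    // On an undefined input both forms fail, only with different exceptions.
    val viaLift = Try(halveEven.lift(3).get)
    val viaApply = Try(halveEven(3))
    println(viaLift.failed.get.getClass.getSimpleName)  // NoSuchElementException
    println(viaApply.failed.get.getClass.getSimpleName) // MatchError
  }
}
```

Since neither form handles the undefined case gracefully, calling `apply` directly is the shorter spelling and skips the intermediate `Option`.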
|
good catch! LGTM |
|
Test build #93779 has finished for PR 21910 at commit
|
|
thanks, merging to master/2.3! |
…ns wrong result

## What changes were proposed in this pull request?

When we compute an average, the result is the sum of the values divided by their count. When the result is a `DecimalType`, the way we cast and manage precision and scale is not well optimized and is inconsistent with how decimals are normally handled.

In particular, a problem can occur when the `Divide` operator returns a value whose precision and scale differ from those declared as the output type of `Divide`. In the case reported in the JIRA, for instance, the result of `Divide` is a `Decimal(38, 36)`, while the output data type of `Divide` has precision 38 and scale 22. This is not an issue when the `Divide` is followed by a `CheckOverflow` or a `Cast` to the right data type, since these operations return a decimal with the defined precision and scale. Although the `Average` operator does have a `Cast`, it may be bypassed when the result type of `Divide` is the same as the type it is cast to, hence the issue reported in the JIRA.

The PR proposes to use the normal rules for arithmetic operators on the Decimal data type, so that we both reuse the existing code (a single logic for operations between decimals) and fix this problem, as the result is always guarded by `CheckOverflow`.

## How was this patch tested?

Added a unit test.

Author: Marco Gaido <[email protected]>

Closes #21910 from mgaido91/SPARK-24957.

(cherry picked from commit 85505fc)
Signed-off-by: Wenchen Fan <[email protected]>
|
@mgaido91 do you mind opening a PR for 2.2? I think this fixes a serious bug that is very hard to detect. Maybe that's the reason no one has reported it for such a long time. |
|
@cloud-fan sure, will do (anyway the cherry-pick to 2.2 was clean for me) |
What changes were proposed in this pull request?
When we compute an average, the result is the sum of the values divided by their count. When the result is a `DecimalType`, the way we cast and manage precision and scale is not well optimized and is inconsistent with how decimals are normally handled.
In particular, a problem can occur when the `Divide` operator returns a value whose precision and scale differ from those declared as the output type of `Divide`. In the case reported in the JIRA, for instance, the result of `Divide` is a `Decimal(38, 36)`, while the output data type of `Divide` has precision 38 and scale 22. This is not an issue when the `Divide` is followed by a `CheckOverflow` or a `Cast` to the right data type, since these operations return a decimal with the defined precision and scale. Although the `Average` operator does have a `Cast`, it may be bypassed when the result type of `Divide` is the same as the type it is cast to, hence the issue reported in the JIRA.
The PR proposes to use the normal rules for arithmetic operators on the Decimal data type, so that we both reuse the existing code (a single logic for operations between decimals) and fix this problem, as the result is always guarded by `CheckOverflow`.
How was this patch tested?
Added a unit test.
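
The failure mode described above can be sketched outside Spark with plain `java.math.BigDecimal`. This is only an illustration under stated assumptions: `DecType`, `castIfNeeded` and `checkOverflow` are hypothetical helpers that mimic a cast skipped when the types already match and a guard that always rescales; they are not Spark's `Cast` or `CheckOverflow` implementations.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Hypothetical type descriptor (precision, scale); not a Spark class.
final case class DecType(precision: Int, scale: Int)

object DecimalGuardSketch {
  // Mimics a cast that is skipped when source and target types are equal.
  def castIfNeeded(value: JBigDecimal, from: DecType, to: DecType): JBigDecimal =
    if (from == to) value else value.setScale(to.scale, RoundingMode.HALF_UP)

  // Mimics an overflow guard: always rescale, and return None if the
  // rescaled value needs more digits than the declared precision allows.
  def checkOverflow(value: JBigDecimal, to: DecType): Option[JBigDecimal] = {
    val rescaled = value.setScale(to.scale, RoundingMode.HALF_UP)
    if (rescaled.precision <= to.precision) Some(rescaled) else None
  }

  def main(args: Array[String]): Unit = {
    val declared = DecType(38, 22)
    // The division is actually carried out with 36 fractional digits,
    // although the declared output type has scale 22.
    val raw = new JBigDecimal("12").divide(new JBigDecimal("7"), 36, RoundingMode.HALF_UP)

    val afterCast = castIfNeeded(raw, declared, declared)
    val guarded = checkOverflow(raw, declared)

    println(afterCast.scale)      // 36: the no-op cast leaves the wrong scale
    println(guarded.map(_.scale)) // Some(22): the guard enforces the declared scale
  }
}
```

The point of the sketch is that a cast elided because the declared types match cannot repair a value whose actual scale disagrees with its declared type, whereas an unconditional guard can.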