Skip to content

Conversation

@mgaido91
Copy link
Contributor

What changes were proposed in this pull request?

When we do an average, the result is computed dividing the sum of the values by their count. In the case the result is a DecimalType, the way we are casting/managing the precision and scale is not really optimized and it is not coherent with what we do normally.

In particular, a problem can happen when the Divide operand returns a result which contains a precision and scale different by the ones which are expected as output of the Divide operand. In the case reported in the JIRA, for instance, the result of the Divide operand is a Decimal(38, 36), while the output data type for Divide is 38, 22. This is not an issue when the Divide is followed by a CheckOverflow or a Cast to the right data type, as these operations return a decimal with the defined precision and scale. Despite in the Average operator we do have a Cast, this may be bypassed if the result of Divide is the same type which it is casted to, hence the issue reported in the JIRA may arise.

The PR proposes to use the normal rules/handling of the arithmetic operators with Decimal data type, so we both reuse the existing code (having a single logic for operations between decimals) and we fix this problem as the result is always guarded by CheckOverflow.

How was this patch tested?

added UT

@holdensmagicalunicorn
Copy link

@mgaido91, thanks! I am a bot who has found some folks who might be able to help with the review:@rxin, @marmbrus and @gatorsmile

@SparkQA
Copy link

SparkQA commented Jul 29, 2018

Test build #93747 has finished for PR 21910 at commit 1a1252d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Cast(Cast(sum, dt) / Cast(count, DecimalType.bounded(DecimalType.MAX_PRECISION, 0)),
case _: DecimalType =>
Cast(
DecimalPrecision.decimalAndDecimal.lift(sum / Cast(count, DecimalType.LongDecimal)).get,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we just call apply instead of lift(...).get?

@cloud-fan
Copy link
Contributor

good catch! LGTM

@SparkQA
Copy link

SparkQA commented Jul 30, 2018

Test build #93779 has finished for PR 21910 at commit 1938be0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 85505fc Jul 30, 2018
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.3!

1 similar comment
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jul 30, 2018
…ns wrong result

## What changes were proposed in this pull request?

When we do an average, the result is computed dividing the sum of the values by their count. In the case the result is a DecimalType, the way we are casting/managing the precision and scale is not really optimized and it is not coherent with what we do normally.

In particular, a problem can happen when the `Divide` operand returns a result which contains a precision and scale different by the ones which are expected as output of the `Divide` operand. In the case reported in the JIRA, for instance, the result of the `Divide` operand is a `Decimal(38, 36)`, while the output data type for `Divide` is 38, 22. This is not an issue when the `Divide` is followed by a `CheckOverflow` or a `Cast` to the right data type, as these operations return a decimal with the defined precision and scale. Despite in the `Average` operator we do have a `Cast`, this may be bypassed if the result of `Divide` is the same type which it is casted to, hence the issue reported in the JIRA may arise.

The PR proposes to use the normal rules/handling of the arithmetic operators with Decimal data type, so we both reuse the existing code (having a single logic for operations between decimals) and we fix this problem as the result is always guarded by `CheckOverflow`.

## How was this patch tested?

added UT

Author: Marco Gaido <[email protected]>

Closes #21910 from mgaido91/SPARK-24957.

(cherry picked from commit 85505fc)
Signed-off-by: Wenchen Fan <[email protected]>
robert3005 pushed a commit to palantir/spark that referenced this pull request Jul 31, 2018
…ns wrong result

## What changes were proposed in this pull request?

When we do an average, the result is computed dividing the sum of the values by their count. In the case the result is a DecimalType, the way we are casting/managing the precision and scale is not really optimized and it is not coherent with what we do normally.

In particular, a problem can happen when the `Divide` operand returns a result which contains a precision and scale different by the ones which are expected as output of the `Divide` operand. In the case reported in the JIRA, for instance, the result of the `Divide` operand is a `Decimal(38, 36)`, while the output data type for `Divide` is 38, 22. This is not an issue when the `Divide` is followed by a `CheckOverflow` or a `Cast` to the right data type, as these operations return a decimal with the defined precision and scale. Despite in the `Average` operator we do have a `Cast`, this may be bypassed if the result of `Divide` is the same type which it is casted to, hence the issue reported in the JIRA may arise.

The PR proposes to use the normal rules/handling of the arithmetic operators with Decimal data type, so we both reuse the existing code (having a single logic for operations between decimals) and we fix this problem as the result is always guarded by `CheckOverflow`.

## How was this patch tested?

added UT

Author: Marco Gaido <[email protected]>

Closes apache#21910 from mgaido91/SPARK-24957.
@cloud-fan
Copy link
Contributor

@mgaido91 do you mind open a PR for 2.2? I think this fixes a serious bug which is very hard to detect. Maybe that's the reason no one report it for such a long time.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 1, 2018

@cloud-fan sure, will do (anyway the cherry-pick to 2.2 was clean for me)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants