[SPARK-24957][SQL] Average with decimal followed by aggregation returns wrong result

mgaido91 · cloud-fan · commit 25ea27b09147 · 2018-07-30T20:58:09.000+08:00
## What changes were proposed in this pull request?

When we compute an average, the result is the sum of the values divided by their count. When the result is a `DecimalType`, the way we cast and manage precision and scale is not well optimized and is not consistent with how we normally handle decimals. In particular, a problem can occur when the `Divide` operator returns a value whose precision and scale differ from the ones declared as the output type of `Divide`. In the case reported in the JIRA, for instance, the result of `Divide` is a `Decimal(38, 36)`, while the output data type of `Divide` is `Decimal(38, 22)`.

This is not an issue when the `Divide` is followed by a `CheckOverflow` or a `Cast` to the right data type, because these operations return a decimal with the declared precision and scale. Although the `Average` operator does have a `Cast`, it may be bypassed when the result type of `Divide` is the same as the type it is cast to, so the issue reported in the JIRA can arise.

This PR proposes to use the normal rules for arithmetic operators on the Decimal data type. This way we reuse the existing code (a single logic for operations between decimals) and we fix the problem, because the result is always guarded by `CheckOverflow`.

## How was this patch tested?

Added a unit test.

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #21910 from mgaido91/SPARK-24957.

(cherry picked from commit 85505fc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
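To make the "normal rules" for decimal division more concrete, here is a minimal, Spark-free sketch of how the result type of `sum / count` could be derived. It is an illustration, not the Catalyst code: `Dec`, `DivideTypeSketch`, `adjust` and `divideType` are hypothetical names, the formulas are my reading of the rules documented in `DecimalPrecision` (with precision loss allowed), and the negative-scale case is left out. It also assumes that the sum buffer has type `Decimal(38, 18)`, as it would for a Scala `BigDecimal` column, and that `DecimalType.LongDecimal` is `Decimal(20, 0)`.

```scala
// Hypothetical stand-in for Spark's DecimalType; not the real class.
final case class Dec(precision: Int, scale: Int)

object DivideTypeSketch {
  val MaxPrecision = 38     // assumed maximum decimal precision
  val MinAdjustedScale = 6  // assumed minimum scale kept when digits must be dropped

  // When the ideal type needs more than 38 digits, keep the integer digits
  // and reduce the scale, but not below min(scale, 6).
  def adjust(precision: Int, scale: Int): Dec =
    if (precision <= MaxPrecision) {
      Dec(precision, scale)
    } else {
      val intDigits = precision - scale
      val minScale = math.min(scale, MinAdjustedScale)
      Dec(MaxPrecision, math.max(MaxPrecision - intDigits, minScale))
    }

  // Assumed rule for e1 / e2:
  //   scale     = max(6, s1 + p2 + 1)
  //   precision = p1 - s1 + s2 + scale
  def divideType(l: Dec, r: Dec): Dec = {
    val scale = math.max(6, l.scale + r.precision + 1)
    val precision = l.precision - l.scale + r.scale + scale
    adjust(precision, scale)
  }

  def main(args: Array[String]): Unit = {
    val sum = Dec(38, 18)   // assumed type of the sum buffer
    val count = Dec(20, 0)  // count cast to a long-sized decimal
    println(divideType(sum, count))  // Dec(38,18)
  }
}
```

Under these assumptions the division is typed as `Decimal(38, 18)`, which differs from the `Decimal(38, 22)` result type of the average, so the outer `Cast` in the patched `evaluateExpression` would not be a no-op.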
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala
@@ -89,7 +89,7 @@ object DecimalPrecision extends TypeCoercionRule {
   }
 
   /** Decimal precision promotion for +, -, *, /, %, pmod, and binary comparison. */
-  private val decimalAndDecimal: PartialFunction[Expression, Expression] = {
+  private[catalyst] val decimalAndDecimal: PartialFunction[Expression, Expression] = {
     // Skip nodes whose children have not been resolved yet
     case e if !e.childrenResolved => e
 
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
@@ -17,7 +17,7 @@
 
 package org.apache.spark.sql.catalyst.expressions.aggregate
 
-import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.{DecimalPrecision, TypeCheckResult}
 import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.util.TypeUtils
@@ -77,10 +77,9 @@ case class Average(child: Expression) extends DeclarativeAggregate with Implicit
 
   // If all input are nulls, count will be 0 and we will get null after the division.
   override lazy val evaluateExpression = child.dataType match {
-    case DecimalType.Fixed(p, s) =>
-      // increase the precision and scale to prevent precision loss
-      val dt = DecimalType.bounded(p + 14, s + 4)
-      Cast(Cast(sum, dt) / Cast(count, DecimalType.bounded(DecimalType.MAX_PRECISION, 0)),
+    case _: DecimalType =>
+      Cast(
+        DecimalPrecision.decimalAndDecimal(sum / Cast(count, DecimalType.LongDecimal)),
         resultType)
     case _ =>
       Cast(sum, resultType) / Cast(count, resultType)
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala
@@ -1005,6 +1005,19 @@ abstract class AggregationQuerySuite extends QueryTest with SQLTestUtils with Te
       )
     )
   }
+
+  test("SPARK-24957: average with decimal followed by aggregation returning wrong result") {
+    val df = Seq(("a", BigDecimal("12.0")),
+      ("a", BigDecimal("12.0")),
+      ("a", BigDecimal("11.9999999988")),
+      ("a", BigDecimal("12.0")),
+      ("a", BigDecimal("12.0")),
+      ("a", BigDecimal("11.9999999988")),
+      ("a", BigDecimal("11.9999999988"))).toDF("text", "number")
+    val agg1 = df.groupBy($"text").agg(avg($"number").as("avg_res"))
+    val agg2 = agg1.groupBy($"text").agg(sum($"avg_res"))
+    checkAnswer(agg2, Row("a", BigDecimal("11.9999999994857142860000")))
+  }
 }
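The expected value in the new test can be checked by hand with plain `java.math.BigDecimal` arithmetic. The sketch below is only a sanity check, not how Spark evaluates the query: `ExpectedAverageSketch` is a hypothetical name, and the intermediate scale of 18 with `HALF_UP` rounding is an assumption that happens to be consistent with the four trailing zeros in the expected result.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object ExpectedAverageSketch {
  // The seven input values used by the test, all in group "a".
  val values: Seq[JBigDecimal] =
    Seq("12.0", "12.0", "11.9999999988", "12.0", "12.0", "11.9999999988", "11.9999999988")
      .map(new JBigDecimal(_))

  // Exact sum of the inputs: 83.9999999964
  val total: JBigDecimal = values.reduce(_.add(_))

  // Assumption: the division is carried out at scale 18 with HALF_UP rounding.
  val avg18: JBigDecimal =
    total.divide(new JBigDecimal(values.size), 18, RoundingMode.HALF_UP)

  // Widening to scale 22 only pads with zeros; no rounding is needed.
  val avg22: JBigDecimal = avg18.setScale(22)

  def main(args: Array[String]): Unit = {
    println(avg22.toPlainString)  // 11.9999999994857142860000
  }
}
```

Because the first aggregation yields a single row for group `"a"`, the outer `sum` over `avg_res` returns that same value.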
 
 
