Commit f886d51

c21 authored and cloud-fan committed
[SPARK-37001][SQL] Disable two level of map for final hash aggregation by default
### What changes were proposed in this pull request?

This PR disables two-level maps for final hash aggregation by default. The feature was introduced in #32242, and we found it can lead to query performance regression when the final aggregation gets rows with a lot of distinct keys. In that case the 1st-level hash map is full, so many rows waste a lookup in the 1st-level map and are then inserted into the 2nd-level map. The feature still benefits queries with few distinct keys, so this PR introduces a config, `spark.sql.codegen.aggregate.final.map.twolevel.enabled`, that lets a query enable the feature when it sees a benefit.

### Why are the changes needed?

Fix query regression.

### Does this PR introduce _any_ user-facing change?

Yes, the introduced `spark.sql.codegen.aggregate.final.map.twolevel.enabled` config.

### How was this patch tested?

Existing unit test in `AggregationQuerySuite.scala`. Also verified the generated code for an example query in the file:

```
spark.sql(
  """
    |SELECT key, avg(value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin)
```

Verified that the generated code for final hash aggregation does not have two-level maps by default: https://gist.github.com/c21/d4ce87ef28a22d1ce839e0cda000ce14 .

Verified that the generated code for final hash aggregation has two-level maps when the config is enabled: https://gist.github.com/c21/4b59752c1f3f98303b60ccff66b5db69 .

Closes #34270 from c21/agg-fix.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 3354a21)
Signed-off-by: Wenchen Fan <[email protected]>
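
The regression described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `TwoLevelCounter`, `TwoLevelDemo`, and the fixed `firstLevelCapacity` are hypothetical names invented for illustration, and the code does not reproduce Spark's generated aggregation code. It counts how many rows probe a full first-level map without finding their key before falling through to the second-level map.

```scala
import scala.collection.mutable

// Toy model only (hypothetical, not Spark's implementation): a fixed-capacity
// first-level map in front of an unbounded second-level map.
final class TwoLevelCounter(firstLevelCapacity: Int) {
  private val first = mutable.HashMap.empty[Long, Long]
  private val second = mutable.HashMap.empty[Long, Long]

  // Number of rows that probed the full first-level map and missed.
  var wastedFirstLevelLookups: Long = 0L

  def add(key: Long): Unit = {
    if (first.contains(key)) {
      // Hit in the first level: update the count in place.
      first(key) = first(key) + 1L
    } else if (first.size < firstLevelCapacity) {
      // First level still has room: insert the new key there.
      first(key) = 1L
    } else {
      // First level is full and does not hold this key, so the probe above
      // was wasted work and the row goes to the second level.
      wastedFirstLevelLookups += 1L
      second(key) = second.getOrElse(key, 0L) + 1L
    }
  }

  def distinctKeys: Int = first.size + second.size
}

object TwoLevelDemo {
  // Feeds numRows rows whose keys cycle through numDistinct distinct values.
  def run(numRows: Int, numDistinct: Int, capacity: Int): TwoLevelCounter = {
    val counter = new TwoLevelCounter(capacity)
    (0 until numRows).foreach(i => counter.add((i % numDistinct).toLong))
    counter
  }
}
```

In this model, `TwoLevelDemo.run(1000, 10, 100)` never misses the first level, while `TwoLevelDemo.run(1000, 1000, 100)` wastes 900 first-level lookups, which mirrors the "few distinct keys" versus "a lot of distinct keys" cases in the description.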
1 parent d93d056 commit f886d51

3 files changed: +19 -2 lines changed

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala

Lines changed: 1 addition & 1 deletion
@@ -217,9 +217,9 @@ case class Grouping(child: Expression) extends Expression with Unevaluable
     Examples:
       > SELECT name, _FUNC_(), sum(age), avg(height) FROM VALUES (2, 'Alice', 165), (5, 'Bob', 180) people(age, name, height) GROUP BY cube(name, height);
         Alice 0 2 165.0
-        Bob 0 5 180.0
         Alice 1 2 165.0
         NULL 3 7 172.5
+        Bob 0 5 180.0
         Bob 1 5 180.0
         NULL 2 2 165.0
         NULL 2 5 180.0

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Lines changed: 10 additions & 0 deletions
@@ -1685,6 +1685,16 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)

+  val ENABLE_TWOLEVEL_AGG_MAP_PARTIAL_ONLY =
+    buildConf("spark.sql.codegen.aggregate.map.twolevel.partialOnly")
+      .internal()
+      .doc("Enable two-level aggregate hash map for partial aggregate only, " +
+        "because final aggregate might get more distinct keys compared to partial aggregate. " +
+        "Overhead of looking up 1st-level map might dominate when having a lot of distinct keys.")
+      .version("3.2.1")
+      .booleanConf
+      .createWithDefault(true)
+
   val ENABLE_VECTORIZED_HASH_MAP =
     buildConf("spark.sql.codegen.aggregate.map.vectorized.enable")
       .internal()
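
As a usage sketch, the config added in this hunk could be turned off for a session to restore two-level maps for final aggregation. The key is the one defined in this hunk (the commit message refers to a different name). Everything else is an assumption for illustration: a local Spark 3.2.1 or later session with `spark-sql` on the classpath, the hypothetical object name `TwoLevelFinalAggExample`, and made-up rows for the `agg1` view.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; requires spark-sql (3.2.1 or later) on the classpath.
object TwoLevelFinalAggExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("two-level-final-agg")
      .getOrCreate()

    // Key taken from the SQLConf.scala hunk. The default is true, meaning
    // two-level maps are used for partial aggregation only.
    spark.conf.set("spark.sql.codegen.aggregate.map.twolevel.partialOnly", "false")

    // Made-up data for illustration only.
    import spark.implicits._
    Seq((1, 10.0), (1, 20.0), (2, 30.0))
      .toDF("key", "value")
      .createOrReplaceTempView("agg1")

    spark.sql("SELECT key, avg(value) FROM agg1 GROUP BY key").show()
    spark.stop()
  }
}
```

Because the config is marked `.internal()`, it is not part of the documented public configuration surface, so treating it as a tuning knob is a judgment call for each workload.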

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala

Lines changed: 8 additions & 1 deletion
@@ -667,7 +667,14 @@ case class HashAggregateExec(
     val isNotByteArrayDecimalType = bufferSchema.map(_.dataType).filter(_.isInstanceOf[DecimalType])
       .forall(!DecimalType.isByteArrayDecimalType(_))

-    isSupported && isNotByteArrayDecimalType
+    val isEnabledForAggModes =
+      if (modes.forall(mode => mode == Partial || mode == PartialMerge)) {
+        true
+      } else {
+        !conf.getConf(SQLConf.ENABLE_TWOLEVEL_AGG_MAP_PARTIAL_ONLY)
+      }
+
+    isSupported && isNotByteArrayDecimalType && isEnabledForAggModes
   }

   private def enableTwoLevelHashMap(ctx: CodegenContext): Unit = {
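
The gating logic added in this hunk can be checked in isolation. The sketch below is a standalone model, not Spark code: the `AggregateMode` objects are simplified local stand-ins for Spark's aggregate modes, `TwoLevelGate` is a hypothetical name, and the `partialOnly` parameter stands in for the value of `SQLConf.ENABLE_TWOLEVEL_AGG_MAP_PARTIAL_ONLY`.

```scala
// Simplified local stand-ins for Spark's aggregate modes (assumption).
sealed trait AggregateMode
case object Partial extends AggregateMode
case object PartialMerge extends AggregateMode
case object Final extends AggregateMode
case object Complete extends AggregateMode

object TwoLevelGate {
  // Mirrors the condition in the hunk: aggregates whose modes are all
  // Partial or PartialMerge always qualify; any other mix qualifies only
  // when the partial-only switch is off.
  def isEnabledForAggModes(modes: Seq[AggregateMode], partialOnly: Boolean): Boolean =
    if (modes.forall(mode => mode == Partial || mode == PartialMerge)) true
    else !partialOnly
}
```

With the default `partialOnly = true`, a `Final` aggregate returns `false` and a `Partial` one returns `true`. An empty `modes` sequence also returns `true`, because `forall` holds vacuously on an empty collection.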
