[SPARK-16525][SQL] Enable Row Based HashMap in HashAggregateExec #14176
Conversation
Test build #62227 has finished for PR 14176 at commit
Test build #62229 has finished for PR 14176 at commit
Test build #62349 has finished for PR 14176 at commit
Test build #62382 has finished for PR 14176 at commit
Test build #62440 has finished for PR 14176 at commit
Test build #62452 has finished for PR 14176 at commit
Test build #62482 has finished for PR 14176 at commit
Force-pushed from 461028e to 5fae053
Test build #62934 has finished for PR 14176 at commit
Force-pushed from 5fae053 to 41192e8
Force-pushed from 41192e8 to 7194394
Test build #62944 has finished for PR 14176 at commit
Test build #62952 has finished for PR 14176 at commit
Test build #62990 has finished for PR 14176 at commit
```java
int keySize = 0;
int valueSize = 0;
for (String name : keySchema.fieldNames()) {
  keySize += (keySchema.apply(name).dataType().defaultSize() + 7) / 8 * 8;
```
Let's add a small comment about this implicit ceiling logic and the reason why schema.defaultSize() doesn't work.
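For illustration, here is a minimal sketch (with a helper name I made up) of the rounding that comment would document; the stated rationale — that fixed-width fields occupy 8-byte-aligned slots, as in UnsafeRow — is my assumption from context:

```scala
// Ceiling logic from the reviewed line: round a field's default size up to
// the next multiple of 8 bytes. Summing schema.defaultSize() alone would
// undercount, e.g. IntegerType reports 4 bytes while the row layout is
// assumed to reserve a full 8-byte word per fixed-width field.
def alignedFieldSize(defaultSize: Int): Int = (defaultSize + 7) / 8 * 8

assert(alignedFieldSize(4) == 8)   // 4 rounds up to 8
assert(alignedFieldSize(8) == 8)   // already a multiple of 8
assert(alignedFieldSize(16) == 16) // multiples of 8 pass through unchanged
```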
Added the explicit SQL tests for both hash map implementations. The test suites extend DataFrameAggregateSuite.
Test build #63164 has finished for PR 14176 at commit
```scala
  super.beforeAll()
}

test("SQL decimal test") {
```
can we just add this in DataFrameAggregateSuite?
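As a rough illustration of the pattern under discussion, a suite that flips a fast-map codegen conf before exercising a shared aggregate test might look like the sketch below; the suite name and conf key are placeholders, not identifiers from the patch:

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSQLContext

// Placeholder suite; the PR's real suites extend DataFrameAggregateSuite so
// that every shared aggregate test runs against the chosen hash map impl.
class FastHashMapSmokeSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  test("group-by sum with the fast hash map enabled") {
    // Hypothetical conf key standing in for the flag added by this PR.
    spark.conf.set("spark.sql.codegen.aggregate.map.twolevel.enable", "true")
    val df = Seq((1, 10), (1, 20), (2, 5)).toDF("k", "v")
    checkAnswer(df.groupBy("k").sum("v"), Seq(Row(1, 30L), Row(2, 5L)))
  }
}
```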
LGTM
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DecimalType

abstract class AggregateHashMapSuite extends DataFrameAggregateSuite {
```
As discussed offline, let's just move this into DataFrameAggregateSuite to prevent inadvertent overrides.
Let's hold off on this: if we are going to have a single implementation for the fast hash map (based on the benchmark results in another PR), there's no need to merge this fancy implementation-choosing logic. cc @rxin
@davies @sameeragarwal I updated more results in the benchmark PR #14266.
Thanks for the comments @davies @sameeragarwal. This PR has been updated. Basically, the only public boolean flag now is called
Test build #64387 has finished for PR 14176 at commit
Can we make this
@davies I guess there is still a benefit to making it public? If the user knows that their workload would always run faster with a single-level map, e.g., many distinct keys. I thought about
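For context, flipping such a session-level flag might look like the sketch below; the conf key is hypothetical, standing in for the flag name elided in this thread:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("agg-demo").getOrCreate()

// Hypothetical conf key standing in for the public flag under discussion.
// A workload with many distinct keys, where the bounded first level mostly
// misses, could turn the two-level map off for the session:
spark.conf.set("spark.sql.codegen.aggregate.map.twolevel.enable", "false")

// ...and turn it back on for low-cardinality group-bys:
spark.conf.set("spark.sql.codegen.aggregate.map.twolevel.enable", "true")
```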
```scala
private var vectorizedHashMapTerm: String = _
private var isVectorizedHashMapEnabled: Boolean = _
// The name for Fast HashMap
private var fastHashMapTerm: String = _
```
Should we use a more descriptive name than "fast"? There can always be a faster implementation.
Thanks for the comments @clockfly! As per the discussion with @sameeragarwal, the plan is to give users the option to turn the two-level hashmap on/off; this is why we have the first-level logic for enabling the two-level/fast map. We also want to keep both implementations (vectorized/row-based) for a while before deleting the vectorized one in the future, which leads to the internal flags that pick between the two. If you decide otherwise, I'm happy to update the PR accordingly. @clockfly @sameeragarwal @davies Thanks!
@clockfly as Qifan said, the rationale for not deleting the old vectorized hashmap code in the short term was to let us quickly benchmark and compare the two implementations across a wide variety of workloads. That said, I think the high-level issue is that we don't currently expose good interfaces/hooks in our generated code that could be used to test custom operator implementations while running benchmarks or tests (... and given that these first-level aggregate hashmaps are entirely generated during query compilation, injecting a class that works for all schema types during testing isn't very straightforward).
LGTM, I will merge this one to master (this enables us to do more benchmarks with these two implementations).
@ooq @sameeragarwal, it looks like this patch is the culprit behind some OOMs that I'm observing with random queries; see https://issues.apache.org/jira/browse/SPARK-17405
Thanks. I will take a look tonight.
What changes were proposed in this pull request?
This PR is the second step for the following feature:

For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" in order to boost aggregation performance. Previously, the hashmap was backed by a ColumnarBatch. This has performance issues when the aggregation table has a wide schema (a large number of key fields or value fields).

In this JIRA, we support another implementation of the fast hashmap, which is backed by a RowBatch. We then automatically pick between the two implementations based on certain knobs.

In this second-step PR, we enable RowBasedHashMapGenerator in HashAggregateExec.
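To make the "cache" framing concrete, here is a minimal, self-contained sketch of the two-level lookup described above; the class, its names, and the use of Scala HashMaps are illustrative stand-ins (the real first-level map is code-generated per query and backed by a ColumnarBatch or RowBatch, with a general-purpose map behind it):

```scala
import scala.collection.mutable

// Toy two-level aggregation map: probe a bounded fast map first; entries
// that miss a full fast map go to the general map, and the fast map's
// contents are merged into the general map when results are produced.
class TwoLevelAggMap[K, V](fastCapacity: Int, zero: V, merge: (V, V) => V) {
  private val fast = mutable.HashMap.empty[K, V]    // stand-in for the generated fast map
  private val general = mutable.HashMap.empty[K, V] // stand-in for the general-purpose map

  def update(key: K, value: V): Unit = {
    if (fast.contains(key) || fast.size < fastCapacity) {
      fast(key) = merge(fast.getOrElse(key, zero), value)
    } else {
      general(key) = merge(general.getOrElse(key, zero), value)
    }
  }

  def result(): Map[K, V] = {
    fast.foreach { case (k, v) => general(k) = merge(general.getOrElse(k, zero), v) }
    general.toMap
  }
}

// Usage: sum values per key with the first level capped at two entries.
val m = new TwoLevelAggMap[Int, Long](2, 0L, _ + _)
Seq(1 -> 1L, 2 -> 1L, 3 -> 1L, 1 -> 1L).foreach { case (k, v) => m.update(k, v) }
assert(m.result() == Map(1 -> 2L, 2 -> 1L, 3 -> 1L))
```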
How was this patch tested?
Added tests: RowBasedAggregateHashMapSuite and VectorizedAggregateHashMapSuite. Additional micro-benchmark tests and TPC-DS results will be added in a separate PR in the series.