Conversation

@ooq (Contributor) commented Jul 13, 2016

What changes were proposed in this pull request?

This PR is the second step for the following feature:

For hash aggregation in Spark SQL, we use a fast aggregation hashmap as a "cache" to boost aggregation performance. Previously, this hashmap was backed by a ColumnarBatch, which has performance issues when the aggregation table has a wide schema (a large number of key or value fields).
In this JIRA, we support another fast-hashmap implementation, backed by a RowBatch, and automatically pick between the two implementations based on certain knobs.

In this second-step PR, we enable RowBasedHashMapGenerator in HashAggregateExec.
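For intuition, the two-level scheme described above can be sketched as follows. This is a minimal Python illustration, not Spark's generated code: the class name, the numeric-sum aggregate, the tiny capacity, and the dict-based fallback are all illustrative assumptions; Spark's real fast map is generated per query, and its fallback is the regular BytesToBytesMap-based path.

```python
# Sketch of a two-level aggregation hashmap: a small, bounded first-level
# "fast" map acts as a cache, and keys that do not fit spill to a slower,
# unbounded fallback map. Both levels hold partial aggregates that are
# merged at the end.
class TwoLevelAggMap:
    def __init__(self, fast_capacity=4):
        self.fast = {}                    # bounded first-level map
        self.fallback = {}                # unbounded second-level map
        self.fast_capacity = fast_capacity

    def add(self, key, value):
        if key in self.fast:
            self.fast[key] += value
        elif len(self.fast) < self.fast_capacity:
            self.fast[key] = value
        else:
            # First level is full: aggregate into the fallback map instead.
            self.fallback[key] = self.fallback.get(key, 0) + value

    def result(self):
        # Merge partial aggregates from both levels.
        merged = dict(self.fast)
        for k, v in self.fallback.items():
            merged[k] = merged.get(k, 0) + v
        return merged
```

The sketch only mirrors the control flow; the point of the PR series is that the *layout* of the first-level map (row-based vs. columnar) matters for wide schemas.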

How was this patch tested?

Added tests: RowBasedAggregateHashMapSuite and VectorizedAggregateHashMapSuite.
Additional micro-benchmark and TPCDS results will be added in a separate PR in the series.

@ooq (Contributor, Author) commented Jul 13, 2016

cc @sameeragarwal @davies @rxin

SparkQA commented Jul 13, 2016

Test build #62227 has finished for PR 14176 at commit a3360e0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 13, 2016

Test build #62229 has finished for PR 14176 at commit 9b0b294.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 14, 2016

Test build #62349 has finished for PR 14176 at commit 225b661.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 15, 2016

Test build #62382 has finished for PR 14176 at commit a158125.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 18, 2016

Test build #62440 has finished for PR 14176 at commit ecff4ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 18, 2016

Test build #62452 has finished for PR 14176 at commit ce72d90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 18, 2016

Test build #62482 has finished for PR 14176 at commit 461028e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ooq force-pushed the rowbasedfastaggmap-pr2 branch from 461028e to 5fae053 on July 27, 2016 19:13
SparkQA commented Jul 27, 2016

Test build #62934 has finished for PR 14176 at commit 5fae053.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ooq force-pushed the rowbasedfastaggmap-pr2 branch from 5fae053 to 41192e8 on July 27, 2016 21:42
@ooq force-pushed the rowbasedfastaggmap-pr2 branch from 41192e8 to 7194394 on July 27, 2016 21:44
SparkQA commented Jul 27, 2016

Test build #62944 has finished for PR 14176 at commit 7194394.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 28, 2016

Test build #62952 has finished for PR 14176 at commit 122cf18.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 29, 2016

Test build #62990 has finished for PR 14176 at commit def94cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

int keySize = 0;
int valueSize = 0;
for (String name : keySchema.fieldNames()) {
keySize += (keySchema.apply(name).dataType().defaultSize() + 7) / 8 * 8;
Review comment (Member):
Let's add a small comment about this implicit ceiling logic and the reason why schema.defaultSize() doesn't work.
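For context, the `(x + 7) / 8 * 8` expression in the snippet above rounds each field's default size up to the next multiple of 8 bytes via integer division, which is why the total can exceed the plain sum of `defaultSize()` values: for example, a 4-byte integer field still occupies a full 8-byte aligned slot. A small Python illustration of the same arithmetic (`//` here matches Java's `int` division):

```python
def aligned_size(default_size):
    # Implicit ceiling: round up to the next multiple of 8 bytes,
    # mirroring the Java expression (defaultSize + 7) / 8 * 8.
    return (default_size + 7) // 8 * 8

assert aligned_size(4) == 8    # 4-byte field -> one 8-byte slot
assert aligned_size(8) == 8    # already aligned: unchanged
assert aligned_size(12) == 16  # 12 bytes -> two 8-byte slots
```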

@ooq (Contributor, Author) commented Aug 3, 2016

Added explicit SQL tests for both hashmap implementations. The test suites extend DataFrameAggregateSuite and reuse all the tests there. Of the two bugs that failed previous builds, the length bug would be caught by those tests; the decimal bug is covered by an added "SQL decimal test".

SparkQA commented Aug 3, 2016

Test build #63164 has finished for PR 14176 at commit b32cb7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

super.beforeAll()
}

test("SQL decimal test") {
Review comment (Member):
Can we just add this in DataFrameAggregateSuite?

@sameeragarwal (Member) commented:
LGTM

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DecimalType

abstract class AggregateHashMapSuite extends DataFrameAggregateSuite {
Review comment (Member):
As discussed offline, let's just move this in DataFrameAggregateSuite to prevent inadvertent overrides.

@davies (Contributor) commented Aug 5, 2016

Let's hold off on this. If we are going to have a single implementation for the fast hash map (based on the benchmark results in the other PR), we don't need to merge this fancy implementation-choosing logic. cc @rxin

@ooq (Contributor, Author) commented Aug 5, 2016

@davies @sameeragarwal I updated the benchmark PR #14266 with more results.

@ooq (Contributor, Author) commented Aug 25, 2016

Thanks for the comments @davies @sameeragarwal. This PR has been updated: the only public boolean flag is now spark.sql.codegen.aggregate.map.twolevel.enable. There is a separate non-public flag, spark.sql.codegen.aggregate.map.vectorized.enable, that allows testing and benchmarking the vectorized hashmap before we remove it completely.
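For reference, turning on the knobs described in this comment might look like the sketch below in a (Py)Spark session. The flag names are taken from the comment itself, but treat the snippet as an illustrative configuration fragment rather than a tested recipe; it assumes an existing SparkSession bound to `spark`.

```python
# Illustrative only: assumes an existing SparkSession bound to `spark`.
# Public flag: enable the two-level (fast + regular) aggregation hashmap.
spark.conf.set("spark.sql.codegen.aggregate.map.twolevel.enable", "true")
# Non-public flag: switch the fast map to the vectorized implementation,
# intended only for testing/benchmarking before it is removed.
spark.conf.set("spark.sql.codegen.aggregate.map.vectorized.enable", "true")
```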

SparkQA commented Aug 25, 2016

Test build #64387 has finished for PR 14176 at commit a58314c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor) commented Aug 25, 2016

Can we make spark.sql.codegen.aggregate.map.twolevel.enable internal? Otherwise we should find a better name.

@ooq (Contributor, Author) commented Aug 25, 2016

@davies I guess there is still a benefit to making it public? If users know that their workload would always run faster with a single-level map, e.g., many distinct keys. I thought about spark.sql.codegen.aggregate.map.fast.enable or spark.sql.codegen.aggregate.map.codegen.enable, but neither captures the fact that the biggest distinction is the two-level design.

private var vectorizedHashMapTerm: String = _
private var isVectorizedHashMapEnabled: Boolean = _
// The name for Fast HashMap
private var fastHashMapTerm: String = _
Review comment (Contributor):
Should we use a more descriptive name than "fast"? There can always be a faster implementation.

@ooq (Contributor, Author) commented Aug 29, 2016

Thanks for the comments @clockfly! As per discussion with @sameeragarwal, I think the plan is to give users the option to turn the two-level hashmap on/off, which is why we have the first-level logic for enabling the two-level/fast map. We also want to keep both implementations (vectorized and row-based) for a while before deleting the vectorized one in the future, which leads to the internal flag that picks between the two. If you decide otherwise, I'm happy to update the PR accordingly. @clockfly @sameeragarwal @davies Thanks!

@sameeragarwal (Member) commented:
@clockfly as Qifan said, the rationale for not deleting the old vectorized hashmap code in the short term was to enable us to quickly benchmark and compare the two implementations across a wide variety of workloads.

That said, I think the high-level issue is that we don't currently expose a good interface/hooks in our generated code that can be used to test custom operator implementations while running benchmarks or tests (... and given that these first-level aggregate hashmaps are entirely generated during query compilation, injecting a class that works for all schema types during testing isn't very straightforward).

@davies (Contributor) commented Sep 1, 2016

LGTM, I will merge this one to master (it enables us to do more benchmarks with these two implementations).

asfgit closed this in 03d77af on Sep 1, 2016
@JoshRosen (Contributor) commented:
@ooq @sameeragarwal, it looks like this patch is the culprit behind some OOMs that I'm observing with random queries; see https://issues.apache.org/jira/browse/SPARK-17405

@ooq (Contributor, Author) commented Sep 6, 2016

Thanks. I will take a look tonight.

