Member

@sameeragarwal sameeragarwal commented Apr 14, 2016

What changes were proposed in this pull request?

This PR uses a better hashing algorithm while probing the AggregateHashMap:

long h = 0;
h = (h ^ (0x9e3779b9)) + key_1 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_2 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_3 + (h << 6) + (h >>> 2);
...
h = (h ^ (0x9e3779b9)) + key_n + (h << 6) + (h >>> 2);
return h;
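
For illustration only, the unrolled steps above can be written as a loop over the keys. This is a minimal sketch, not the code this patch generates: the class name `HashSketch` and the varargs `hash` helper are hypothetical, and the keys are assumed to be `long` values.

```java
class HashSketch {
    // Hypothetical helper: folds each key into h with the mixing step
    // from the description above.
    static long hash(long... keys) {
        long h = 0;
        for (long key : keys) {
            // 0x9e3779b9 is an int literal, so Java sign-extends it to a
            // long before the XOR, exactly as in the unrolled form.
            h = (h ^ 0x9e3779b9) + key + (h << 6) + (h >>> 2);
        }
        return h;
    }

    public static void main(String[] args) {
        // The same keys in a different order produce different hashes.
        System.out.println(hash(1L, 2L));
        System.out.println(hash(2L, 1L));
    }
}
```

Because each step mixes the running value `h` back in through the shifts, the result depends on key order, which a plain sum or XOR of the keys would not.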

Depends on: #12345

How was this patch tested?

Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
Aggregate w keys:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
codegen = F                              2417 / 2457          8.7         115.2       1.0X
codegen = T hashmap = F                  1554 / 1581         13.5          74.1       1.6X
codegen = T hashmap = T                   877 /  929         23.9          41.8       2.8X
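
As a rough cross-check of how the columns relate, the sketch below recomputes Rate, Per Row and Relative from the best times. It assumes the benchmark ran over 20971520 rows, the size of the Range operator in the plan posted later in this conversation; the class and method names are hypothetical.

```java
class BenchmarkMath {
    // Assumption: row count taken from "Range 0, 1, 1, 20971520" in the plan.
    static final long ROWS = 20971520L;

    // Nanoseconds spent per row for a given best time in milliseconds.
    static double perRowNanos(long bestMillis) {
        return bestMillis * 1_000_000.0 / ROWS;
    }

    // Throughput in millions of rows per second.
    static double rateMillionsPerSec(long bestMillis) {
        return ROWS / (bestMillis / 1000.0) / 1_000_000.0;
    }

    // Speedup relative to the baseline run.
    static double relative(long baselineMillis, long bestMillis) {
        return (double) baselineMillis / bestMillis;
    }

    public static void main(String[] args) {
        long baseline = 2417; // codegen = F
        for (long best : new long[] {2417, 1554, 877}) {
            System.out.printf("%d ms: %.1f M/s, %.1f ns/row, %.1fX%n",
                best, rateMillionsPerSec(best), perRowNanos(best),
                relative(baseline, best));
        }
    }
}
```

Under that assumption the recomputed values agree with the table to within rounding.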

SparkQA commented Apr 14, 2016

Test build #55781 has finished for PR 12379 at commit bbb9663.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 15, 2016

Test build #55877 has finished for PR 12379 at commit 1fc03d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal sameeragarwal force-pushed the hash branch 2 times, most recently from 68af2e4 to 657edf5 Compare April 15, 2016 04:30
@sameeragarwal
Member Author

cc @nongli

@sameeragarwal sameeragarwal changed the title from "[SPARK-14620][SQL][WIP] Use/benchmark a better hash in AggregateHashMap" to "[SPARK-14620][SQL] Use/benchmark a better hash in AggregateHashMap" Apr 15, 2016
@sameeragarwal sameeragarwal changed the title from "[SPARK-14620][SQL] Use/benchmark a better hash in AggregateHashMap" to "[SPARK-14620][SQL] Use/benchmark a better hash in VectorizedHashMap" Apr 15, 2016
SparkQA commented Apr 15, 2016

Test build #55896 has finished for PR 12379 at commit 657edf5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

what does this plan look like?

Member Author

WholeStageCodegen
:  +- TungstenAggregate(key=[k#3L,k#3L], functions=[(sum(id#0L),mode=Final,isDistinct=false)], output=[k#3L,k#3L,sum(id)#183L])
:     +- INPUT
+- Exchange hashpartitioning(k#3L, k#3L, 1), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[k#3L,k#3L], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[k#3L,k#3L,sum#185L])
      :     +- Project [id#0L,FLOOR((rand(-9053518532274118725) * 10000.0)) AS k#3L]
      :        +- Range 0, 1, 1, 20971520, [id#0L]

Contributor

I feel you're spending a lot of time in rand

Member Author

I see. Doesn't registerTempTable materialize the table before registering it? If not, is there a way to do that?

Contributor

nongli commented Apr 15, 2016

LGTM

SparkQA commented Apr 15, 2016

Test build #55962 has finished for PR 12379 at commit 1562d46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 4df6518 Apr 15, 2016
lw-lin pushed a commit to lw-lin/spark that referenced this pull request Apr 20, 2016

Author: Sameer Agarwal

Closes apache#12379 from sameeragarwal/hash.