[SPARK-2538] [PySpark] Hash based disk spilling aggregation #1460
Conversation
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
QA tests have started for PR 1460. This patch merges cleanly.
python/pyspark/rdd.py
Outdated
This should actually rotate among storage directories in spark.local.dir. Check out how the DiskStore works in Java.
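A minimal sketch of one way to spread spill files across the directories configured in spark.local.dir, similar to how the JVM DiskStore hashes a file name onto one of several local directories. The helper name and the environment variable it reads are assumptions for illustration, not the PR's code.

```python
# Hypothetical helper (not the PR's code): spread spill files across the
# local directories, similar to how the JVM DiskStore hashes a file name
# onto one of several configured directories.
import os

def get_spill_dir(name):
    # Assumes the worker sees the local directories via an environment
    # variable such as SPARK_LOCAL_DIRS, e.g. "/mnt1/spark,/mnt2/spark".
    dirs = os.environ.get("SPARK_LOCAL_DIRS", "/tmp").split(",")
    d = dirs[hash(name) % len(dirs)]              # pick a directory by hash
    path = os.path.join(d, "python-spill", str(os.getpid()), str(name))
    if not os.path.exists(path):
        os.makedirs(path)
    return path
```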
QA results for PR 1460:
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
It looks like pushing a new rebased commit hid my comments, but click on them above to make sure you see them.
Add spark.python.worker.memory for the memory used by the Python worker. The default is 512m.
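For illustration, a minimal sketch of setting this option from PySpark; the master, app name, and value are examples, not part of the PR.

```python
# Illustrative configuration: spark.python.worker.memory bounds the memory an
# aggregation may use in the Python worker before it starts spilling to disk.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[2]")                       # example master
        .setAppName("spill-demo")                    # example app name
        .set("spark.python.worker.memory", "512m"))  # raise (e.g. "1g") to spill later
sc = SparkContext(conf=conf)
```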
QA tests have started for PR 1460. This patch merges cleanly.
support multiple local directories
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
QA results for PR 1460:
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
docs/configuration.md
Outdated
Small typo: go -> goes
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
The last commit has fixed the tests; it should be run again.
Looks like the latest tested code has an error in the test suite:
Ah, never mind.
QA results for PR 1460:
python/pyspark/shuffle.py
Outdated
Unfortunately memory_info only works in psutil 2.0. I tried the Anaconda Python distribution on Mac, which has psutil 1.2.1, and it doesn't work there. There you have to use get_memory_info() instead.
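A small compatibility shim along these lines; this is a sketch with a made-up helper name, assuming psutil is installed in either version.

```python
# Illustrative shim: psutil 2.x exposes Process.memory_info(), while psutil
# 1.x only has Process.get_memory_info(), so fall back when the new name is
# missing and report the resident set size in megabytes.
import os
import psutil

def get_used_memory_mb():
    process = psutil.Process(os.getpid())
    if hasattr(process, "memory_info"):
        info = process.memory_info()      # psutil >= 2.0
    else:
        info = process.get_memory_info()  # psutil 1.x
    return info.rss >> 20                 # RSS in MB
```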
Also, maybe don't call the process "self"; it's kind of confusing since it sounds like a "this" object.
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
If you want these unit tests to be run by Jenkins, you need to also call this file in python/run-tests. Seems worthwhile since there are some tests in ExternalMerger.
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
Hey Davies, I tried this out a bit and saw two issues / areas for improvement:
The best way to fix this would be to hash values with a random hash function when choosing the bucket. One simple way might be to generate a random integer X for each ExternalMerger and then take hash((key, X)) instead of hash(key) when choosing the bucket. This is equivalent to salting your hash function. Maybe you have other ideas, but I'd suggest trying this first.
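A minimal sketch of that salting idea; the class and method names are hypothetical, not the PR's ExternalMerger code.

```python
# Illustrative salted partitioner: each merger draws a random salt X and
# buckets keys with hash((key, X)) instead of hash(key), so a skewed or
# adversarial key set does not always land in the same spill partition.
import random

class SaltedPartitioner(object):
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.salt = random.randint(0, 1 << 30)   # the random integer X

    def get_partition(self, key):
        # equivalent to salting the hash function
        return hash((key, self.salt)) % self.num_partitions
```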
BTW here's a patch that adds the GC calls I talked about above: https://gist.github.com/mateiz/297b8618ed033e7c8005
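The gist itself isn't reproduced here; as a rough illustration of the kind of change it describes (per the squashed commit message below, which adds gc.collect() after data.clear()), the spill path clears the in-memory map and then forces a collection.

```python
# Illustrative only (not the gist's contents): after dumping the in-memory
# map to disk, clear it and force a garbage collection so CPython releases
# the freed objects before the next round of aggregation.
import gc

def after_spill(data):
    data.clear()   # drop references to the spilled objects
    gc.collect()   # reclaim them eagerly rather than waiting for the GC
```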
QA tests have started for PR 1460. This patch merges cleanly.
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
QA results for PR 1460:
Thanks Davies. I've merged this in.
Awesome!
During aggregation in the Python worker, if the memory usage is above spark.executor.memory, it will do disk-spilling aggregation. It will split the aggregation into multiple stages; in each stage, it will partition the aggregated data by hash and dump them to disk. After all the data are aggregated, it will merge all the stages together (partition by partition).

Author: Davies Liu <[email protected]>

Closes apache#1460 from davies/spill and squashes the following commits:

cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
37d71f7 [Davies Liu] balance the partitions
902f036 [Davies Liu] add shuffle.py into run-tests
dcf03a9 [Davies Liu] fix memory_info() of psutil
67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
400be01 [Davies Liu] address all the comments
6178844 [Davies Liu] refactor and improve docs
fdd0a49 [Davies Liu] add long doc string for ExternalMerger
1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
e6cc7f9 [Davies Liu] Merge branch 'master' into spill
3652583 [Davies Liu] address comments
e78a0a0 [Davies Liu] fix style
24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
57ee7ef [Davies Liu] update docs
286aaff [Davies Liu] let spilled aggregation in Python configurable
e9a40f6 [Davies Liu] recursive merger
6edbd1f [Davies Liu] Hash based disk spilling aggregation
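To make the description above concrete, here is a minimal, self-contained sketch of the same idea. It is illustrative only: the class and method names are made up, and the real ExternalMerger in python/pyspark/shuffle.py is more sophisticated (memory-based limits, recursive merging, salted hashing).

```python
# Minimal sketch of hash-based spilling aggregation: aggregate in memory until
# a limit is hit, then partition the in-memory map by hash of the key and
# append each partition to its own spill file; at the end, merge the spill
# files one partition at a time so memory stays bounded.
import os
import pickle
import tempfile

class SimpleExternalMerger(object):
    def __init__(self, combine, limit=100000, partitions=4):
        self.combine = combine        # (old_value, new_value) -> merged value
        self.limit = limit            # max in-memory keys before spilling
        self.partitions = partitions
        self.path = tempfile.mkdtemp()
        self.spills = 0
        self.data = {}

    def merge(self, iterator):
        for key, value in iterator:
            self.data[key] = self.combine(self.data[key], value) if key in self.data else value
            if len(self.data) >= self.limit:
                self._spill()

    def _spill(self):
        # write each hash partition of the current map to its own file
        streams = [open(os.path.join(self.path, "%d-%d" % (self.spills, i)), "wb")
                   for i in range(self.partitions)]
        for key, value in self.data.items():
            pickle.dump((key, value), streams[hash(key) % self.partitions])
        for s in streams:
            s.close()
        self.data.clear()
        self.spills += 1

    def items(self):
        if self.spills == 0:
            return iter(self.data.items())
        self._spill()                 # flush what is left in memory
        return self._merged_items()

    def _merged_items(self):
        # merge the spill files partition by partition, so only one partition
        # is held in memory at a time (temp-file cleanup is omitted here)
        for i in range(self.partitions):
            merged = {}
            for j in range(self.spills):
                with open(os.path.join(self.path, "%d-%d" % (j, i)), "rb") as f:
                    while True:
                        try:
                            key, value = pickle.load(f)
                        except EOFError:
                            break
                        merged[key] = self.combine(merged[key], value) if key in merged else value
            for kv in merged.items():
                yield kv
```

For example, `m = SimpleExternalMerger(lambda a, b: a + b)`, followed by `m.merge((w, 1) for w in words)` and `dict(m.items())`, performs a word count while keeping at most one hash partition of the spilled data in memory during the final merge.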