[SPARK-2538] [PySpark] Hash based disk spilling aggregation #1460
Conversation
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
QA tests have started for PR 1460. This patch merges cleanly.
python/pyspark/rdd.py
Outdated
This should actually rotate among storage directories in spark.local.dir. Check out how the DiskStore works in Java.
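A minimal sketch of one way to spread spill files across the directories configured in spark.local.dir, similar to how the JVM DiskStore hashes a file name onto one of several local directories. The helper name and the environment variable it reads are assumptions for illustration, not the PR's code.

```python
# Hypothetical helper (not the PR's code): spread spill files across the
# local directories, similar to how the JVM DiskStore hashes a file name
# onto one of several configured directories.
import os

def get_spill_dir(name):
    # Assumes the worker sees the local directories via an environment
    # variable such as SPARK_LOCAL_DIRS, e.g. "/mnt1/spark,/mnt2/spark".
    dirs = os.environ.get("SPARK_LOCAL_DIRS", "/tmp").split(",")
    d = dirs[hash(name) % len(dirs)]              # pick a directory by hash
    path = os.path.join(d, "python-spill", str(os.getpid()), str(name))
    if not os.path.exists(path):
        os.makedirs(path)
    return path
```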
QA results for PR 1460:
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
It looks like pushing a new rebased commit hid my comments, but click on them above to make sure you see them.
Add spark.python.worker.memory for the memory used by the Python worker. The default is 512m.
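For illustration, a minimal sketch of setting this option from PySpark; the master, app name, and value are examples, not part of the PR.

```python
# Illustrative configuration: spark.python.worker.memory bounds the memory an
# aggregation may use in the Python worker before it starts spilling to disk.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[2]")                       # example master
        .setAppName("spill-demo")                    # example app name
        .set("spark.python.worker.memory", "512m"))  # raise (e.g. "1g") to spill later
sc = SparkContext(conf=conf)
```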
QA tests have started for PR 1460. This patch merges cleanly.
support multiple local directories
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
QA results for PR 1460:
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
docs/configuration.md
Outdated
Small typo: go -> goes
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
The last commit has fixed the tests; it should be run again.
Looks like the latest tested code has an error in the test suite:
Ah, never mind.
QA results for PR 1460:
python/pyspark/shuffle.py
Outdated
Unfortunately memory_info only works in psutil 2.0. I tried the Anaconda Python distribution on Mac, which has psutil 1.2.1, and it doesn't work there. There you have to use get_memory_info() instead.
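A small compatibility shim along these lines; this is a sketch with a made-up helper name, assuming psutil is installed in either version.

```python
# Illustrative shim: psutil 2.x exposes Process.memory_info(), while psutil
# 1.x only has Process.get_memory_info(), so fall back when the new name is
# missing and report the resident set size in megabytes.
import os
import psutil

def get_used_memory_mb():
    process = psutil.Process(os.getpid())
    if hasattr(process, "memory_info"):
        info = process.memory_info()      # psutil >= 2.0
    else:
        info = process.get_memory_info()  # psutil 1.x
    return info.rss >> 20                 # RSS in MB
```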
Also, maybe don't call the process "self"; it's kind of confusing since it sounds like a "this" object.
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
If you want these unit tests to be run by Jenkins, you need to also call this file in python/run-tests. Seems worthwhile since there are some tests in ExternalMerger.
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
Hey Davies, I tried this out a bit and saw two issues / areas for improvement:
The best way to fix this would be to hash values with a random hash function when choosing the bucket. One simple way might be to generate a random integer X for each ExternalMerger and then take hash((key, X)) instead of hash(key) when choosing the bucket. This is equivalent to salting your hash function. Maybe you have other ideas, but I'd suggest trying this first.
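A minimal sketch of that salting idea; the class and method names are hypothetical, not the PR's ExternalMerger code.

```python
# Illustrative salted partitioner: each merger draws a random salt X and
# buckets keys with hash((key, X)) instead of hash(key), so a skewed or
# adversarial key set does not always land in the same spill partition.
import random

class SaltedPartitioner(object):
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.salt = random.randint(0, 1 << 30)   # the random integer X

    def get_partition(self, key):
        # equivalent to salting the hash function
        return hash((key, self.salt)) % self.num_partitions
```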
BTW here's a patch that adds the GC calls I talked about above: https://gist.github.com/mateiz/297b8618ed033e7c8005
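The gist itself isn't reproduced here; as a rough illustration of the kind of change it describes (per the squashed commit message below, which adds gc.collect() after data.clear()), the spill path clears the in-memory map and then forces a collection.

```python
# Illustrative only (not the gist's contents): after dumping the in-memory
# map to disk, clear it and force a garbage collection so CPython releases
# the freed objects before the next round of aggregation.
import gc

def after_spill(data):
    data.clear()   # drop references to the spilled objects
    gc.collect()   # reclaim them eagerly rather than waiting for the GC
```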
QA tests have started for PR 1460. This patch merges cleanly.
QA tests have started for PR 1460. This patch merges cleanly.
QA results for PR 1460:
QA results for PR 1460:
Thanks Davies. I've merged this in.
Awesome!
During aggregation in the Python worker, if the memory usage is above spark.executor.memory, it will do disk-spilling aggregation. It will split the aggregation into multiple stages; in each stage, it will partition the aggregated data by hash and dump them to disk. After all the data are aggregated, it will merge all the stages together (partition by partition).

Author: Davies Liu <[email protected]>

Closes apache#1460 from davies/spill and squashes the following commits:

cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
37d71f7 [Davies Liu] balance the partitions
902f036 [Davies Liu] add shuffle.py into run-tests
dcf03a9 [Davies Liu] fix memory_info() of psutil
67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
400be01 [Davies Liu] address all the comments
6178844 [Davies Liu] refactor and improve docs
fdd0a49 [Davies Liu] add long doc string for ExternalMerger
1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
e6cc7f9 [Davies Liu] Merge branch 'master' into spill
3652583 [Davies Liu] address comments
e78a0a0 [Davies Liu] fix style
24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
57ee7ef [Davies Liu] update docs
286aaff [Davies Liu] let spilled aggregation in Python configurable
e9a40f6 [Davies Liu] recursive merger
6edbd1f [Davies Liu] Hash based disk spilling aggregation
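To make the description above concrete, here is a minimal, self-contained sketch of the same idea. It is illustrative only: the class and method names are made up, and the real ExternalMerger in python/pyspark/shuffle.py is more sophisticated (memory-based limits, recursive merging, salted hashing).

```python
# Minimal sketch of hash-based spilling aggregation: aggregate in memory until
# a limit is hit, then partition the in-memory map by hash of the key and
# append each partition to its own spill file; at the end, merge the spill
# files one partition at a time so memory stays bounded.
import os
import pickle
import tempfile

class SimpleExternalMerger(object):
    def __init__(self, combine, limit=100000, partitions=4):
        self.combine = combine        # (old_value, new_value) -> merged value
        self.limit = limit            # max in-memory keys before spilling
        self.partitions = partitions
        self.path = tempfile.mkdtemp()
        self.spills = 0
        self.data = {}

    def merge(self, iterator):
        for key, value in iterator:
            self.data[key] = self.combine(self.data[key], value) if key in self.data else value
            if len(self.data) >= self.limit:
                self._spill()

    def _spill(self):
        # write each hash partition of the current map to its own file
        streams = [open(os.path.join(self.path, "%d-%d" % (self.spills, i)), "wb")
                   for i in range(self.partitions)]
        for key, value in self.data.items():
            pickle.dump((key, value), streams[hash(key) % self.partitions])
        for s in streams:
            s.close()
        self.data.clear()
        self.spills += 1

    def items(self):
        if self.spills == 0:
            return iter(self.data.items())
        self._spill()                 # flush what is left in memory
        return self._merged_items()

    def _merged_items(self):
        # merge the spill files partition by partition, so only one partition
        # is held in memory at a time (temp-file cleanup is omitted here)
        for i in range(self.partitions):
            merged = {}
            for j in range(self.spills):
                with open(os.path.join(self.path, "%d-%d" % (j, i)), "rb") as f:
                    while True:
                        try:
                            key, value = pickle.load(f)
                        except EOFError:
                            break
                        merged[key] = self.combine(merged[key], value) if key in merged else value
            for kv in merged.items():
                yield kv
```

For example, `m = SimpleExternalMerger(lambda a, b: a + b)`, followed by `m.merge((w, 1) for w in words)` and `dict(m.items())`, performs a word count while keeping at most one hash partition of the spilled data in memory during the final merge.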