SPARK-3211 .take() is OOM-prone with empty partitions #2117
Conversation
Instead of jumping straight from 1 partition to all partitions, use exponential growth: double the number of partitions to attempt each time. Fix proposed by Paul Nepywoda
QA tests have started for PR 2117 at commit
QA tests have finished for PR 2117 at commit
Failed unit test looks unrelated.
Why would this be OOM-prone? This would only cause us to run a longer job than we need if it turns out the first few partitions actually do contain enough results.
Edit: sorry, I forgot that we only continue with unscanned partitions, so this would not cause re-scanning of prior partitions. It may still cause a significant slowdown on an underutilized cluster, however, since we would have log(n) waves of computation over our RDD rather than just 2, and it may cost as much time to scan 2 partitions as it does to scan n partitions (for n cores in the cluster).
The reason it's OOM-prone: if take() on the first partition returns 0 results, the code then runs take(n) on all the remaining partitions and brings the results back to the driver, taking from that set of per-partition take()s until it fills the n requested results. In that code path, the contents of allPartitions.map(_.take(left)) are in the driver's memory all at once. In my situation I had 10k partitions, the first one was empty, and I did a take(300). So after the first partition returned 0 results, this pulled 300 results from each of the remaining 9,999 partitions, which OOM'd the driver. This patch isn't bulletproof: it relies on the heuristic that you'll reach a doubling that contains enough rows to fulfill the take() before the onslaught of too much data causes an OOM. An alternative approach would be to evaluate the take(n) across the cluster and use the new toLocalIterator call, which didn't exist when this code was first written. That would address your observation that 2 waves of computation ought to be faster than log(n) waves across the cluster. I'm thinking something more like:
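A minimal Scala sketch of that toLocalIterator idea (the helper name takeViaLocalIterator is assumed here for illustration, not from the original):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch only: toLocalIterator fetches one partition at a time, so the
// driver holds at most a single partition's rows while collecting the
// first n elements, instead of take(n) results from every partition.
def takeViaLocalIterator[T: ClassTag](rdd: RDD[T], n: Int): Array[T] =
  rdd.toLocalIterator.take(n).toArray
```

The tradeoff is the one noted above: this runs one job per partition until n elements are found, so it favors driver memory over wall-clock time on a large cluster.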
This looks good to me as is because it only happens in the case where the first partition had 0 elements. @aarondav what do you think? This case will be pretty uncommon, and the other code in take() already does a smarter increase. @ash211 please also fix this in Python, in python/pyspark/rdd.py. The code there is almost identical.
Alright, the current solution seems reasonable to me. Maybe we can increase the scale factor to something like 4 instead of 2, as a minor speedup aimed at the case where there are no results, but this is relatively unimportant.
Yeah 4 sounds good. The reason this code was added in the first place was that take() on filtered datasets was super slow because it originally took only one partition at a time. |
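For reference, a simplified Scala sketch of the scan-growth strategy being discussed — not the exact patched code; runTake stands in for the Spark job that returns up to k rows from each partition in a range, and the smarter estimate used once some rows have been found is omitted:

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified sketch: scan 1 partition first, then quadruple the number of
// partitions attempted on each subsequent pass until `num` rows are found.
def takeSketch[T](totalParts: Int, num: Int,
                  runTake: (Range, Int) => Seq[Array[T]]): Seq[T] = {
  val buf = new ArrayBuffer[T]
  var partsScanned = 0
  while (buf.size < num && partsScanned < totalParts) {
    // Quadrupling bounds the number of waves at O(log n) while keeping the
    // worst-case over-scan within a constant factor of what was needed.
    val numPartsToTry = if (partsScanned == 0) 1 else partsScanned * 4
    val p = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
    runTake(p, num - buf.size).foreach(part => buf ++= part.take(num - buf.size))
    partsScanned += numPartsToTry
  }
  buf.toSeq
}
```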
Speedup occurs when calling take() on an RDD with most partitions empty
Fixed the same in Python, corrected the comment, and switched from *2 to *4 for a faster take() on mostly-empty partitions. Let's see how the unit tests go.
QA tests have started for PR 2117 at commit
QA tests have finished for PR 2117 at commit
Failed unit test seems unrelated.
@ash211 This sounds like it's related to the OOM reported in SPARK-3333, though after reporting the issue there I was unable to repro the OOM. (There were other issues in the mix in that JIRA report that we were able to repro.) Do you have a simple script that can cause an OOM due to this empty-partitions take() behavior?
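A hypothetical repro sketch along those lines, built from the scenario described earlier in the thread (10k partitions, empty first partition, take(300) against the pre-patch code); sc is assumed to be an existing SparkContext and the data sizes are illustrative:

```scala
// Hypothetical repro (pre-patch behavior): partition 0 is empty, so take(300)
// falls through to scanning all remaining partitions and buffers up to 300
// rows from each of the 9,999 non-empty partitions on the driver at once.
val rdd = sc.parallelize(1 to 100000000, 10000).mapPartitionsWithIndex {
  (idx, it) => if (idx == 0) Iterator.empty else it
}
rdd.take(300)
```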
The OOM was probably due to the Spark driver having too little memory, since a job with a huge number of tasks can have a large working set in the scheduler. I think the default is 512 megabytes. To increase the driver memory, try setting the spark.driver.memory configuration property.
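As an illustration of that suggestion (the spark.driver.memory property and the spark-submit --driver-memory flag are real; the app name and the 2g figure here are just examples):

```scala
import org.apache.spark.SparkConf

// Illustrative only: request a 2g driver heap. In client mode the driver JVM
// is already running by the time SparkConf is read, so in practice this is
// usually set via `spark-submit --driver-memory 2g` or spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("take-repro") // hypothetical app name
  .set("spark.driver.memory", "2g")
```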
Jenkins, test this please
Sorry for the delay on this -- fix looks good but I'd like it to run through Jenkins, and Jenkins has been down today.
Jenkins, test this please
@nchammas I'm guessing your OOM issue is unrelated to this one. After the reduceByKey above, you'd have 24000 partitions and only 2 entries among them: (4, "NickJohn") and (3, "Bob"). This bug manifests when you have an empty partition 0 and many remaining partitions, each with a large amount of data: the take(n) pulls up to the first n elements from each remaining partition and then takes the first n from the concatenation of those arrays. For this bug to take effect in your situation you'd have to have an empty first partition (a good 23998/24000 chance). The driver would then bring into memory 23998 empty arrays and 2 arrays of size 1 (or maybe 1 array of size 2), which I can't imagine would OOM the driver. So I don't think this is your bug. The other evidence is that you observed a regression (at least in the perf numbers later in your bug report), while this behavior has been the same for quite some time: it was implemented in commit 42571d3 and first released in version 0.9.
Regarding the merge, I'm guessing this is too late to land in the Spark 1.1 release. Is it a candidate for a backport to a 1.1.x? |
@ash211 Thank you for explaining that. |
Jenkins, test this please
@ash211 since this is a bug fix it seems fine to put it into 1.1.1. |
QA tests have started for PR 2117 at commit
QA tests have finished for PR 2117 at commit
Thanks Andrew -- merged this in. |
Instead of jumping straight from 1 partition to all partitions, do exponential growth and double the number of partitions to attempt each time instead. Fix proposed by Paul Nepywoda

Author: Andrew Ash <[email protected]>

Closes #2117 from ash211/SPARK-3211 and squashes the following commits:

8b2299a [Andrew Ash] Quadruple instead of double for a minor speedup
e5f7e4d [Andrew Ash] Update comment to better reflect what we're doing
09a27f7 [Andrew Ash] Update PySpark to be less OOM-prone as well
3a156b8 [Andrew Ash] SPARK-3211 .take() is OOM-prone with empty partitions

(cherry picked from commit ba5bcad)
Signed-off-by: Matei Zaharia <[email protected]>