[SPARK-14363] Fix executor OOM due to memory leak in the Sorter #12285
Conversation
cc @davies
ok to test
Test build #55544 has finished for PR 12285 at commit
@andrewor14 - Thanks for taking a look. Handled the test case failures.
Test build #55561 has finished for PR 12285 at commit
@andrewor14 - Seems like some transient Jenkins failure. Can we rerun the test?
@sitalkedia I think this is not a memory leak, it just does not release the memory as soon as possible. What does your plan look like?
@davies - Thanks for looking into it. I agree with you that it's not a memory leak, since that memory may be used later. However, not shrinking the pointer array back to its initial size after a spill causes heavy memory underutilization, because the tasks cannot get enough memory to store their records, and this often leads to executor OOM. Also, I don't see any reason to keep the bloated pointer array if we are spilling all the data to disk and have nothing left to store in it. This change restores the sorter's behavior from before PR #9241 (see https://github.com/apache/spark/pull/9241/files#diff-3eedc75de4787b842477138d8cc7f150L321). The physical plan looks something like this -
In your case, inside the sort the key has 4 columns and the row has 6 columns, so each pair needs about 90 bytes, while the array used by the sort needs 16 bytes per record, so the array should account for roughly 15% of the memory used by execution. In the worst case, freeing the array only at the end could waste about 15% of the memory; how can that make such a big difference? If your data set is huge and requires spilling multiple times, the size of the spilled data should be similar each time, so the required array size should also be similar. If we free the array only at the end, we don't need to grow it between two spills (growing requires 50% more memory for the array); that's the reason I changed it to free the array at the end. The reason your job will OOM is the memory used by the Hive UDAF. I agree that the current patch is good (it tries to free memory as eagerly as it can); I'm just trying to understand more, so please correct me if something is wrong.
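A quick back-of-the-envelope check of that 15% figure, using the byte counts quoted above (an illustration only, not measured numbers):

```java
public class PointerArrayShare {
    public static void main(String[] args) {
        // Figures quoted in the comment above: ~90 bytes per serialized key/row pair,
        // ~16 bytes per record in the sort pointer array.
        double bytesPerRecord = 90.0;
        double bytesPerPointerEntry = 16.0;
        double arrayShare = bytesPerPointerEntry / (bytesPerRecord + bytesPerPointerEntry);
        // Prints roughly 0.15, i.e. the pointer array is ~15% of execution memory.
        System.out.printf("pointer array share of execution memory: %.2f%n", arrayShare);
    }
}
```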
@davies Thanks for the explanation, your calculation makes sense. You are right that freeing the array can only make a difference of about 15% in the ideal case, but what we are experiencing is something different. Consider the following scenario: we have a total of 10 GB of shuffle memory available for 5 tasks, so in the ideal situation each task is assigned 2 GB of shuffle memory, of which roughly 300 MB goes to the pointer array and the rest to storing the records. Now suppose 3 of the tasks finish at the same time, and before the driver can run additional tasks on the executor, the remaining 2 running tasks aggressively expand their memory and take up to 5 GB of shuffle memory each, growing their pointer arrays to around 750 MB. When the driver then runs 3 more tasks on the executor, the previous 2 tasks are forced to spill, but the 750 MB pointer arrays are never freed. This results in heavy underutilization of memory for the new tasks, and in cases where the pointer array actually grew beyond the task's fair share of memory it results in executor OOM, killing all the other tasks on the executor. The job we are running processes a data set of more than 50 TB, and we were seeing more than 5% of tasks fail due to OOM. After this fix the failure rate has come down to less than 0.01%, and we gained a massive 30% CPU speedup.
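The numbers in that scenario line up with the ~15% estimate above; a small sketch of the same arithmetic (the memory sizes are the hypothetical ones from the scenario, not measured values):

```java
public class FairShareScenario {
    public static void main(String[] args) {
        // Hypothetical executor from the scenario above: 10 GB of shuffle memory.
        double totalShuffleGb = 10.0;
        double arrayShare = 0.15; // pointer array's rough share of execution memory

        // Fair share with 5 concurrent tasks: 2 GB each, ~300 MB of pointer array.
        double fairSharePerTaskGb = totalShuffleGb / 5;
        System.out.printf("5 tasks: %.1f GB each, array ~%.0f MB%n",
            fairSharePerTaskGb, fairSharePerTaskGb * arrayShare * 1000);

        // After 3 tasks finish: 2 tasks expand to 5 GB each, ~750 MB of pointer array,
        // which is never given back if the array is not reset on spill.
        double expandedPerTaskGb = totalShuffleGb / 2;
        System.out.printf("2 tasks: %.1f GB each, array ~%.0f MB%n",
            expandedPerTaskGb, expandedPerTaskGb * arrayShare * 1000);
    }
}
```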
That makes sense, thanks for the explanation.
I think we can still call it reset, right?
Or call it shrinkMemory() and return the size of freed memory?
good idea, will do.
Test build #2777 has finished for PR 12285 at commit
@davies - Thanks for the review. I have addressed all the comments; please let me know how it looks.
Test build #55626 has finished for PR 12285 at commit
Test build #55627 has finished for PR 12285 at commit
I'm sorry that I misread this last night: this is spillSize (the number of bytes written to disk), not the amount of freed memory, so we don't need to add the amount from inMemSorter.
Sorry again.
@sitalkedia Sorry for the trouble.
@davies - no issues, I will change it back.
```java
writeSortedFile(false);
final long spillSize = freeMemory();
inMemSorter.reset();
```
Do we really need to move this call?
Yes, we need to reset the pointer array only after freeing up the memory pages holding the records. Otherwise, the task might not get memory for the pointer array if it is already holding a lot of memory.
```java
writeSortedFile(false);
final long spillSize = freeMemory();
inMemSorter.reset();
// Reset the in-memory sorter's pointer array only after freeing up the memory pages holding the records.
```
We can move this comment into reset()
IMO, keeping the comment in ShuffleExternalSorter makes it easier to get the context and understand it. Also, if someone tries to move this call in the future, they won't do so after seeing the comment; if the comment lived in the reset() function, someone might inadvertently move the call without ever seeing it. However, if you have a strong opinion about it, I would gladly move the comment into reset(). Let me know what you think.
OK, I don't have a strong opinion on it.
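For readers following along: the thread above is only about where the explanatory comment should live. The reset() itself gives the grown pointer array back and re-acquires one at the initial size. A minimal sketch of that idea (class and field names here are illustrative, not the exact Spark source):

```java
import org.apache.spark.memory.MemoryConsumer;
import org.apache.spark.unsafe.array.LongArray;

// Illustrative sketch only: shows how a spill-time reset can shrink the pointer
// array back to its initial capacity so the freed memory returns to the pool.
final class InMemorySorterSketch {
  private final MemoryConsumer consumer; // used for allocation accounting
  private final int initialSize;         // capacity to shrink back to after a spill
  private LongArray array;               // pointer/prefix array used for sorting
  private int pos = 0;                   // number of records currently buffered

  InMemorySorterSketch(MemoryConsumer consumer, int initialSize) {
    this.consumer = consumer;
    this.initialSize = initialSize;
    this.array = consumer.allocateArray(initialSize);
  }

  void reset() {
    // Free the possibly-huge grown array first, then re-acquire a small one,
    // so a spilled task no longer pins memory it is not using.
    consumer.freeArray(array);
    array = consumer.allocateArray(initialSize);
    pos = 0;
  }
}
```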
LGTM, will merge once it passes the tests. Thanks for working on it.
Thanks a lot for your quick review and response :).
Test build #55646 has finished for PR 12285 at commit
Fix memory leak in the Sorter.

When the UnsafeExternalSorter spills the data to disk, it does not free up the underlying pointer array. As a result, we see a lot of executor OOMs and also memory underutilization. This is a regression partially introduced in PR #9241.

Tested by running a job and observed around 30% speedup after this change.

Author: Sital Kedia <[email protected]>

Closes #12285 from sitalkedia/executor_oom.

(cherry picked from commit d187e7d)
Signed-off-by: Davies Liu <[email protected]>

Conflicts:
    core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java
    core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
Merged into master and 1.6 branch (fixed the conflicts).
Sorry to bump this old issue. I have a similar problem: with Spark 1.6.1 in Scala, I get a lot of executor OOMs when I try to write the content of an RDD into multiple gzipped files in Hadoop. It worked fine when I used rdd.saveAsTextFile, or rdd.saveAsHadoopFile without the GzipCodec. Do you think the root cause of my issue could also be this memory leak in the Sorter? Thanks a lot for your help.
What changes were proposed in this pull request?
Fix memory leak in the Sorter. When the UnsafeExternalSorter spills the data to disk, it does not free up the underlying pointer array. As a result, we see a lot of executor OOMs and also memory underutilization.
This is a regression partially introduced in PR #9241.
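In essence, the change makes the spill path release the data pages and only then shrink the pointer array, along the lines of the simplified outline below (the helper methods are stand-ins, not the real ShuffleExternalSorter internals or its exact method signature):

```java
import java.io.IOException;

// Sketch of the spill ordering this PR introduces: flush records to disk, free
// the data pages, and only then shrink the pointer array, so the shrunk array
// can always be re-acquired even when memory is tight.
final class SpillOrderingSketch {
  interface InMemorySorter { void reset(); }

  private final InMemorySorter inMemSorter;

  SpillOrderingSketch(InMemorySorter inMemSorter) {
    this.inMemSorter = inMemSorter;
  }

  long spill() throws IOException {
    writeSortedFile(false);               // 1. write everything currently buffered to disk
    final long spillSize = freeMemory();  // 2. release the data pages holding the records
    inMemSorter.reset();                  // 3. now shrink the pointer array to its initial size
    return spillSize;                     // bytes written to disk, not bytes of memory freed
  }

  // Stand-ins for the real sorter internals, kept trivial for the sketch.
  private void writeSortedFile(boolean isFinalFile) throws IOException { /* no-op */ }
  private long freeMemory() { return 0L; }
}
```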
How was this patch tested?
Tested by running a job and observed around 30% speedup after this change.