[SPARK-23508][CORE] Fix BlockmanagerId in case blockManagerIdCache cause oom #20667
Conversation
Ngone51 left a comment:
At this point, I'd like to know: can we remove an id from the cache when we delete a block from the BlockManager? Does it help? WDYT?
```diff
- blockManagerIdCache.get(id)
+ val blockManagerId = blockManagerIdCache.get(id)
+ if (clearOldValues) {
+   blockManagerIdCache.clearOldValues(System.currentTimeMillis - Utils.timeStringAsMs("10d"))
```
10 days? I don't think time can be a criterion for deciding whether we should remove a cached id, no matter whether you set the threshold far less or greater than '10d'. Think about an extreme case where a block is frequently fetched throughout the app's lifetime: the id would still be removed from the cache once the time threshold passes, then re-cached the next time we get it, over and over.
@Ngone51 Thanks. I also thought about removing the id when we delete a block.
In this case, it is history replaying that triggers the problem, and we do not actually delete any block.
Maybe using a WeakReference would be better, as @jiangxb1987 mentioned? WDYT?
Thanks again!
Can we simply use `com.google.common.cache.Cache`? It has a size limit, so we don't need to worry about OOM.
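For reference, a minimal sketch of what the suggested Guava-backed cache could look like inside `BlockManagerId`'s companion object (assuming Guava's `CacheBuilder`/`CacheLoader` are on the classpath; `BlockManagerId` itself is defined elsewhere in spark-core, and the size limit of 500 is the value discussed later in this thread):

```scala
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

// A size-bounded interning cache: once more than `maximumSize` distinct
// ids are cached, least-recently-used entries are evicted, so the cache
// can no longer grow without bound and cause an OOM.
val blockManagerIdCache: LoadingCache[BlockManagerId, BlockManagerId] =
  CacheBuilder.newBuilder()
    .maximumSize(500)
    .build(new CacheLoader[BlockManagerId, BlockManagerId]() {
      // Loading the key itself makes the cache act as an interner:
      // equal ids resolve to one canonical cached instance.
      override def load(id: BlockManagerId): BlockManagerId = id
    })

def getCachedBlockManagerId(id: BlockManagerId): BlockManagerId =
  blockManagerIdCache.get(id)
```

Unlike the unbounded `ConcurrentHashMap`, eviction here is handled entirely by Guava, so no manual cleanup path is needed.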
```diff
- val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
+ val blockManagerIdCache = new TimeStampedHashMap[BlockManagerId, BlockManagerId](true)
```
Will this cause a serious performance regression? TimeStampedHashMap copies all the old values on update. cc @cloud-fan
Can we use WeakReference here to keep cached …

Why do we need this cache?

@cloud-fan I just found the commit log below, which introduced this cache:

…rIdCache cause oom" This reverts commit fc1b6a0.

Had an offline discussion with @cloud-fan, and we feel …

Nice, @jiangxb1987 @cloud-fan. I will modify it later. Thanks!

Update @jiangxb1987
```diff
- val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
+ val blockManagerIdCache = CacheBuilder.newBuilder()
+   .maximumSize(500)
```
Here I set 500, since a BlockManagerId object is about 48 B.
I do not read it from the Spark conf, since it is not convenient to get a SparkConf for the history server when using BlockManagerId.
I'm ok to hardcode it for now.
I'm +1 on hardcoding this, but we should also explain in a comment why this number was chosen.
Actually, I think 500 executors can handle most applications, and the history server has no need to cache too many BlockManagerIds. If we set this number to 500, the max size of the cache will be below 30 KB.
Agree with that, @jiangxb1987? If OK, I will update the documentation.
Please feel free to update the comment. I would set the default value to 10000, since a cost of ~500 KB of memory is feasible.
Thanks @jiangxb1987, I have updated the comment.
```scala
val blockManagerIdCache = CacheBuilder.newBuilder()
  .maximumSize(500)
  .build(new CacheLoader[BlockManagerId, BlockManagerId]() {
    override def load(id: BlockManagerId) = {
```
nit: `override def load(id: BlockManagerId) = id`
```diff
- val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
+ val blockManagerIdCache = CacheBuilder.newBuilder()
```
Is it thread-safe?
I think it is thread-safe, per:
https://google.github.io/guava/releases/22.0/api/docs/com/google/common/cache/LoadingCache.html
and https://stackoverflow.com/questions/11124856/using-guava-for-high-performance-thread-safe-caching
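A quick standalone sketch (hypothetical, not Spark code) of what that thread-safety means in practice: concurrent `get` calls on a `LoadingCache` are safe, and for a single key the `load` function runs at most once while the value stays cached:

```scala
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicInteger
import com.google.common.cache.{CacheBuilder, CacheLoader}

val loads = new AtomicInteger(0)
val cache = CacheBuilder.newBuilder()
  .maximumSize(500)
  .build(new CacheLoader[String, String]() {
    // Count how many times load() actually runs.
    override def load(k: String): String = { loads.incrementAndGet(); k }
  })

val pool = Executors.newFixedThreadPool(8)
(1 to 8).foreach { _ =>
  pool.submit(new Runnable {
    // All threads request the same key concurrently; Guava serializes
    // the load internally, so no locking is needed on the caller side.
    override def run(): Unit = cache.get("same-key")
  })
}
pool.shutdown()
```

This is exactly why the call sites in `getCachedBlockManagerId` don't need any extra synchronization around `blockManagerIdCache.get(id)`.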
LGTM

ok to test

add to whitelist
```diff
- val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
+ /**
+  * Here we set max cache size as 10000. Since the size of a BlockManagerId object
```
nit: The max cache size is hardcoded to 10000. Since the size of a BlockManagerId object is about 48 B, the total memory cost should be below 1 MB, which is feasible.
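The arithmetic behind that estimate, as a quick sanity check (48 B is the approximate per-object size quoted earlier in this thread):

```scala
val maxEntries = 10000
val bytesPerId = 48                       // approximate size of one BlockManagerId
val totalBytes = maxEntries * bytesPerId  // 480000 B
val totalKB = totalBytes / 1024.0         // ~469 KB, well below 1 MB
```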
Hi @caneGuy, sorry for my previous comment, as I mixed up … Back to now, I have the same question as @cloud-fan, though we have a better caching approach (Guava cache) now. My confusions: …

In case the same … Since the code was added a long time ago, and it's actually hard to examine the performance with/without the cache, we'd like to keep it for now.
Hi @jiangxb1987, thanks for your kind explanation.
Test build #87746 has finished for PR 20667 at commit

Test build #87748 has finished for PR 20667 at commit

Test build #87744 has finished for PR 20667 at commit

retest this please

Test build #87761 has finished for PR 20667 at commit
LGTM if we need caching.
[SPARK-23508][CORE] Fix BlockmanagerId in case blockManagerIdCache cause oom

## What changes were proposed in this pull request?

blockManagerIdCache in BlockManagerId will not remove old values, which may cause OOM:

`val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()`

Whenever we apply for a new BlockManagerId, it is put into this map. This patch uses a Guava cache for blockManagerIdCache instead. A heap dump is shown in [SPARK-23508](https://issues.apache.org/jira/browse/SPARK-23508).

## How was this patch tested?

Existing tests.

Author: zhoukang <[email protected]>
Closes #20667 from caneGuy/zhoukang/fix-history.
(cherry picked from commit 6a8abe2)
Signed-off-by: Wenchen Fan <[email protected]>
thanks, merging to master/2.3/2.2!
Thanks @cloud-fan @jiangxb1987 @kiszk @Ngone51 |