
Conversation

@caneGuy
Contributor

@caneGuy caneGuy commented Feb 24, 2018

… cause oom

What changes were proposed in this pull request?

blockManagerIdCache in BlockManagerId never removes old values, which may cause OOM.

`val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()`

Whenever we obtain a new BlockManagerId via apply, it is put into this map and never evicted.

This patch uses a Guava cache for blockManagerIdCache instead.

A heap dump is shown in SPARK-23508.
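Roughly, the replacement looks like the sketch below (Guava `CacheBuilder`/`CacheLoader`; the size bound of 10000 is what the review eventually settled on and is an assumption here):

```scala
import com.google.common.cache.{CacheBuilder, CacheLoader}

// Bounded, thread-safe replacement for the unbounded ConcurrentHashMap.
// Once the size limit is hit, Guava evicts entries that have not been used recently.
val blockManagerIdCache = CacheBuilder.newBuilder()
  .maximumSize(10000)
  .build(new CacheLoader[BlockManagerId, BlockManagerId]() {
    // On a miss, the freshly created id itself becomes the cached canonical instance.
    override def load(id: BlockManagerId): BlockManagerId = id
  })

def getCachedBlockManagerId(id: BlockManagerId): BlockManagerId = {
  blockManagerIdCache.get(id)
}
```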

How was this patch tested?

Existing tests.

Member

@Ngone51 Ngone51 left a comment


At this point, I'd like to know: can we remove an id from the cache when we delete a block from the BlockManager? Does it help? WDYT?

```scala
blockManagerIdCache.get(id)
val blockManagerId = blockManagerIdCache.get(id)
if (clearOldValues) {
  blockManagerIdCache.clearOldValues(System.currentTimeMillis - Utils.timeStringAsMs("10d"))
```
Member

10 days? I don't think time can be the criterion for deciding whether we should remove a cached id, even if you set the time threshold far less/greater than '10d'. Think about an extreme case where a block is fetched frequently throughout the app's lifetime: its id would still be removed from the cache once the time threshold passes, then re-cached the next time we get it, over and over.

Contributor Author

@Ngone51 Thanks. I also thought about removing the entry when we delete a block.
In this case it is history replaying that triggers the problem, and we do not actually delete any blocks.
Maybe using a WeakReference, as @jiangxb1987 mentioned, would be better? WDYT?
Thanks again!

Contributor

Can we simply use com.google.common.cache.Cache, which has a size limit, so we don't need to worry about OOM?

```scala
}

val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
val blockManagerIdCache = new TimeStampedHashMap[BlockManagerId, BlockManagerId](true)
```
Contributor

Will this cause a serious performance regression? TimeStampedHashMap will copy all the old values on update. cc @cloud-fan

@jiangxb1987
Contributor

Can we use a WeakReference here to keep the cached BlockManagerId?

@cloud-fan
Contributor

Why do we need this cache?

@caneGuy
Contributor Author

caneGuy commented Feb 26, 2018

@cloud-fan I just found the commit log below, which introduced this cache:
Modified StorageLevel and BlockManagerId to cache common objects and use cached object while deserializing.
I can't figure out why we need the cache, since I think the cache miss ratio may be high.

@caneGuy caneGuy changed the title [SPARK-23508][CORE] Use timeStampedHashMap for BlockmanagerId in case blockManagerIdCache… [SPARK-23508][CORE] Use softreference for BlockmanagerId in case blockManagerIdCache… Feb 26, 2018
@jiangxb1987
Contributor

jiangxb1987 commented Feb 26, 2018

Had an offline discussion with @cloud-fan and we feel com.google.common.cache.Cache should be used here. You can find an example at CodeGenerator.cache.

@caneGuy
Contributor Author

caneGuy commented Feb 26, 2018

Nice, @jiangxb1987 @cloud-fan, I will modify it later. Thanks!

@caneGuy caneGuy changed the title [SPARK-23508][CORE] Use softreference for BlockmanagerId in case blockManagerIdCache… [SPARK-23508][CORE] Fix BlockmanagerId in case blockManagerIdCache cause oom Feb 26, 2018
@caneGuy
Contributor Author

caneGuy commented Feb 26, 2018

Updated, @jiangxb1987.


```scala
val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
val blockManagerIdCache = CacheBuilder.newBuilder()
  .maximumSize(500)
```
Contributor Author

Here I set 500, since a BlockManagerId is about 48 B per object.
I do not use a Spark conf, since it is not convenient to get the Spark conf for the history server when using BlockManagerId.

Contributor

I'm ok to hardcode it for now.

Contributor

I'm +1 on hardcoding this, but we should also explain in the comment why this number was chosen.

Contributor Author

Actually I think 500 executors can handle most applications, and for the history server there is no need to cache too many BlockManagerIds. If we set this number to 50, the max size of the cache will be below 30 KB.
Do you agree, @jiangxb1987? If OK I will update the documentation.

Contributor

Please feel free to update any comment, and I would set the default value to 10000, since I think a cost of ~500 KB of memory is feasible.
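For reference, the arithmetic behind that number, taking the ~48 B per BlockManagerId figure quoted above:

```latex
10000 \times 48\ \text{B} = 480000\ \text{B} \approx 480\ \text{KB} < 1\ \text{MB}
```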

Contributor Author

Thanks @jiangxb1987, I have updated the comment.

```scala
val blockManagerIdCache = CacheBuilder.newBuilder()
  .maximumSize(500)
  .build(new CacheLoader[BlockManagerId, BlockManagerId]() {
    override def load(id: BlockManagerId) = {
```
Contributor

nit: `override def load(id: BlockManagerId) = id`

```scala
}

val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
val blockManagerIdCache = CacheBuilder.newBuilder()
```
Contributor

is it thread-safe?
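For context, Guava's LoadingCache is documented as thread-safe, and concurrent get calls for the same key wait for a single load rather than loading twice. Below is a standalone sketch of that behavior under assumed, hypothetical names (Key, GuavaThreadSafetySketch); it is not code from this PR:

```scala
import java.util.concurrent.atomic.AtomicInteger

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

import com.google.common.cache.{CacheBuilder, CacheLoader}

object GuavaThreadSafetySketch {
  // Hypothetical key type standing in for BlockManagerId.
  case class Key(executorId: String, host: String, port: Int)

  private val loads = new AtomicInteger(0)

  private val cache = CacheBuilder.newBuilder()
    .maximumSize(100)
    .build(new CacheLoader[Key, Key]() {
      // Count how many times the loader actually runs.
      override def load(k: Key): Key = { loads.incrementAndGet(); k }
    })

  def main(args: Array[String]): Unit = {
    // 32 concurrent lookups with structurally-equal keys.
    val lookups = (1 to 32).map(_ => Future(cache.get(Key("exec-1", "host-1", 7077))))
    val results = Await.result(Future.sequence(lookups), 10.seconds)

    assert(results.forall(_ eq results.head)) // every caller sees the same canonical instance
    assert(loads.get() == 1)                  // the loader ran exactly once despite the races
  }
}
```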


@cloud-fan
Contributor

LGTM

@cloud-fan
Contributor

ok to test

@cloud-fan
Contributor

add to whitelist


```scala
val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
/**
 * Here we set max cache size as 10000.Since the size of a BlockManagerId object
```
Contributor

nit:

The max cache size is hardcoded to 10000, since the size of a BlockManagerId object is about 48B, the total memory cost should be below 1MB which is feasible.

@Ngone51
Member

Ngone51 commented Feb 28, 2018

Hi @caneGuy, sorry for my previous comments; I mixed up BlockId with BlockManagerId and left some wrong comments. And thanks for your reply.

Back to the present: I have the same question as @cloud-fan,

Why do we need this cache?

though we now have a better caching mechanism (a Guava cache).

My points of confusion:

  • It is weird that we need to create a BlockManagerId before we can get the same one from the cache.

  • And on the executor side, when a BlockManagerId is registered with the master and an updated BlockManagerId is returned, the new BlockManagerId is not written back to blockManagerIdCache. So it seems the executor side's BlockManagerId has little relevance to blockManagerIdCache.

@jiangxb1987
Contributor

In case the same BlockManagerId is created multiple times, this cache ensures we always use the first one created, which makes it possible for the remaining BlockManagerId instances to be recycled shortly. The downside is that we have to retain all the distinct BlockManagerIds ever created.

Since the code was added a long time ago, and it's actually hard to measure the performance with/without the cache, we'd like to keep it for now.
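A small illustration of the interning behavior described above (hypothetical executor/host/port values; it assumes access to the Spark-internal BlockManagerId factory, so it only runs from inside Spark's own packages):

```scala
// Both apply() calls go through getCachedBlockManagerId, so the second call
// returns the instance already held in blockManagerIdCache.
val a = BlockManagerId("exec-1", "host-1", 7077)
val b = BlockManagerId("exec-1", "host-1", 7077)
assert(a == b)  // structural equality, as before
assert(a eq b)  // reference equality: only one canonical instance is retained
```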

@Ngone51
Member

Ngone51 commented Feb 28, 2018

Hi @jiangxb1987, thanks for your kind explanation.

@SparkQA

SparkQA commented Feb 28, 2018

Test build #87746 has finished for PR 20667 at commit 3379899.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 28, 2018

Test build #87748 has finished for PR 20667 at commit bf79f4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 28, 2018

Test build #87744 has finished for PR 20667 at commit 3379899.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Feb 28, 2018

Test build #87761 has finished for PR 20667 at commit bf79f4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Feb 28, 2018

LGTM if we need caching.

asfgit pushed a commit that referenced this pull request Feb 28, 2018

Author: zhoukang <[email protected]>

Closes #20667 from caneGuy/zhoukang/fix-history.

(cherry picked from commit 6a8abe2)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Contributor

thanks, merging to master/2.3/2.2!

asfgit pushed a commit that referenced this pull request Feb 28, 2018
@asfgit asfgit closed this in 6a8abe2 Feb 28, 2018
@caneGuy
Contributor Author

caneGuy commented Feb 28, 2018

Thanks @cloud-fan @jiangxb1987 @kiszk @Ngone51

@caneGuy caneGuy deleted the zhoukang/fix-history branch March 1, 2018 06:03
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018