[SPARK-5523][Core][Streaming] Add a cache for hostname in TaskMetrics to decrease the memory usage and GC overhead #5064

jerryshao · 2015-03-17T07:34:29Z

The hostname in TaskMetrics is created as a new object on every deserialization, yet the number of distinct hostnames is only on the order of the number of cluster nodes. Adding a cache layer to dedup these objects could reduce memory usage and alleviate GC overhead, especially for long-running applications that generate jobs quickly, such as Spark Streaming.
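For illustration only, the dedup idea described above can be sketched in plain Java. This is not the code in this patch: the class name `HostnameCache` and the method `canonical` are hypothetical, and the sketch assumes a process-wide `ConcurrentHashMap` is an acceptable place to hold the canonical strings.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Spark: keeps one canonical String per hostname.
final class HostnameCache {
    // Maps each distinct hostname to the first String instance seen for it.
    private static final ConcurrentHashMap<String, String> CACHE = new ConcurrentHashMap<>();

    private HostnameCache() {}

    // Returns the canonical instance for this hostname, registering it on first sight.
    static String canonical(String host) {
        // putIfAbsent returns the previously mapped value, or null if there was none.
        String existing = CACHE.putIfAbsent(host, host);
        return existing != null ? existing : host;
    }

    // Number of distinct hostnames currently cached.
    static int size() {
        return CACHE.size();
    }
}
```

A deserialization hook would pass each freshly read hostname through `canonical(...)` and keep only the returned reference, so equal hostnames end up sharing a single object and the duplicates become short-lived garbage.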

SparkQA · 2015-03-17T08:55:28Z

Test build #28710 has finished for PR 5064 at commit 7bc3834.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2015-03-17T09:46:43Z

core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala

Can this cache get large? That is, should it be a weak ref map?

jerryshao · 2015-03-18

I think the size of the cache will be at most as large as the cluster size, so a weak ref map may not be necessary, as far as I understand.

tdas · 2015-07-10

I think it's fine for now. It may be a problem for really long-running applications with lots of node churn (executors that are dead and won't be needed again still occupy this hashmap). But that's a really far-fetched problem.

JoshRosen · 2015-07-13

Yeah, this seems fine to me for now. I can imagine some pathological scenarios where this map could grow very large, but, as TD said, I think we'd only see this become a serious problem with extreme scale + duration + churn.
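For reference, the weak ref map alternative raised at the top of this thread could look roughly like the sketch below. It is a hypothetical illustration, not something this patch implements: `WeakHostnameCache` and `canonical` are made-up names, and it relies only on `java.util.WeakHashMap` and `java.lang.ref.WeakReference` from the JDK.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical alternative: entries disappear once no TaskMetrics-like object
// still references the canonical hostname string.
final class WeakHostnameCache {
    // WeakHashMap holds its keys weakly. The value must be weak as well;
    // a strong value would keep its own key reachable and block eviction.
    private final Map<String, WeakReference<String>> cache = new WeakHashMap<>();

    // WeakHashMap is not thread-safe, so access is serialized here.
    synchronized String canonical(String host) {
        WeakReference<String> ref = cache.get(host);
        String existing = (ref == null) ? null : ref.get();
        if (existing != null) {
            return existing;
        }
        cache.put(host, new WeakReference<>(host));
        return host;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

The trade-off is a lock on every lookup plus one extra `WeakReference` per entry, in exchange for automatic cleanup of hostnames from executors that are gone. That cost is only worth paying in the extreme scale, duration and churn scenario described above.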

SparkQA · 2015-03-18T07:23:44Z

Test build #28776 has finished for PR 5064 at commit e4de2b4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2015-04-03T06:44:23Z

@JoshRosen Can you take a look at this? I had observed a lot of string objects with hostnames once in the stack, and this might help reduce it. This especially helps streaming because of the large number of jobs.

SparkQA · 2015-04-27T18:23:52Z

Test build #31051 has started for PR 5064 at commit e4de2b4.

This patch does not merge cleanly.

andrewor14 · 2015-06-18T23:51:04Z

@tdas @jerryshao how much of a performance benefit are we observing here? I wonder if this is worth doing? Also this patch has mostly gone stale at this point, so unless we rebase to master I would recommend that we just close it for now.

jerryshao · 2015-06-19T01:43:30Z

I haven't run many experiments on this. I think the improvement would mainly be a reduced minor GC effect, since we cache the hostname (and the cached strings will move to the old generation), but that is hard to measure and validate. @tdas, what's your opinion?

tdas · 2015-06-19T02:26:18Z

Can you confirm whether this is actually a problem without this patch? Just double checking.

If it is a problem, then this is a good idea: with a large number of executors and an incredibly large number of short jobs, this can build up over time.

Also, could you update the patch to master? If you can confirm that the problem exists with the current master, it will be good to add this.

tdas · 2015-06-30T19:35:41Z

Hey @jerryshao any updates on this?

jerryshao · 2015-07-01T01:25:12Z

Hi @tdas, I'm inclined to close it for now. I will test it offline and resubmit it once I have concrete conclusions.

jerryshao · 2015-07-08T09:09:44Z

Hi @tdas and @andrewor14, I ran a number of tests on the memory consumption of TaskMetrics and the related _hostname string.

I used DirectKafkaWordCount as the test workload with the task number set to 1 as a minimal setting, on a standalone-mode cluster with 1 master and 1 slave.

According to my profiling of the driver process with JProfiler, the number of TaskMetrics instances is at least around 2000 (with a full GC triggered); you can refer to this chart:

If we increase the task number linearly, say to 1000 (for a medium-scale cluster), we get at least 1000 * 2000 (2M) TaskMetrics objects, and with the previous code also 2M _hostname objects. If each _hostname occupies 64 bytes, about 128 MB of memory in total is occupied by _hostname objects. This is proportional to the task number and to the number of TaskMetrics objects.

With my implementation, the memory occupied by _hostname is proportional to the cluster size (independent of the task number and the number of TaskMetrics objects). For example, with 1000 nodes in the cluster, the total memory occupied by _hostname is 1000 * 64 bytes, plus one additional hashmap.

So this change does reduce memory consumption (though not by much), and the effect is more evident in long-running, large-scale cases.
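The estimate above can be checked with a few lines of arithmetic. The sketch below only restates the numbers assumed in this comment (about 2000 retained TaskMetrics per task, 64 bytes per _hostname); the class and method names are hypothetical, and the overhead of the additional hashmap is deliberately left out.

```java
// Hypothetical back-of-the-envelope calculator for the figures quoted above.
final class HostnameMemoryEstimate {
    static final long BYTES_PER_HOSTNAME = 64;   // assumed size of one _hostname
    static final long METRICS_PER_TASK = 2000;   // TaskMetrics instances observed with 1 task

    private HostnameMemoryEstimate() {}

    // Without the cache: one _hostname per retained TaskMetrics object.
    static long bytesWithoutCache(long tasks) {
        return tasks * METRICS_PER_TASK * BYTES_PER_HOSTNAME;
    }

    // With the cache: one _hostname per cluster node (hashmap overhead ignored).
    static long bytesWithCache(long nodes) {
        return nodes * BYTES_PER_HOSTNAME;
    }
}
```

For 1000 tasks, `bytesWithoutCache(1000)` gives 128,000,000 bytes (about 128 MB), while `bytesWithCache(1000)` for 1000 nodes gives 64,000 bytes (64 KB).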

SparkQA · 2015-07-08T10:57:41Z

Test build #36778 has finished for PR 5064 at commit 3e2412a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jerryshao · 2015-07-09T01:27:07Z

Jenkins, retest this please.

SparkQA · 2015-07-09T03:40:36Z

Test build #36873 has finished for PR 5064 at commit 3e2412a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-13T21:31:23Z

Note to self: get a JProfiler license, since it looks really cool 😄.

This looks good to me.

tdas · 2015-07-15T02:24:57Z

@JoshRosen any comments on this?

JoshRosen · 2015-07-15T02:38:01Z

@tdas, this looks good to me (I commented yesterday).

tdas · 2015-07-15T02:53:40Z

Sorry, I missed that because it was in the in-code thread. I am merging this into master.

srowen reviewed Mar 17, 2015

jerryshao closed this Jul 1, 2015

jerryshao added 2 commits July 8, 2015 10:44

Add a pool to cache the hostname

b092a81

Address the comments

3e2412a

jerryshao reopened this Jul 8, 2015

jerryshao force-pushed the SPARK-5523 branch from e4de2b4 to 3e2412a July 8, 2015 09:11

asfgit closed this in bb870e7 Jul 15, 2015
