-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-5523][Core][Streaming] Add a cache for hostname in TaskMetrics to decrease the memory usage and GC overhead #5064
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #28710 has finished for PR 5064 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this cache get large? meaning, should it be a weak ref map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the size of the cache will at most be as large as the cluster size, so weak ref map may not be so necessary from my understanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its fine for now. It may be a problem for really long running applications with lots of node churn (executors are dead and wont be needed but still occupy this hashmap). But thats a really far fetched problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this seems fine to me for now. I can imagine some pathological scenarios where this map could grow very large, but, as TD said, I think we'd only see this become a serious problem with extreme scale + duration + churn.
|
Test build #28776 has finished for PR 5064 at commit
|
|
@JoshRosen Can you take a look at this? I had observed a lot of string objects with hostnames once in the stack, and this might help reduce it. This especially helps streaming because of the large number of jobs. |
|
Test build #31051 has started for PR 5064 at commit
|
|
@tdas @jerryshao how much of a performance benefit are we observing here? I wonder if this is worth doing? Also this patch has mostly gone stale at this point, so unless we rebase to master I would recommend that we just close it for now. |
|
I didn't do lots of experiments on this, I think the improvement might be reducing the minor GC effect, since we cache the host name (and this will move the old generation), but I think it is hard to measure and validate, so @tdas what's your opinion? |
|
Can you confirm whether that without this patch, this is actually a problem? Just double checking. And if it is a problem, then it is a good idea to do this with large number of executors and incredibly large number of short jobs, this can build up over time. Also could you update the patch to master. If you can confirm the problem exists with the current master, it will be good to add this. |
|
Hey @jerryshao any updates on this? |
|
Hi @tdas , I'd incline to close it for now, I will test it offline and resubmit it once getting concrete conclusions. |
|
Hi @tdas and @andrewor14 , I tested a lot on the memory consumption of Here I use According to my profiling with driver processor using JProfiler, the instance number of so if we linearly increase the task number, say to 1000 (for a middle scale cluster), we will get at least 1000 * 2000 (2M) And for now in my implementation, the memory occupation of So actually this change does reduce the memory consumption (though not so many), it is more evident in the long-running and large scale cases. |
|
Test build #36778 has finished for PR 5064 at commit
|
|
Jenkins, retest this please. |
|
Test build #36873 has finished for PR 5064 at commit
|
|
Note to self: get a JProfiler license, since it looks really cool 😄. This looks good to me. |
|
@JoshRosen any comments on this? |
|
@tdas, this looks good to me (I commented yesterday). |
|
Sorry I missed that, being in the in-code thread. I am merging this in master. |

Hostname in TaskMetrics will be created through deserialization, mostly the number of hostname is only the order of number of cluster node, so adding a cache layer to dedup the object could reduce the memory usage and alleviate GC overhead, especially for long-running and fast job generation applications like Spark Streaming.