
Conversation

@jerryshao
Contributor

The hostname in TaskMetrics is created through deserialization, while the number of distinct hostnames is only on the order of the number of cluster nodes, so adding a cache layer to deduplicate these objects could reduce memory usage and alleviate GC overhead, especially for long-running applications that generate jobs rapidly, such as Spark Streaming.
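
For context, a minimal sketch of the deduplication idea is below (hypothetical names such as HostNameCacheSketch, not the exact patch): keep one canonical String per hostname and swap it in when a TaskMetrics instance is deserialized, so the number of live hostname strings scales with the cluster size rather than the task count.

```scala
import java.io.{IOException, ObjectInputStream}
import scala.collection.concurrent.TrieMap

// Hypothetical sketch, not the actual Spark code: a process-wide cache that
// maps each hostname to a single canonical String instance.
object HostNameCacheSketch {
  private val cache = new TrieMap[String, String]()

  def getCanonical(host: String): String = cache.getOrElseUpdate(host, host)
}

class TaskMetricsSketch(private var _hostname: String) extends Serializable {
  def hostname: String = _hostname

  // Java serialization hook: after the default fields are read, replace the
  // freshly deserialized hostname with the shared canonical instance.
  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    _hostname = HostNameCacheSketch.getCanonical(_hostname)
  }
}
```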

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28710 has finished for PR 5064 at commit 7bc3834.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Can this cache get large? Meaning, should it be a weak ref map?

Contributor Author

I think the size of the cache will be at most the cluster size, so a weak ref map may not be necessary, from my understanding.

Contributor

I think it's fine for now. It may be a problem for really long-running applications with lots of node churn (executors that are dead and won't be needed but still occupy this hashmap), but that's a really far-fetched problem.

Contributor

Yeah, this seems fine to me for now. I can imagine some pathological scenarios where this map could grow very large, but, as TD said, I think we'd only see this become a serious problem with extreme scale + duration + churn.
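
For reference, if the node-churn scenario discussed here ever became a real concern, a weak-reference variant could be swapped in. The sketch below is only an alternative under that assumption, not what this patch implements; it uses a hypothetical WeakHostNameCacheSketch name and assumes Guava (already a Spark dependency) is on the classpath.

```scala
import com.google.common.collect.Interners

// Alternative sketch only: a weak interner still returns one canonical
// instance per hostname, but entries for hosts that are no longer referenced
// anywhere can be garbage collected, so dead executors do not pin their
// hostnames in the map forever.
object WeakHostNameCacheSketch {
  private val interner = Interners.newWeakInterner[String]()

  def getCanonical(host: String): String = interner.intern(host)
}
```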

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28776 has finished for PR 5064 at commit e4de2b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Apr 3, 2015

@JoshRosen Can you take a look at this? I once observed a lot of string objects with hostnames in the stack, and this change might help reduce them. This especially helps streaming because of the large number of jobs.

@SparkQA

SparkQA commented Apr 27, 2015

Test build #31051 has started for PR 5064 at commit e4de2b4.

  • This patch does not merge cleanly.

@andrewor14
Contributor

@tdas @jerryshao how much of a performance benefit are we observing here? I wonder whether this is worth doing. Also, this patch has mostly gone stale at this point, so unless we rebase it onto master I would recommend that we just close it for now.

@jerryshao
Contributor Author

I haven't done many experiments on this. I think the improvement might be a reduction in minor GC overhead, since we cache the hostname (and the cached strings will be promoted to the old generation), but I think it is hard to measure and validate, so @tdas, what's your opinion?

@tdas
Contributor

tdas commented Jun 19, 2015

Can you confirm that without this patch this is actually a problem? Just double-checking.

And if it is a problem, then it is a good idea to do this: with a large number of executors and an incredibly large number of short jobs, this can build up over time.

Also, could you update the patch against master? If you can confirm the problem exists with the current master, it will be good to add this.

@tdas
Contributor

tdas commented Jun 30, 2015

Hey @jerryshao, any updates on this?

@jerryshao
Contributor Author

Hi @tdas, I'm inclined to close it for now. I will test it offline and resubmit it once I have concrete conclusions.

@jerryshao closed this Jul 1, 2015
@jerryshao
Contributor Author

Hi @tdas and @andrewor14, I ran a lot of tests on the memory consumption of TaskMetrics and the related _hostname string.

Here I used DirectKafkaWordCount as the test workload, with the task number set to 1 as a minimal setting, on 1 master + 1 slave in standalone mode.

According to my profiling of the driver process with JProfiler, the number of TaskMetrics instances is at least around 2000 (with full GC triggered); you can refer to this chart:
[JProfiler chart: TaskMetrics instance count]

So if we linearly increase the task number, say to 1000 (for a mid-scale cluster), we will get at least 1000 * 2000 = 2M TaskMetrics objects, and with the previous code also 2M _hostname objects. If each _hostname occupies 64 bytes, about 128 MB of memory in total will be taken up by _hostname objects, and this grows proportionally with the task number and the number of TaskMetrics instances.

With my implementation, the memory occupied by _hostname is proportional to the cluster size (unrelated to the task number or the number of TaskMetrics instances): if we have 1000 nodes in the cluster, the total memory occupied by _hostname will be 1000 * 64 bytes, plus one additional hashmap.

So this change does reduce memory consumption (though not by a huge amount), and the benefit is more evident in long-running and large-scale cases.
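
For clarity, the back-of-envelope arithmetic above works out as follows (the figures are the assumed values from this comment, not new measurements):

```scala
// Assumed figures from the comment above (not newly measured):
val taskMetricsPerTask = 2000L   // TaskMetrics instances observed with task number = 1
val tasks              = 1000L   // scaled-up task number for a mid-scale cluster
val bytesPerHostname   = 64L     // assumed size of one _hostname String

// Previous code: one _hostname String per TaskMetrics instance.
val hostnameBytesBefore = taskMetricsPerTask * tasks * bytesPerHostname  // 128,000,000 B ~ 128 MB

// With the cache: one String per distinct node, plus one small hashmap.
val nodes = 1000L
val hostnameBytesAfter = nodes * bytesPerHostname                        // 64,000 B ~ 64 KB
```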

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36778 has finished for PR 5064 at commit 3e2412a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36873 has finished for PR 5064 at commit 3e2412a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Note to self: get a JProfiler license, since it looks really cool 😄.

This looks good to me.

@tdas
Contributor

tdas commented Jul 15, 2015

@JoshRosen any comments on this?

@JoshRosen
Contributor

@tdas, this looks good to me (I commented yesterday).

@tdas
Contributor

tdas commented Jul 15, 2015

Sorry, I missed that since it was in the in-code thread. I am merging this into master.

@asfgit closed this in bb870e7 Jul 15, 2015