
@zsxwing zsxwing commented Jun 1, 2017

## What changes were proposed in this pull request?

In [this line](https://github.com/apache/spark/blob/f7cf2096fdecb8edab61c8973c07c6fc877ee32d/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L128), the `executorId` string received from executors is used and eventually ends up in `TaskUIData`. Since deserializing the `executorId` string always creates a new instance, we end up with many duplicated string instances.

This PR interns the strings stored in `TaskUIData` to reduce memory usage.
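As a self-contained illustration of the idea, here is a minimal weak string interner sketched in Java using only the JDK. The class names are hypothetical for this example; Spark's actual `weakIntern` helper is reportedly built on a Guava weak interner, which behaves similarly but with a production-quality concurrent implementation. Unlike `String.intern()`, entries here can be garbage-collected once no longer referenced.

```java
import java.lang.ref.WeakReference;
import java.util.WeakHashMap;

// Hypothetical minimal weak interner: returns one canonical instance per
// distinct string value, without pinning strings in memory forever.
final class WeakStringInterner {
    private final WeakHashMap<String, WeakReference<String>> pool = new WeakHashMap<>();

    synchronized String intern(String s) {
        WeakReference<String> ref = pool.get(s);
        String cached = (ref == null) ? null : ref.get();
        if (cached != null) {
            return cached;          // reuse the canonical instance
        }
        // WeakReference as the value avoids a strong value->key reference,
        // so the entry can still be collected when the string becomes unreachable.
        pool.put(s, new WeakReference<>(s));
        return s;
    }
}

public class InternDemo {
    public static void main(String[] args) {
        WeakStringInterner interner = new WeakStringInterner();
        // Deserialization produces a fresh instance per message, even for equal content:
        String a = new String("executor-1");
        String b = new String("executor-1");
        System.out.println(a.equals(b)); // true: same content
        System.out.println(a == b);      // false: two separate copies on the heap
        // Interning collapses them to a single canonical instance:
        System.out.println(interner.intern(a) == interner.intern(b)); // true
    }
}
```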

## How was this patch tested?

Manually tested using `bin/spark-shell --master local-cluster[6,1,1024]`. Test code:

```
for (_ <- 1 to 10) { sc.makeRDD(1 to 1000, 1000).count() }
Thread.sleep(2000)
val l = sc.getClass.getMethod("jobProgressListener").invoke(sc).asInstanceOf[org.apache.spark.ui.jobs.JobProgressListener]
org.apache.spark.util.SizeEstimator.estimate(l.stageIdToData)
```

This PR reduces the size of `stageIdToData` from 3487280 bytes to 3009744 bytes (86.3% of the original size) in the above case.

```
- executorId = taskInfo.executorId,
- host = taskInfo.host,
+ executorId = weakIntern(taskInfo.executorId),
+ host = weakIntern(taskInfo.host),
```
@zsxwing (author) commented on this diff:
Interning `host` as well just to be safe, although `host` doesn't come from executors.

zsxwing commented Jun 1, 2017

cc @JoshRosen

srowen commented Jun 1, 2017

It seems reasonable, but there are likely many more similar opportunities. Is this a significantly large one, or are there other easy wins?

Interning strings isn't as problematic as it used to be. Ideally, the JVM's string deduplication would handle this even better, and automatically, though it is still behind an experimental flag.

zsxwing commented Jun 1, 2017

@srowen Considering that Spark keeps 1000 stages in the UI, if each stage has 10000 tasks, the duplicated strings add up quickly. I observed about 150MB of them in a heap dump. Of course, it's not enormous, but it's still worth eliminating. I agree that JVM string deduplication is better, but it's not enabled by default.

SparkQA commented Jun 1, 2017

Test build #77654 has finished for PR 18177 at commit 38a4e31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor) commented:
LGTM. This seems like an easy win based on our empirical observations of high memory usage. I'm sure that we'll continue to find and fix others as we profile and heap dump more.

zsxwing commented Jun 2, 2017

Thanks! Merging to master and 2.2.

asfgit pushed a commit that referenced this pull request Jun 2, 2017

Author: Shixiong Zhu <[email protected]>

Closes #18177 from zsxwing/SPARK-20955.

(cherry picked from commit 16186cd)
Signed-off-by: Shixiong Zhu <[email protected]>
@asfgit asfgit closed this in 16186cd Jun 2, 2017
@zsxwing zsxwing deleted the SPARK-20955 branch June 2, 2017 17:38