
Conversation

@nishkamravi2
Contributor

Related to #894 and https://issues.apache.org/jira/browse/SPARK-2398 Experiments show that memory_overhead grows with container size. The multiplier has been experimentally obtained and can potentially be improved over time.
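
Roughly, the change amounts to the following (a sketch only, not the patch itself; the 0.07 factor and the 384 MB floor are placeholders drawn from this discussion, not confirmed constants):

```scala
object MemoryOverheadSketch {
  // All constants here are assumptions taken from this discussion, not the patch itself.
  val OverheadFactor = 0.07   // multiplier proposed in this PR (later tuned between 0.06 and 0.07)
  val OverheadMinMB  = 384    // assumed old flat default, kept as a lower bound

  // Old behaviour: a flat constant, regardless of container size.
  def additiveDefault(executorMemoryMB: Int): Int = OverheadMinMB

  // Proposed behaviour: overhead scales with the container size.
  def multiplierDefault(executorMemoryMB: Int): Int =
    math.max((OverheadFactor * executorMemoryMB).toInt, OverheadMinMB)

  def main(args: Array[String]): Unit =
    Seq(2048, 8192, 30720).foreach { mem =>
      println(s"executor=${mem}MB  old=${additiveDefault(mem)}MB  new=${multiplierDefault(mem)}MB")
    }
}
```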

@nishkamravi2
Contributor Author

@AmplabJenkins

Can one of the admins verify this patch?

@mridulm
Contributor

mridulm commented Jul 13, 2014

We have gone over this in the past... it is suboptimal to make it a linear function of executor/driver memory. Overhead is a function of the number of executors, the number of open files, shuffle VM pressure, etc. It is NOT a function of executor memory, which is why it is separately configured.

@srowen
Member

srowen commented Jul 13, 2014

That makes sense, but then it doesn't explain why a constant amount works for a given job when executor memory is low, and then doesn't work when it is high. This has also been my experience and I don't have a great grasp on why it would be. More threads and open files in a busy executor? It goes indirectly with how big you need your executor to be, but not directly.

Nishkam, do you have a sense of how much extra memory you had to configure to get it to work when executor memory increased? Is it pretty marginal, or quite substantial?

@nishkamravi2
Contributor Author

Yes, I'm aware of the discussion on this issue in the past. Experiments confirm that overhead is a function of executor memory. Why and how can be figured out with due diligence and analysis. It may be a function of other parameters and the function may be fairly complex. However, the proportionality is undeniable. Besides, we are only adjusting the default value and making it a bit more resilient. The memory_overhead parameter can still be configured by the developer separately. The constant additive factor makes little sense (empirically).

@nishkamravi2
Contributor Author

Sean, the memory_overhead is fairly substantial. More than 2GB for a 30GB executor. Less than 400MB for a 2GB executor.

@mridulm
Contributor

mridulm commented Jul 13, 2014

The default constant is actually a lower bound to account for other overheads (since YARN will aggressively kill tasks)... Unfortunately we have not sized this properly, and don't have a good recommendation on how to set it.

This is compounded by magic constants in Spark for various IO ops, non-deterministic network behaviour (we should be able to estimate the upper bound here as 2x the number of workers), VM memory use (shuffle output is mmap'ed whole... falling foul of YARN's virtual memory limits) and so on.

Hence sizing this is, unfortunately, app-specific.

@mridulm
Contributor

mridulm commented Jul 13, 2014

That would be a function of your jobs. Other apps would have drastically different characteristics... which is why we can't generalize to a simple fraction of executor memory. It actually buys us nothing in the general case... jobs will continue to fail when it is incorrect, while wasting a lot of memory.

@mridulm
Contributor

mridulm commented Jul 13, 2014

The basic issue is that you are trying to model overhead using the wrong variable... it actually has no correlation with executor memory (other than VM overheads as the heap increases).

@srowen
Member

srowen commented Jul 13, 2014

Yes of course, lots of settings' best or even usable values are ultimately app-specific. Ideally, defaults work for lots of cases. A flat value is the simplest of models, and anecdotally, the current default value does not work in medium- to large-memory YARN jobs. You can increase the default, but then the overhead gets silly for small jobs -- 1GB? And all of these are not-uncommon use cases.

None of that implies the overhead logically scales with container memory. Empirically, it may do, and that's useful. Until the magic explanatory variable is found, which one is less problematic for end users -- a flat constant that frequently has to be tuned, or an imperfect model that could get it right in more cases?

That said, it is kind of a developer API change and feels like something not to keep reimagining.

Nishkam, can you share any anecdotal evidence about how the overhead changes? If executor memory is the only variable changing, that seems to be evidence against it being driven by other factors, but I don't know if that's what we know.

@mridulm
Contributor

mridulm commented Jul 13, 2014

You are lucky :-) For some of our jobs, in an 8 GB container, the overhead is 1.8 GB!

@nishkamravi2
Contributor Author

Experimented with three different workloads and noticed common patterns of proportionality. Other parameters were left unchanged and only the executor size was increased. The memory overhead ranges between 0.05 and 0.08 * executor_memory.
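
For illustration, here is the arithmetic that range implies for the executor sizes mentioned earlier in the thread (a rough sketch, not additional measurements):

```scala
object OverheadRangeCheck {
  def main(args: Array[String]): Unit = {
    // e.g. 30 GB at 0.05-0.08 gives roughly 1.5-2.4 GB, consistent with the
    // "more than 2 GB for a 30 GB executor" observation earlier in the thread.
    for (mem <- Seq(2 * 1024, 30 * 1024); factor <- Seq(0.05, 0.08)) {
      val overheadMB = (factor * mem).toInt
      println(f"executor=$mem%5d MB  factor=$factor%.2f  overhead=$overheadMB%5d MB")
    }
  }
}
```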

@nishkamravi2
Contributor Author

That's why the parameter is configurable. If you have jobs that cause 20-25% memory_overhead, default values will not help.
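
For example, a job that genuinely needs 20-25% can always set the overhead explicitly (a hypothetical sketch; spark.yarn.executor.memoryOverhead, in MB, is the existing per-job knob this thread refers to, as I understand it, not something introduced by this PR):

```scala
import org.apache.spark.SparkConf

object OverheadOverrideExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("overhead-override-example")
      .set("spark.executor.memory", "30g")
      .set("spark.yarn.executor.memoryOverhead", "7168") // ~23% of 30 GB, in MB
    println(conf.get("spark.yarn.executor.memoryOverhead"))
  }
}
```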

@mridulm
Contributor

mridulm commented Jul 13, 2014

You are missing my point, I think... To give an unscientific anecdotal example: our GBDT experiments, which run on about 22 nodes, need no tuning. Our collaborative filtering experiments, running on 300 nodes, require a much higher overhead. But QR factorization on the same 300 nodes needs a much lower overhead. The values are all over the place and very app-specific.

In an effort to ensure jobs always run to completion, setting the overhead to a high fraction of executor memory might ensure successful completion, but at a high performance loss and with substandard scaling.

I would like a good default estimate of the overhead... but that is not a fraction of executor memory. Instead of trying to model the overhead using executor memory, it would be better to look at the actual parameters which influence it (as in, look at the code and figure it out, followed by validation and tuning of course) and use that as the estimate.

@nishkamravi2
Contributor Author

Mridul, I think you are missing the point. We understand that this parameter will in a lot of cases have to be specified by the developer, since there is no easy way to model it (that's why we are retaining it as a configurable parameter). However, the question is what a good default value would be.

"I would like a good default estimate of overhead ... But that is not
fraction of executor memory. "

You are mistaken. It may not be a directly correlated variable, but it is most certainly indirectly correlated. And it is probably correlated to other app-specific parameters as well.

"Until the magic explanatory variable is found, which one is less problematic for end users -- a flat constant that frequently has to be tuned, or an imperfect model that could get it right in more cases?"

This is the right point of view.

@mridulm
Contributor

mridulm commented Jul 13, 2014

> Mridul, I think you are missing the point. We understand that this parameter will in a lot of cases have to be specified by the developer, since there is no easy way to model it (that's why we are retaining it as a configurable parameter). However, the question is what a good default value would be.

It does not help to estimate using the wrong variable. Any correlations that exist are incidental and app-specific, as I elaborated before.

The only actual correlation between executor memory and overhead is the JVM overhead of managing very large heaps (and that is very high as a fraction). Other factors in Spark have a far higher impact than this.

> You are mistaken. It may not be a directly correlated variable, but it is most certainly indirectly correlated. And it is probably correlated to other app-specific parameters as well.

Please see above.

> "Until the magic explanatory variable is found, which one is less problematic for end users -- a flat constant that frequently has to be tuned, or an imperfect model that could get it right in more cases?"
>
> This is the right point of view.

Which has been our view even in previous discussions :-)
It is unfortunate that we did not approximate this better from the start and went with the constant from the prototype impl.

Note that this estimation would be very volatile to Spark internals.

@mridulm
Contributor

mridulm commented Jul 13, 2014

Correction: (and that is NOT very high as a fraction).

Typing on phones can suck :-)

To add to Sean's point: we definitely need to estimate this better.
I want to ensure we do that on the right parameters, to minimize memory waste while giving good out-of-the-box behaviour.

@mridulm
Contributor

mridulm commented Jul 13, 2014

Hmm, looks like some of my responses to Sean via mail reply have not shown up here... Maybe mail gateway delays?

@mridulm
Contributor

mridulm commented Jul 13, 2014

Since this is a recurring nightmare for our users, let me try to list the factors which influence overhead, given the current state of the Spark codebase, in the JIRA when I am back at my desk... and we can add to that and model from there (unfortunately I won't be able to lead the effort, so it would be great if you or Sean can).

If it so happens that at the end of the exercise it is a linear function of memory, I am fine with it, as long as we decide based on actual data :-)

@nishkamravi2
Contributor Author

Bringing the discussion back online. Thanks for all the input so far.

Ran a few experiments yesterday and today. The number of executors (which was the other main handle we wanted to factor in) doesn't seem to have any noticeable impact. Tried a few other parameters such as num_partitions and default_parallelism, but nothing sticks. Confirmed the proportionality with container size. Have also been trying to tune the multiplier to minimize potential waste, and I think 6% (as opposed to the 7% we currently have) is the lowest we should go. Modifying the PR accordingly.

@tgravescs
Contributor

I'll let mridul comment on this, but I think adding a comment explaining where 0.06 came from would be useful.

@nishkamravi2
Contributor Author

6% was experimentally obtained (with the goal of keeping the bound as tight as possible without the containers crashing). Three workloads were experimented with (PageRank, WordCount and KMeans) over moderate to large input datasets, configured such that the containers are optimally utilized (neither under-utilized nor over-subscribed). Based on my observations, less than 5% is a no-no. If someone would like to tune this parameter further and make a case for a higher value (keeping in mind that this is a default value that will not cover all workloads), that would be helpful.

@SparkQA

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

Contributor (inline review comment)

line too long, here and other places

@andrewor14
Contributor

What is the current state of this PR? @tgravescs @mridulm any more thoughts about the current approach? There is a related PR for Mesos (#2401) and I'm wondering if we can use the same approach in both places.

@nishkamravi2
Contributor Author

Updated as per @andrewor14's comments.

@sryza
Contributor

sryza commented Sep 19, 2014

These changes look good to me. This addresses what continues to be the #1 issue that we see in Cloudera customer YARN deployments. It's worth considering boosting this when using PySpark, but that's probably work for another JIRA.

@sryza
Contributor

sryza commented Sep 19, 2014

@nishkamravi2 mind resolving the merge conflicts?

@nishkamravi2
Contributor Author

@sryza Thanks Sandy. Will do.

@tgravescs
Contributor

@mridulm any comments?

I'm ok with it if it's a consistent problem for users. One thing we definitely need to do is document it, and possibly look at including better log and error messages. We should at least log the size of the overhead it calculates. It would also be nice to log what it is when we fail to get a large enough container, or when the request fails because the cluster max allocation limit was hit.
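
Something along these lines is what I have in mind for the logging (a sketch with made-up names, not the actual allocator code; the 0.07 factor and 384 MB floor are taken from this discussion as assumptions):

```scala
object OverheadLoggingSketch {
  val OverheadFactor = 0.07
  val OverheadMinMB  = 384 // assumed flat lower bound

  def requestContainer(executorMemoryMB: Int, clusterMaxAllocationMB: Int): Unit = {
    // Compute the default overhead and log the total request so failures are diagnosable.
    val overheadMB = math.max((OverheadFactor * executorMemoryMB).toInt, OverheadMinMB)
    val totalMB = executorMemoryMB + overheadMB
    println(s"Requesting container: executor memory $executorMemoryMB MB + overhead $overheadMB MB = $totalMB MB")
    if (totalMB > clusterMaxAllocationMB) {
      println(s"Error: requested $totalMB MB exceeds the cluster max allocation of " +
        s"$clusterMaxAllocationMB MB; lower executor memory or the memory overhead")
    }
  }

  def main(args: Array[String]): Unit =
    requestContainer(executorMemoryMB = 30720, clusterMaxAllocationMB = 16384)
}
```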

@nishkamravi2
Contributor Author

Have redone the PR against the recent master branch, which has undergone significant structural changes for YARN. Addressed review comments and changed the multiplier back to 0.07 (to err on the conservative side, since customers are running into this issue).

@sryza
Contributor

sryza commented Sep 22, 2014

If #2485 is the replacement, can we close this one out?

@nishkamravi2
Contributor Author

Shall we let this linger on for just a bit until the other one gets merged?

@nishkamravi2
Contributor Author

Noticed that we have a reference to this one in #2485, closing it out.

asfgit pushed a commit that referenced this pull request Oct 2, 2014
Modify default YARN memory_overhead-- from an additive constant to a multiplier

Redone against the recent master branch (#1391)

Author: Nishkam Ravi <[email protected]>
Author: nravi <[email protected]>
Author: nishkamravi2 <[email protected]>

Closes #2485 from nishkamravi2/master_nravi and squashes the following commits:

636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles