Modify default YARN memory_overhead-- from an additive constant to a multiplier #2485
Conversation
…extFiles The prefix "file:" is missing in the string inserted as key in HashMap
…onsistent with rest of Spark
…multiplier (redone to resolve merge conflicts)
|
Can one of the admins verify this patch? |
The quantities in this message may be unclear to those not familiar with the overhead. Maybe something like "each with %d memory including %d overhead"?
Also, not the fault of this PR, but "Allocate" shouldn't be capitalized.
@tgravescs I believe we already print out a nasty error message when a container can't be allocated because of the max allocation limit. Are you saying we should indicate whether the overhead made the difference? |
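A minimal sketch of the wording suggested above, purely for illustration; the object name and the values are made up and this is not the PR's actual code:

```scala
// Illustrative only: shows the suggested "each with ... including ... overhead"
// message shape with hypothetical values, not the merged log statement.
object OverheadMessageSketch {
  def main(args: Array[String]): Unit = {
    val numExecutors   = 3
    val executorMemory = 15 * 1024   // MB requested by the user
    val memoryOverhead = 1024        // MB added on top (hypothetical value)
    println("Will allocate %d executor containers, each with %d MB memory including %d MB overhead"
      .format(numExecutors, executorMemory + memoryOverhead, memoryOverhead))
  }
}
```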
|
Updated as per @sryza's comments |
|
Yes, it would be nice to tell the user what the overhead limit is calculated to be, as I might not realize there is overhead and that it's dependent upon the multiplier. i.e. I told it to use 15GB, so why is it erroring saying the max size is 16GB? I see it's already being printed for the executors in YarnAllocator, so maybe just adding one more log statement in ClientBase to print what the ApplicationMaster one is would be sufficient. We could also modify this error statement to break it out: |
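A hedged sketch of what those two messages could look like; the names (amMemory, amMemoryOverhead, maxMem) and values are assumptions for illustration, not the code that was actually merged:

```scala
// Hypothetical sketch of the extra ClientBase log line and the broken-out
// error message suggested above.
object AmOverheadMessages {
  def main(args: Array[String]): Unit = {
    val amMemory         = 15 * 1024   // MB the user asked for
    val amMemoryOverhead = 1024        // MB derived from the multiplier (assumed)
    val maxMem           = 16 * 1024   // MB YARN allows per container

    // One extra log statement so the AM overhead is visible up front.
    println(s"ApplicationMaster memory overhead is $amMemoryOverhead MB; " +
      s"total AM container request is ${amMemory + amMemoryOverhead} MB")

    // Error message that breaks the overhead out, so "I asked for 15 GB, why
    // does it complain about 16 GB?" can be answered from the message alone.
    if (amMemory + amMemoryOverhead > maxMem) {
      sys.error(s"Required AM memory ($amMemory MB) plus overhead ($amMemoryOverhead MB) " +
        s"is above the max threshold ($maxMem MB) of this cluster.")
    }
  }
}
```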
|
Updated as per @tgravescs's comments |
|
This looks good to me. |
|
Jenkins, test this please |
|
Jenkins, retest this please. |
|
@JoshRosen Any idea why Jenkins isn't running on this? Could you kick it manually? |
|
@pwendell @mateiz @andrewor14 can any of you kick jenkins? |
|
I just kicked it from the |
|
QA tests have started for PR 2485 at commit
|
|
Ah sorry, looks like something conflicts now and it needs to be upmerged. @nishkamravi2 can you please upmerge? |
|
QA tests have finished for PR 2485 at commit
|
…nravi Conflicts: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
|
totalMemory can be calculated differently for the two code paths. The overhead percentage will have to be different too; that's fine as long as they follow the same semantics/logic. |
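A sketch of the multiplier-based overhead under discussion. The factor, the floor, and the memory sizes below are assumptions, not the defaults that shipped:

```scala
// Overhead scales with the container size instead of being a fixed constant.
object OverheadMultiplierSketch {
  def overhead(memoryMb: Int, factor: Double = 0.07, floorMb: Int = 384): Int =
    math.max((memoryMb * factor).toInt, floorMb)

  def main(args: Array[String]): Unit = {
    val executorMemory = 8 * 1024   // MB for an executor container
    val amMemory       = 2 * 1024   // MB for the ApplicationMaster
    // The two code paths may pick different factors, as long as both follow the
    // same "memory + multiplier-based overhead" semantics.
    println(s"executor total = ${executorMemory + overhead(executorMemory)} MB")
    println(s"AM total       = ${amMemory + overhead(amMemory)} MB")
  }
}
```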
|
Why can't they both share the same config parameters, for example? I understand the implementation differences, but we shouldn't need to have distinct config params. |
|
For one, it would mean a change in the UI, which breaks existing deployments; there should be a compelling reason to do so. |
|
So I guess there's nothing to do. |
|
I think PR #2401 can be modeled after this one. Instead of defining overhead as a percentage, it could (and probably should) be defined as an absolute value. Also, spark.executor.memory.overhead.minimum is redundant and adds confusion/complexity for the developer. |
|
Naturally you wouldn't want to have to change yours. I'll drop the |
|
Hey, I just talked to @pwendell about this. I think it's better for us to have a YARN config and a Mesos config, but not generalize this to use a common |
|
That's fair. I'm updating the PR to make that Mesos specific now. |
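For illustration, a sketch of keeping a YARN knob and a Mesos knob instead of one shared config. `spark.yarn.executor.memoryOverhead` already existed; the Mesos key and the default values shown here are assumptions, not the committed behavior:

```scala
import org.apache.spark.SparkConf

// Each resource manager reads its own override, falling back to a multiplier.
object PerSchedulerOverheadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(loadDefaults = false)
    val executorMemoryMb = 8 * 1024

    val yarnOverhead = conf.getInt("spark.yarn.executor.memoryOverhead",
      math.max((0.07 * executorMemoryMb).toInt, 384))
    val mesosOverhead = conf.getInt("spark.mesos.executor.memoryOverhead",
      math.max((0.10 * executorMemoryMb).toInt, 384))

    println(s"YARN container request: ${executorMemoryMb + yarnOverhead} MB")
    println(s"Mesos executor request: ${executorMemoryMb + mesosOverhead} MB")
  }
}
```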
|
QA tests have finished for PR 2485 at commit
|
|
Test FAILed. |
|
retest this please |
|
QA tests have started for PR 2485 at commit
|
|
QA tests have finished for PR 2485 at commit
|
|
Test FAILed. |
|
Need some help interpreting the test results. Not clear which one is failing. |
|
It's the Python ones. This is unlikely to be related to your patch. Let's retest this please. |
|
QA tests have started for PR 2485 at commit
|
|
QA tests have finished for PR 2485 at commit
|
|
Test PASSed. |
|
@andrewor14 did you have any further comments on this? |
|
I think this is fine. I spotted one semicolon but I'll let that go. LGTM. |
|
Semicolon removed (nice catch) |
|
retest this please |
|
I committed this. I missed that there wasn't a JIRA here, so I filed https://issues.apache.org/jira/browse/SPARK-3768. |
|
Thanks @tgravescs |
Redone against the recent master branch (#1391)