Conversation

@nishkamravi2
Contributor

Redone against the recent master branch (#1391)

@SparkQA

SparkQA commented Sep 22, 2014

Can one of the admins verify this patch?

Contributor

The quantities in this message may be unclear to those not familiar with the overhead. Maybe something like "each with %d memory including %d overhead"?

Also, not the fault of this PR, but "Allocate" shouldn't be capitalized.
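For concreteness, a minimal sketch of the suggested wording; the variable names and values below are illustrative assumptions, not the PR's actual code:

```scala
// Illustrative sketch of the suggested log wording; names and numbers
// here are assumptions, not the PR's actual code.
val executorMemory = 1024 // MB requested by the user
val memoryOverhead = 384  // MB of overhead added per container
val total = executorMemory + memoryOverhead
val msg = "Will allocate %d executor containers, each with %d MB memory including %d MB overhead"
  .format(3, total, memoryOverhead)
```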

@sryza
Contributor

sryza commented Sep 22, 2014

It would also be nice to log what the overhead is when we fail to get a large enough container, or when allocation fails because the cluster's max allocation limit was hit.

@tgravescs I believe we already print out a nasty error message when a container can't be allocated because of the max allocation limit. Are you saying we should indicate whether the overhead made the difference?

@nishkamravi2
Contributor Author

Updated as per @sryza's comments

@tgravescs
Contributor

Yes, it would be nice to tell the user what the overhead limit is calculated to be, as I might not realize there is overhead and that it's dependent upon the multiplier. I.e., I told it to use 15GB, so why is it erroring saying the max size is 16GB?

I see it's already being printed for the executors in YarnAllocator, so maybe just adding one more log statement in ClientBase to print what the ApplicationMaster one is would be sufficient.

We could also modify this error statement to break it out:

    val errorMessage = "Required AM memory (%d) is above the max threshold (%d) of this cluster."
      .format(amMem, maxMem)
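A sketch of how that break-out might read, with the overhead shown separately; the `amMemoryOverhead` name and all sample values are assumptions for illustration:

```scala
// Sketch only: breaks the AM memory error message out to show the overhead.
// amMemoryOverhead and the sample values are illustrative assumptions.
val amMem = 15360           // MB required for the AM, including overhead
val amMemoryOverhead = 1024 // MB of overhead that was added
val maxMem = 16384          // MB cluster max allocation
val errorMessage =
  ("Required AM memory (%d MB) including overhead (%d MB) is above the " +
   "max threshold (%d MB) of this cluster.").format(amMem, amMemoryOverhead, maxMem)
```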

@nishkamravi2
Contributor Author

Updated as per @tgravescs's comments

@sryza
Contributor

sryza commented Sep 23, 2014

This looks good to me.

@tgravescs
Contributor

Jenkins, test this please

@tgravescs
Contributor

Jenkins, retest this please.

@tgravescs
Contributor

@JoshRosen Any idea why Jenkins isn't running on this? Could you kick it manually?

@tgravescs
Contributor

@pwendell @mateiz @andrewor14 can any of you kick jenkins?

@JoshRosen
Contributor

I just kicked it from the spark-prs parameterized build trigger; let's wait and see if it starts...

@SparkQA

SparkQA commented Sep 24, 2014

QA tests have started for PR 2485 at commit f00fa31.

  • This patch does not merge cleanly!

@tgravescs
Contributor

Ah sorry, looks like something conflicts now and it needs to be upmerged.

@nishkamravi2 can you please upmerge

@SparkQA

SparkQA commented Sep 24, 2014

QA tests have finished for PR 2485 at commit f00fa31.

  • This patch passes unit tests.
  • This patch does not merge cleanly!

…nravi

Conflicts:
	yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
@nishkamravi2
Contributor Author

calculateTotalMemory can be defined differently for the two code paths. The overhead percentage will have to be different too, as long as they follow the same semantics/logic.
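To make the "same semantics" point concrete, a minimal sketch of an absolute-overhead calculation; the 384 MB default and the function name here are assumptions, not code from either PR:

```scala
// Sketch: total container memory = requested memory + an absolute overhead
// in MB, falling back to an assumed default when the user sets none.
def totalMemory(executorMemoryMb: Int, overheadMb: Option[Int]): Int = {
  val defaultOverheadMb = 384 // assumed default, not authoritative
  executorMemoryMb + overheadMb.getOrElse(defaultOverheadMb)
}
```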

@brndnmtthws
Member

Why can't they both share the same config parameters, for example? I understand the implementation differences, but we shouldn't need to have distinct config params.

@nishkamravi2
Contributor Author

For one, it would mean a change in the UI, which would break existing deployments, and there should be a compelling reason to do so.

@brndnmtthws
Member

So I guess there's nothing to do.

@nishkamravi2
Contributor Author

I think PR #2401 can be modeled after this one. Instead of defining overhead as a percentage, it could (and probably should) be defined as an absolute value. Also, spark.executor.memory.overhead.minimum is redundant and adds confusion/complexity for the developer.
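To illustrate the difference between the two schemes being debated; all numbers below are made up for illustration:

```scala
// Sketch contrasting the two schemes: a percentage with a minimum floor
// (as discussed for PR #2401) versus a single absolute value in MB.
// Fractions and floors here are illustrative assumptions.
def overheadWithMinimum(memMb: Int, fraction: Double, minimumMb: Int): Int =
  math.max((memMb * fraction).toInt, minimumMb)

def overheadAbsolute(overheadMb: Int): Int = overheadMb
```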

@brndnmtthws
Member

Naturally you wouldn't want to have to change yours.

I'll drop the .minimum thing, and prefix the config params with .mesos, like you've done for yarn.

@andrewor14
Contributor

Hey I just talked to @pwendell about this. I think it's better for us to have a yarn config and a mesos config, but not generalize this to use a common spark.executor.memory.overhead.* config. This is because this memory overhead doesn't make sense for standalone mode or other cluster managers that don't launch executors in containers. I think it's fine as long as the two yarn and mesos configs have the same semantics, so the user of one mode is not confused when they switch to another.
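A sketch of what keeping the configs separate but semantically aligned might look like; the mesos key simply mirrors the yarn one, and the exact names should be treated as assumptions:

```scala
// Sketch: one overhead key per cluster manager that launches executors in
// containers, with identical semantics. Key names are assumptions here.
def overheadKey(clusterManager: String): Option[String] = clusterManager match {
  case "yarn"  => Some("spark.yarn.executor.memoryOverhead")
  case "mesos" => Some("spark.mesos.executor.memoryOverhead")
  case _       => None // standalone etc.: no container overhead concept
}
```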

@brndnmtthws
Member

That's fair. I'm updating the PR to make that Mesos-specific now.

@SparkQA

SparkQA commented Sep 26, 2014

QA tests have finished for PR 2485 at commit 8f76c8b.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class IDF(val minDocFreq: Int)
    • class DocumentFrequencyAggregator(val minDocFreq: Int) extends Serializable
    • class PStatsParam(AccumulatorParam):

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20866/

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Sep 26, 2014

QA tests have started for PR 2485 at commit 8f76c8b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 26, 2014

QA tests have finished for PR 2485 at commit 8f76c8b.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20870/

@nishkamravi2
Contributor Author

Need some help interpreting the test results. Not clear which one is failing.

@andrewor14
Contributor

It's the Python ones. This is unlikely to be related to your patch. Let's retest this please.

@SparkQA

SparkQA commented Sep 26, 2014

QA tests have started for PR 2485 at commit 8f76c8b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 26, 2014

QA tests have finished for PR 2485 at commit 8f76c8b.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20887/

@tgravescs
Contributor

@andrewor14 did you have any further comments on this?

@andrewor14
Contributor

I think this is fine. I spotted one semicolon but I'll let that go. LGTM.

@nishkamravi2
Contributor Author

Semicolon removed (nice catch)

@tgravescs
Contributor

retest this please

@asfgit closed this in b4fb7b8 on Oct 2, 2014
@tgravescs
Contributor

I committed this. I missed there wasn't a jira here so filed https://issues.apache.org/jira/browse/SPARK-3768.

@nishkamravi2
Contributor Author

Thanks @tgravescs

8 participants