Modify default YARN memory_overhead-- from an additive constant to a multiplier #2485
Conversation
…extFiles The prefix "file:" is missing in the string inserted as key in HashMap
…onsistent with rest of Spark
…multiplier (redone to resolve merge conflicts)
|
Can one of the admins verify this patch? |
The quantities in this message may be unclear to those not familiar with the overhead. Maybe something like "each with %d memory including %d overhead"?
Also, not the fault of this PR, but "Allocate" shouldn't be capitalized.
@tgravescs I believe we already print out a nasty error message when a container can't be allocated because of the max allocation limit. Are you saying we should indicate whether the overhead made the difference? |
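A minimal sketch of the wording suggested above, purely for illustration; the object name and the values are made up and this is not the PR's actual code:

```scala
// Illustrative only: shows the suggested "each with ... including ... overhead"
// message shape with hypothetical values, not the merged log statement.
object OverheadMessageSketch {
  def main(args: Array[String]): Unit = {
    val numExecutors   = 3
    val executorMemory = 15 * 1024   // MB requested by the user
    val memoryOverhead = 1024        // MB added on top (hypothetical value)
    println("Will allocate %d executor containers, each with %d MB memory including %d MB overhead"
      .format(numExecutors, executorMemory + memoryOverhead, memoryOverhead))
  }
}
```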
|
Updated as per @sryza's comments |
|
Yes, it would be nice to tell the user what the overhead limit is calculated to be, as I might not realize there is overhead and that it's dependent upon the multiplier. i.e. I told it to use 15GB, so why is it erroring saying the max size is 16GB? I see it's already being printed for the executors in YarnAllocator, so maybe just adding one more log statement in ClientBase to print what the ApplicationMaster one is would be sufficient. We could also modify this error statement to break it out: |
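A hedged sketch of what those two messages could look like; the names (amMemory, amMemoryOverhead, maxMem) and values are assumptions for illustration, not the code that was actually merged:

```scala
// Hypothetical sketch of the extra ClientBase log line and the broken-out
// error message suggested above.
object AmOverheadMessages {
  def main(args: Array[String]): Unit = {
    val amMemory         = 15 * 1024   // MB the user asked for
    val amMemoryOverhead = 1024        // MB derived from the multiplier (assumed)
    val maxMem           = 16 * 1024   // MB YARN allows per container

    // One extra log statement so the AM overhead is visible up front.
    println(s"ApplicationMaster memory overhead is $amMemoryOverhead MB; " +
      s"total AM container request is ${amMemory + amMemoryOverhead} MB")

    // Error message that breaks the overhead out, so "I asked for 15 GB, why
    // does it complain about 16 GB?" can be answered from the message alone.
    if (amMemory + amMemoryOverhead > maxMem) {
      sys.error(s"Required AM memory ($amMemory MB) plus overhead ($amMemoryOverhead MB) " +
        s"is above the max threshold ($maxMem MB) of this cluster.")
    }
  }
}
```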
|
Updated as per @tgravescs's comments |
|
This looks good to me. |
|
Jenkins, test this please |
|
Jenkins, retest this please. |
|
@JoshRosen Any idea why Jenkins isn't running on this? Could you kick it manually? |
|
@pwendell @mateiz @andrewor14 can any of you kick jenkins? |
|
I just kicked it from the |
|
QA tests have started for PR 2485 at commit
|
|
Ah sorry, looks like something conflicts now and it needs to be upmerged. @nishkamravi2 can you please upmerge? |
|
QA tests have finished for PR 2485 at commit
|
…nravi Conflicts: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
|
totalMemory can be calculated differently for the two code paths. The overhead percentage will have to be different too; that's fine as long as they follow the same semantics/logic. |
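A sketch of the multiplier-based overhead under discussion. The factor, the floor, and the memory sizes below are assumptions, not the defaults that shipped:

```scala
// Overhead scales with the container size instead of being a fixed constant.
object OverheadMultiplierSketch {
  def overhead(memoryMb: Int, factor: Double = 0.07, floorMb: Int = 384): Int =
    math.max((memoryMb * factor).toInt, floorMb)

  def main(args: Array[String]): Unit = {
    val executorMemory = 8 * 1024   // MB for an executor container
    val amMemory       = 2 * 1024   // MB for the ApplicationMaster
    // The two code paths may pick different factors, as long as both follow the
    // same "memory + multiplier-based overhead" semantics.
    println(s"executor total = ${executorMemory + overhead(executorMemory)} MB")
    println(s"AM total       = ${amMemory + overhead(amMemory)} MB")
  }
}
```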
|
Why can't they both share the same config parameters, for example? I understand the implementation differences, but we shouldn't need to have distinct config params. |
|
For one, it would mean a change in the UI, which breaks existing deployments; there should be a compelling reason to do so. |
|
So I guess there's nothing to do. |
|
I think PR #2401 can be modeled after this one. Instead of defining overhead as a percentage, it could (and probably should) be defined as an absolute value. Also, spark.executor.memory.overhead.minimum is redundant and adds confusion/complexity for the developer. |
|
Naturally you wouldn't want to have to change yours. I'll drop the |
|
Hey, I just talked to @pwendell about this. I think it's better for us to have a YARN config and a Mesos config, but not generalize this to use a common |
|
That's fair. I'm updating the PR to make that Mesos specific now. |
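For illustration, a sketch of keeping a YARN knob and a Mesos knob instead of one shared config. `spark.yarn.executor.memoryOverhead` already existed; the Mesos key and the default values shown here are assumptions, not the committed behavior:

```scala
import org.apache.spark.SparkConf

// Each resource manager reads its own override, falling back to a multiplier.
object PerSchedulerOverheadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(loadDefaults = false)
    val executorMemoryMb = 8 * 1024

    val yarnOverhead = conf.getInt("spark.yarn.executor.memoryOverhead",
      math.max((0.07 * executorMemoryMb).toInt, 384))
    val mesosOverhead = conf.getInt("spark.mesos.executor.memoryOverhead",
      math.max((0.10 * executorMemoryMb).toInt, 384))

    println(s"YARN container request: ${executorMemoryMb + yarnOverhead} MB")
    println(s"Mesos executor request: ${executorMemoryMb + mesosOverhead} MB")
  }
}
```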
|
QA tests have finished for PR 2485 at commit
|
|
Test FAILed. |
|
retest this please |
|
QA tests have started for PR 2485 at commit
|
|
QA tests have finished for PR 2485 at commit
|
|
Test FAILed. |
|
Need some help interpreting the test results. Not clear which one is failing. |
|
It's the Python ones. This is unlikely to be related to your patch. Let's retest this please. |
|
QA tests have started for PR 2485 at commit
|
|
QA tests have finished for PR 2485 at commit
|
|
Test PASSed. |
|
@andrewor14 did you have any further comments on this? |
|
I think this is fine. I spotted one semicolon but I'll let that go. LGTM. |
|
Semicolon removed (nice catch) |
|
retest this please |
|
I committed this. I missed that there wasn't a JIRA here, so I filed https://issues.apache.org/jira/browse/SPARK-3768. |
|
Thanks @tgravescs |
Redone against the recent master branch (#1391)