[SPARK-3535][Mesos] Fix resource handling. #2401
Conversation
|
Can one of the admins verify this patch? |
|
ok to test |
|
QA tests have started for PR 2401 at commit
|
|
QA tests have finished for PR 2401 at commit
|
|
Updated the diff based on the coding feedback. I've also added the same change to the coarse-grained scheduler, and updated the documentation accordingly. |
|
QA tests have started for PR 2401 at commit
|
|
QA tests have started for PR 2401 at commit
|
|
Hey, will this have compatibility issues for existing deployments? I know many clusters where they just have Spark request the entire amount of memory on the node. With this change, if a user upgrades, their jobs could just starve. What if instead we just "scale down" the size of the executor based on what the user requests? I.e. if they request 20GB executors, we reserve a few GB of that for this overhead. @andrewor14 how does this work in YARN? It might be good to have similar semantics to what they have there. |
|
YARN has a similar strategy, from what I can tell; there's a separate config value for it. We could go the other way and subtract the overhead from the heap instead. To me it's all the same: if someone is filling all of the machine's memory with heap, they're still likely to OOM, it just may not happen as quickly. |
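To make the two strategies being discussed concrete, here is a minimal sketch (illustrative only; the method names and the 1.15 multiplier are assumptions, not the patch's actual code) contrasting "add overhead on top of the heap" with "scale the heap down to fit the request":

```scala
object MemoryStrategies {
  // Strategy taken in this PR: request extra headroom from Mesos on top of
  // the heap size the user configured.
  def addOverheadOnTop(executorMemoryMb: Int, multiplier: Double = 1.15): Int =
    (executorMemoryMb * multiplier).toInt

  // Alternative raised above: keep the Mesos request at the configured size
  // and shrink the JVM heap to leave room for the overhead.
  def scaleHeapDown(requestedMemoryMb: Int, multiplier: Double = 1.15): Int =
    (requestedMemoryMb / multiplier).toInt

  def main(args: Array[String]): Unit = {
    val configured = 20 * 1024 // the 20 GB executor from the example above
    println(s"Add on top: request ${addOverheadOnTop(configured)} MB for a $configured MB heap")
    println(s"Scale down: request $configured MB and run a ${scaleHeapDown(configured)} MB heap")
  }
}
```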
|
QA tests have finished for PR 2401 at commit
|
|
QA tests have finished for PR 2401 at commit
|
|
Here's a somewhat-related concern: it seems like JVM overhead is unlikely to scale linearly with requested heap size, so maybe a straight-up 15% isn't a great default? (If you have hard data on how heap requirements grow with job size, I'd be interested in seeing it.) Perhaps it would make more sense to reserve whichever is smaller of 15% or some fixed but reasonable amount. |
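As a rough illustration of that alternative (the 2048 MB cap is an arbitrary placeholder, not a proposed value):

```scala
// Flat-percentage default versus a capped default, purely for comparison.
def flatOverheadMb(executorMemoryMb: Int, fraction: Double = 0.15): Int =
  (executorMemoryMb * fraction).toInt

def cappedOverheadMb(executorMemoryMb: Int,
                     fraction: Double = 0.15,
                     capMb: Int = 2048): Int =
  math.min(flatOverheadMb(executorMemoryMb, fraction), capMb)

// e.g. for a 64 GB executor: flat 15% reserves 9830 MB, the capped version 2048 MB.
```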
|
That implies that as you grow the heap, you're not adding threads (or other things that use off-heap memory). I'm not familiar with Spark's execution model, but I would assume that as you scale up, the workers will need more threads to serve cache requests and what have you. |
|
Oh, and one more thing you may want to think about is the OS filesystem buffers. Again, as you scale up the heap, you may want to proportionally reserve a slice of off-heap memory just for the OS. |
|
Applying soft memory limits via cgroups is one option, for environments that support cgroups: reserve a certain amount of memory, and if it is exceeded, the process starts swapping instead of failing hard. |
|
When you need to provide guarantees for other services, it's better to stick to hard limits. Having other tasks get randomly OOM killed is a bad experience. |
|
Might want to take a look at #1391 also; that adds a similar thing for YARN (a multiplier for the JVM overhead), so it might be worth merging the two code paths (or adding this to some utility class) so that everybody benefits. Another thing to maybe keep in mind is Patrick's comment in SPARK-1930 about pyspark requiring even more "overhead" to be assigned. |
This isn't actually a fraction. Perhaps what you meant is multiplier or something.
Because it's > 1? It could default to 0.15, if that's the semantics you prefer.
In my mind a fraction is a value between 0 and 1, but 1.15 is not in this range. Even in the new code MesosUtils.memoryOverheadFraction still returns a value outside of this range. Maybe it's simpler if this method just computes the executor memory for you, since it already takes in sc.
(Though the final value we use here will depend on #1391).
I was trying to keep the code looking clean, but I'll change fraction to multiplier in the method, if that seems more sensible.
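A minimal sketch of @andrewor14's suggestion that the helper simply return the final executor memory rather than exposing a ">1 fraction" (the config key, default value, and method signature here are assumptions for illustration, not what was merged):

```scala
import org.apache.spark.SparkConf

object MesosMemoryUtils {
  // Rather than exposing a "fraction" that is really a multiplier (e.g. 1.15),
  // compute the final amount of memory to request from Mesos in one place.
  def calculateTotalMemory(conf: SparkConf, executorMemoryMb: Int): Int = {
    // Key name and default are illustrative placeholders.
    val multiplier = conf.getDouble("spark.mesos.executor.memoryMultiplier", 1.15)
    (executorMemoryMb * multiplier).toInt
  }
}
```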
|
@brndnmtthws Yes, it would be good to keep the approach used in #1391 in mind. It would be ideal if we could use an approach as similar as possible to what we currently do (or plan to do) for YARN, including the naming and the semantics of the configs. Looks like we haven't quite settled on a mergeable approach for YARN yet, so let's defer until that one has been resolved. |
|
Updated as per @andrewor14's suggestions. |
|
Forgot to mention: I also set the executor CPUs correctly. |
|
QA tests have started for PR 2401 at commit
|
|
@andrewor14 This bothers me too, but in a slightly different way: calling the parameter “overhead” when it really refers to how to scale requested memory to accommodate anticipated overhead seems wrong. (115% overhead would be appalling indeed!) |
|
@willb in the case of Yarn, the parameter is called "overhead" because it actually sets the amount of overhead you want to add to the requested heap memory. The PR being referenced just uses a multiplier to set the default overhead value; the user-visible parameter doesn't set the multiplier. |
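For reference, the YARN-style semantics being described look roughly like the sketch below: the user-visible setting is an absolute overhead in MB, and a multiplier is only used to pick its default. The 384 MB floor and 0.07 multiplier are illustrative placeholders, not the exact values merged for YARN.

```scala
import org.apache.spark.SparkConf

// Overhead is an absolute amount added to the heap; the multiplier only
// influences the default when the user hasn't set it explicitly.
def yarnStyleOverheadMb(conf: SparkConf, executorMemoryMb: Int): Int = {
  val defaultOverhead = math.max(384, (executorMemoryMb * 0.07).toInt)
  conf.getInt("spark.yarn.executor.memoryOverhead", defaultOverhead)
}
// The container request would then be executorMemoryMb + yarnStyleOverheadMb(...).
```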
|
QA tests have started for PR 2401 at commit
|
|
QA tests have finished for PR 2401 at commit
|
|
Test PASSed. |
docs/configuration.md
Typo in the docs here: 348 -> 384. And also below, additionaly -> additionally
|
This looks very reasonable. Counting the executor's bookkeeping core against the resources also seems much more correct than pretending it doesn't exist, as before. Regarding the fraction-vs-absolute debate: if down the line users find setting the overhead with absolute numbers to be suboptimal, we can always add a new parameter for it. Considering Brenden's test on the cluster and his experience with Mesos, this seems good to merge pending the extremely minor typo fixes. |
|
Fixed typos (also switched from |
|
QA tests have started for PR 2401 at commit
|
|
QA tests have finished for PR 2401 at commit
|
|
Test FAILed. |
|
That test failure appears to be unrelated. |
|
retest this please |
|
QA tests have started for PR 2401 at commit
|
docs/configuration.md
I don't think spark.mesos.executor.memory exists.
|
Hey @brndnmtthws I left a few more minor comments but I think this is basically ready for merge. Thanks for your changes. |
|
Test FAILed. |
|
Test FAILed. |
- Added executor/task memory overhead
- Correctly specify CPUs for executor
- Actually check if offer is sufficient, rather than just assuming it is
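As a rough sketch of what the last bullet amounts to (the types and names are simplified placeholders, not the scheduler backend's actual code):

```scala
// A Mesos offer, reduced to the two resources this check cares about.
case class Offer(memMb: Double, cpus: Double)

// Accept an offer only if it covers the executor memory (heap plus overhead)
// and the CPUs for the task plus the CPU reserved for the executor itself.
def offerIsSufficient(offer: Offer,
                      executorMemoryWithOverheadMb: Int,
                      taskCpus: Int,
                      executorCpus: Int = 1): Boolean =
  offer.memMb >= executorMemoryWithOverheadMb &&
    offer.cpus >= taskCpus + executorCpus
```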
|
Done as per @andrewor14's suggestions. |
|
QA tests have started for PR 2401 at commit
|
|
QA tests have finished for PR 2401 at commit
|
|
Test PASSed. |
|
Ok, LGTM. I'm merging this into master and 1.1. Thanks @brndnmtthws. |
Author: Brenden Matthews <[email protected]>

Closes #2401 from brndnmtthws/master and squashes the following commits:

4abaa5d [Brenden Matthews] [SPARK-3535][Mesos] Fix resource handling.

(cherry picked from commit a8c52d5)
Signed-off-by: Andrew Or <[email protected]>