[SPARK-16505][yarn] Optionally propagate error during shuffle service startup. #14162
Conversation
This prevents the NM from starting when something is wrong, which would otherwise lead to later errors that are confusing and harder to debug. Added a unit test to verify that startup fails if something is wrong.
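For illustration, here is a minimal sketch (not the actual Spark code; the class name and the state-directory key are hypothetical, assuming Hadoop's auxiliary-service API) of the pattern the PR argues for: a serviceInit() that lets setup failures propagate, so the NM fails fast instead of keeping a broken shuffle service registered.

```java
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
import org.apache.hadoop.yarn.server.api.ApplicationTerminationContext;
import org.apache.hadoop.yarn.server.api.AuxiliaryService;

// Hypothetical auxiliary service, used only to illustrate fail-fast startup.
public class FailFastShuffleService extends AuxiliaryService {

  public FailFastShuffleService() {
    super("spark_shuffle");
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // Previously the body was wrapped in a catch-all that only logged, leaving
    // the NM up with a non-functional shuffle service. Letting the exception
    // escape makes the NM refuse to start instead.
    File stateDir = new File(conf.get("hypothetical.shuffle.state.dir", "/tmp/shuffle-state"));
    if (!stateDir.isDirectory() && !stateDir.mkdirs()) {
      throw new IOException("Cannot create shuffle service state dir: " + stateDir);
    }
    super.serviceInit(conf);
  }

  @Override
  public void initializeApplication(ApplicationInitializationContext context) {
    // No-op in this sketch.
  }

  @Override
  public void stopApplication(ApplicationTerminationContext context) {
    // No-op in this sketch.
  }

  @Override
  public ByteBuffer getMetaData() {
    return ByteBuffer.allocate(0);
  }
}
```

When serviceInit() throws, Hadoop's service framework aborts NodeManager startup, so the problem is visible to the operator immediately rather than surfacing later as confusing application failures.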
So we had specifically decided on this behavior when this was first written. The reason is that an issue with the Spark shuffle service shouldn't stop other services, i.e. the MapReduce shuffle service, from running fine on the NM. The node still works fine for MR even if there is a bug in the Spark shuffle service. This was definitely a concern when we first released this; it isn't as much of an issue now. We talked about this again recently and again decided to leave the behavior, the reasoning being that it should fail fast: as soon as the executor registers it would fail, so there wouldn't be any wasted work. I guess this could cause the job to fail if it kept trying to launch on some bad node. Or is it not really killing the executor? What is the case where you are seeing this issue? I'm OK with changing it if we have a good reason.
Test build #62190 has finished for PR 14162 at commit
Well, I guess "good reason" is in the eye of the beholder. :-) My argument for the change is to avoid users complaining when their app fails because of this. Instead, they'd get a clear warning that something is not right on a certain NM, and the rest of the world would keep going, using other NMs. I can see the argument for not affecting other services; but how does the MR service behave? Also, the current behavior leads not only to app failures, but also to a lot of noise in the NM's logs. So at least that part could be fixed, if people really don't want the behavior change, so that the shuffle service is truly not running instead of running in a broken state. Finally, this is just a single error spot that is being filtered; e.g., if there's a port conflict, the Spark shuffle service will already throw an exception and prevent the NM from starting.
To answer my own question: the MR service throws exceptions when there's an error during initialization. (See …)
Correct, MR throws an exception, but again a lot of that is legacy from Hadoop 1.x, when it was the only thing running; the auxiliary services were originally added just for that. You could very well argue it's a bug there too. So you are seeing users hit the max number of executor failures then? Without blacklisting I can definitely see this as an issue. The other question is why it is failing to start on the NM. Is it just something bad that happened on one node, or across the cluster?
I'm trying to get access to more information about the original issue to answer all the questions; I know it was caused because, for some reason, the YARN "local-dirs" were all read-only and the leveldb files could not be created. That could potentially be caused by a bad disk.
Some more info: you're correct in that, in this case, the application eventually fails because containers keep getting started on the "bad" node.
I'd be curious if you find out what was wrong with that node. If it's the leveldb file not being created, that should be fixed by aab99d3, which is supposed to use the approved recovery path; if that path is bad, I believe the NodeManager and all its services won't come up. But ignoring the actual cause, I think if we put this in we should make it configurable, with the default being not to throw. From a YARN point of view I don't necessarily want one bad service to take the entire cluster down. For instance, let's say we have a bug in the Spark shuffle service and we deploy it to a 5000-node cluster; this change now causes none of the NodeManagers to come up. But if my workload on that cluster is such that Spark is only like 1%, I don't necessarily want that to block the other 99% of jobs on the cluster while I try to fix the Spark shuffle handler or roll it back. This should also get better once we have the node blacklisting stuff in.
That's great, and in my view that also means that any failure in the startup of the shuffle service should actually be caused by something wrong with the environment, and not the shuffle service code, so this change shouldn't harm anybody. :-)
To be fair, if you have a bug in another part of the shuffle service that is not in the startup path, it could still take out your whole cluster. That can't be fixed until the NM runs aux services in separate processes.
Are you talking about SPARK-8425? If you are, I don't think that changes anything here, since the executor isn't even coming up, and that blacklisting is based on tasks failing. I'm not against adding an option, I just don't really see it as necessary. But if you feel strongly that the Spark shuffle service shouldn't ever affect NM startup, then I can add it.
Not sure what you mean by this; normally, if there is a bug in the shuffle service it only affects Spark, since Spark is the only one trying to access it. The other routines called by the NM, like initializeApplication and stopApplication, all catch exceptions too. Although, looking at the NM code, it doesn't matter, because it catches them itself and just logs. Obviously if it's really bad, such that it causes a segfault or a memory leak, it can still take the NM out, but a normal exception while processing a request from a Spark application shouldn't. Actually, now that you point it out, the try/catch that you removed should really be around the rest of the code in the init function as well. But yes, I would like to see a config, because we will keep running with the current behavior. If you want to run it without the catch, then we need a config to be able to run in both modes.
Memory leaks, unbounded file descriptor usage, lack of throttling, unexpected calls to … But sure, I'll add a config.
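A minimal sketch of the config-gated behavior being agreed on here, assuming it lives inside an AuxiliaryService subclass like the one sketched earlier. STOP_ON_FAILURE_KEY is the constant shown in the review snippet further down; the helper method and the default value are illustrative assumptions, not the actual implementation.

```java
@Override
protected void serviceInit(Configuration conf) throws Exception {
  // Default assumed to be "false" (keep the legacy behavior) for illustration.
  boolean stopOnFailure = conf.getBoolean(STOP_ON_FAILURE_KEY, false);
  try {
    // Hypothetical helper holding the original initialization code
    // (recovery DB setup, starting the transport server, etc.).
    doInit(conf);
  } catch (Exception e) {
    if (stopOnFailure) {
      // Opt-in behavior: propagate, so the NM refuses to start.
      throw e;
    } else {
      // Legacy behavior: record the failure and leave the NM (and other aux
      // services such as the MR shuffle handler) running.
      noteFailure(e);
    }
  }
}
```

This keeps the default non-disruptive for clusters where Spark is only a small fraction of the workload, while letting operators opt into fail-fast startup.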
Test build #62261 has finished for PR 14162 at commit
```java
// Whether failure during service initialization should stop the NM.
@VisibleForTesting
static final String STOP_ON_FAILURE_KEY = "spark.yarn.shuffle.stopOnFailure";
```
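Since the constant is marked @VisibleForTesting, a natural companion is a test that turns the flag on and asserts that initialization fails. The following is only a hedged sketch of such a test, not the one added in this PR: the service class is assumed to combine the two sketches above (fail-fast init body plus the config-gated catch), and the state-directory key is hypothetical.

```java
import static org.junit.Assert.fail;

import org.apache.hadoop.conf.Configuration;
import org.junit.Test;

public class StopOnFailureSuite {

  @Test
  public void startupFailurePropagatesWhenConfigured() {
    Configuration conf = new Configuration();
    conf.setBoolean("spark.yarn.shuffle.stopOnFailure", true);
    // Point the hypothetical state directory somewhere that cannot be created.
    conf.set("hypothetical.shuffle.state.dir", "/proc/not-writable/shuffle-state");

    FailFastShuffleService service = new FailFastShuffleService();
    try {
      // AbstractService.init() delegates to serviceInit(), so a propagated
      // startup error surfaces here.
      service.init(conf);
      fail("Expected initialization to fail when stopOnFailure is enabled");
    } catch (Exception expected) {
      // The startup failure reached the caller, which is exactly what
      // stopOnFailure=true opts into.
    }
  }
}
```

Catching the generic Exception keeps the sketch independent of exactly how Hadoop's service framework wraps failures thrown from serviceInit().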
It would be nice to document this so others know about it. There is a section in the job scheduling doc about the external shuffle service; otherwise maybe just put it in the YARN section. We probably should have a pointer from the YARN section to the job scheduling section, or just a standalone section on the shuffle service, but that is a different issue.
I looked around but couldn't find any documentation specific to the YARN shuffle service, but let me look again.
Minor suggestion on docs, otherwise looks good.
Also move the YARN-specific setup to a section in the YARN documentation, and just link to it from the job scheduling page.
Test build #62281 has finished for PR 14162 at commit
+1, thanks for fixing up the docs.