[SPARK-16505][yarn] Optionally propagate error during shuffle service startup. #14162
Conversation
This prevents the NM from starting when something is wrong, which would otherwise lead to later errors that are confusing and harder to debug. Added a unit test to verify that startup fails if something is wrong.
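For illustration, here is a minimal sketch (not the actual Spark code; the class name and the state-directory key are hypothetical, assuming Hadoop's auxiliary-service API) of the pattern the PR argues for: a serviceInit() that lets setup failures propagate, so the NM fails fast instead of keeping a broken shuffle service registered.

```java
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
import org.apache.hadoop.yarn.server.api.ApplicationTerminationContext;
import org.apache.hadoop.yarn.server.api.AuxiliaryService;

// Hypothetical auxiliary service, used only to illustrate fail-fast startup.
public class FailFastShuffleService extends AuxiliaryService {

  public FailFastShuffleService() {
    super("spark_shuffle");
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // Previously the body was wrapped in a catch-all that only logged, leaving
    // the NM up with a non-functional shuffle service. Letting the exception
    // escape makes the NM refuse to start instead.
    File stateDir = new File(conf.get("hypothetical.shuffle.state.dir", "/tmp/shuffle-state"));
    if (!stateDir.isDirectory() && !stateDir.mkdirs()) {
      throw new IOException("Cannot create shuffle service state dir: " + stateDir);
    }
    super.serviceInit(conf);
  }

  @Override
  public void initializeApplication(ApplicationInitializationContext context) {
    // No-op in this sketch.
  }

  @Override
  public void stopApplication(ApplicationTerminationContext context) {
    // No-op in this sketch.
  }

  @Override
  public ByteBuffer getMetaData() {
    return ByteBuffer.allocate(0);
  }
}
```

When serviceInit() throws, Hadoop's service framework aborts NodeManager startup, so the problem is visible to the operator immediately rather than surfacing later as confusing application failures.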
So we had specifically decided on this behavior when this was first written. The reason is that an issue with the Spark shuffle service shouldn't stop other services, i.e. the MapReduce shuffle service, from running fine on the NM. The node still works fine for MR even if there is a bug in the Spark shuffle service. This was definitely a concern when we first released this; it isn't as much of an issue now. We talked about this again recently and again decided to leave the behavior, the reasoning being that it should fail fast: as soon as the executor registers it would fail, so there wouldn't be any wasted work. I guess this could cause the job to fail if it kept trying to launch on some bad node. Or is it not really killing the executor? What is the case where you are seeing this issue? I'm OK with changing it if we have a good reason.
Test build #62190 has finished for PR 14162 at commit
Well, I guess "good reason" is in the eye of the beholder. :-) My argument for the change is to avoid users complaining when their app fails because of this. Instead, they'd get a clear warning that something is not right on a certain NM, and the rest of the world would keep going, using other NMs. I can see the argument for not affecting other services; but how does the MR service behave? Also, the current behavior leads not only to app failures, but also to a lot of noise in the NM's logs. So at least that part could be fixed, if people really don't want the behavior change, so that the shuffle service is truly not running instead of running in a broken state. Finally, this is just a single error spot that is being filtered; e.g., if there's a port conflict, the Spark shuffle service will already throw an exception and prevent the NM from starting.
To answer my own question: the MR service throws exceptions when there's an error during initialization. (See …)
Correct, MR throws an exception, but again a lot of that is legacy from Hadoop 1.x, when it was the only thing running; the auxiliary services were originally added just for that. You could very well argue it's a bug there too. So you are seeing users hit the max number of executor failures then? Without blacklisting I can definitely see this as an issue. The other question is why it is failing to start on the NM. Is it just something bad that happened on one node, or across the cluster?
I'm trying to get access to more information about the original issue to answer all the questions; I know it was caused because, for some reason, the YARN "local-dirs" were all read-only and the leveldb files could not be created. That could potentially be caused by a bad disk.
Some more info: you're correct in that, in this case, the application eventually fails because containers keep getting started on the "bad" node.
I'd be curious if you find out what was wrong with that node. If it's the leveldb file not being created, that should be fixed by aab99d3, which is supposed to use the approved recovery path; if that path is bad, I believe the NodeManager and all its services won't come up. But ignoring the actual cause, I think if we put this in we should make it configurable, with the default being not to throw. From a YARN point of view I don't necessarily want one bad service to take the entire cluster down. For instance, let's say we have a bug in the Spark shuffle service and we deploy it to a 5000-node cluster; this change now causes none of the NodeManagers to come up. But if my workload on that cluster is such that Spark is only like 1%, I don't necessarily want that to block the other 99% of jobs on the cluster while I try to fix the Spark shuffle handler or roll it back. This should also get better once we have the node blacklisting stuff in.
That's great, and in my view that also means that any failure in the startup of the shuffle service should actually be caused by something wrong with the environment, and not the shuffle service code, so this change shouldn't harm anybody. :-)
To be fair, if you have a bug in another part of the shuffle service that is not in the startup path, it could still take out your whole cluster. That can't be fixed until the NM runs aux services in separate processes.
Are you talking about SPARK-8425? If you are, I don't think that changes anything here, since the executor isn't even coming up, and that blacklisting is based on tasks failing. I'm not against adding an option, I just don't really see it as necessary. But if you feel strongly that the Spark shuffle service shouldn't ever affect NM startup, then I can add it.
Not sure what you mean by this; normally, if there is a bug in the shuffle service it only affects Spark, since Spark is the only one trying to access it. The other routines called by the NM, like initializeApplication and stopApplication, all catch exceptions too. Although, looking at the NM code, it doesn't matter, because it catches them itself and just logs. Obviously if it's really bad, such that it causes a segfault or a memory leak, it can still take the NM out, but a normal exception while processing a request from a Spark application shouldn't. Actually, now that you point it out, the try/catch that you removed should really be around the rest of the code in the init function as well. But yes, I would like to see a config, because we will keep running with the current behavior. If you want to run it without the catch, then we need a config to be able to run in both modes.
Memory leaks, unbounded file descriptor usage, lack of throttling, unexpected calls to … But sure, I'll add a config.
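A minimal sketch of the config-gated behavior being agreed on here, assuming it lives inside an AuxiliaryService subclass like the one sketched earlier. STOP_ON_FAILURE_KEY is the constant shown in the review snippet further down; the helper method and the default value are illustrative assumptions, not the actual implementation.

```java
@Override
protected void serviceInit(Configuration conf) throws Exception {
  // Default assumed to be "false" (keep the legacy behavior) for illustration.
  boolean stopOnFailure = conf.getBoolean(STOP_ON_FAILURE_KEY, false);
  try {
    // Hypothetical helper holding the original initialization code
    // (recovery DB setup, starting the transport server, etc.).
    doInit(conf);
  } catch (Exception e) {
    if (stopOnFailure) {
      // Opt-in behavior: propagate, so the NM refuses to start.
      throw e;
    } else {
      // Legacy behavior: record the failure and leave the NM (and other aux
      // services such as the MR shuffle handler) running.
      noteFailure(e);
    }
  }
}
```

This keeps the default non-disruptive for clusters where Spark is only a small fraction of the workload, while letting operators opt into fail-fast startup.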
Test build #62261 has finished for PR 14162 at commit
```java
// Whether failure during service initialization should stop the NM.
@VisibleForTesting
static final String STOP_ON_FAILURE_KEY = "spark.yarn.shuffle.stopOnFailure";
```
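Since the constant is marked @VisibleForTesting, a natural companion is a test that turns the flag on and asserts that initialization fails. The following is only a hedged sketch of such a test, not the one added in this PR: the service class is assumed to combine the two sketches above (fail-fast init body plus the config-gated catch), and the state-directory key is hypothetical.

```java
import static org.junit.Assert.fail;

import org.apache.hadoop.conf.Configuration;
import org.junit.Test;

public class StopOnFailureSuite {

  @Test
  public void startupFailurePropagatesWhenConfigured() {
    Configuration conf = new Configuration();
    conf.setBoolean("spark.yarn.shuffle.stopOnFailure", true);
    // Point the hypothetical state directory somewhere that cannot be created.
    conf.set("hypothetical.shuffle.state.dir", "/proc/not-writable/shuffle-state");

    FailFastShuffleService service = new FailFastShuffleService();
    try {
      // AbstractService.init() delegates to serviceInit(), so a propagated
      // startup error surfaces here.
      service.init(conf);
      fail("Expected initialization to fail when stopOnFailure is enabled");
    } catch (Exception expected) {
      // The startup failure reached the caller, which is exactly what
      // stopOnFailure=true opts into.
    }
  }
}
```

Catching the generic Exception keeps the sketch independent of exactly how Hadoop's service framework wraps failures thrown from serviceInit().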
It would be nice to document this so others know about it. There is a section in the job scheduling doc about the external shuffle service; otherwise maybe just put it in the YARN section. We probably should have a pointer from the YARN section to the job scheduling section, or just a standalone section on the shuffle service, but that is a different issue.
I looked around but couldn't find any documentation specific to the YARN shuffle service, but let me look again.
Minor suggestion on docs, otherwise looks good.
Also move the YARN-specific setup to a section in the YARN documentation, and just link to it from the job scheduling page.
Test build #62281 has finished for PR 14162 at commit
+1, thanks for fixing up the docs.