
Conversation

vanzin (Contributor) commented Jul 12, 2016

This prevents the NM from starting when something is wrong, which would otherwise lead to later errors that are confusing and harder to debug.

Added a unit test to verify that startup fails if something is wrong.

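
To make the intent concrete, here is a rough, self-contained sketch of the pattern being changed (plain Java for illustration only; this is not the actual YarnShuffleService code, and all names in it are made up): before the change an initialization failure is caught and merely logged, leaving a broken service registered; after it, the failure is allowed to escape so startup aborts immediately.

```java
// Illustration only: a stand-in for an NM auxiliary service's init path.
public class StartupPropagationSketch {

  /** Old behavior: swallow the failure and keep "running" in a broken state. */
  static void initSwallowing() {
    try {
      openLevelDbAndBindPort();
    } catch (Exception e) {
      // Only logged; nothing upstream learns that startup failed, so every
      // later shuffle request produces confusing secondary errors.
      System.err.println("Shuffle service failed to start: " + e);
    }
  }

  /** Proposed behavior: let the failure propagate so startup aborts. */
  static void initPropagating() throws Exception {
    openLevelDbAndBindPort();
  }

  /** Placeholder for the real work (creating leveldb files, binding the port). */
  static void openLevelDbAndBindPort() throws Exception {
    throw new java.io.IOException("local dirs are read-only");
  }

  public static void main(String[] args) throws Exception {
    initSwallowing();   // logs the error and continues
    initPropagating();  // throws, aborting startup
  }
}
```
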
vanzin (Contributor, Author) commented Jul 12, 2016

@tgravescs

tgravescs (Contributor) commented

So we had specifically decided to have this behavior when this was first written. The reason is that an issue with the Spark shuffle service shouldn't stop other services from running fine on the NM, i.e. the MapReduce shuffle service. The node still works fine for MR even if there is a bug in the Spark shuffle service. This was definitely a concern when we first released this. That isn't as much of an issue now.

We had talked about this again recently and again decided to leave this behavior; the reason is that it should fail fast, i.e. as soon as it registers, the executor would fail and there wouldn't be any wasted work. I guess this could cause the job to fail if it kept trying to launch on some bad node. Or is it not really killing the executor?

What is the case where you are seeing this issue? I'm ok with changing it if we have a good reason.


SparkQA commented Jul 12, 2016

Test build #62190 has finished for PR 14162 at commit 61a40b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

vanzin (Contributor, Author) commented Jul 12, 2016

> What is the case where you are seeing this issue? I'm ok with changing it if we have a good reason.

Well, I guess "good reason" is in the eye of the beholder. :-) My argument for the change is to avoid users complaining when their app fails because of this. Instead, they'd have a clear warning that something is not right on a certain NM, and the rest of the world would keep going, using other NMs.

I can see the argument for not affecting other services; but how does the MR service behave?

Also, the current behavior leads not only to app failures, but also to a lot of noise in the NM's logs. So at least that part could be fixed, if people really don't want the behavior change, so that the shuffle service is really not running instead of running in a broken state.

Finally, this is a single error spot that is being filtered... e.g., if there's a port conflict, the Spark shuffle service will throw an exception and prevent the NM from starting.

vanzin (Contributor, Author) commented Jul 12, 2016

To answer my own question: the MR service throws exceptions when there's an error during initialization. (See ShuffleHandler.java; there's no swallowing of exceptions during serviceInit and serviceStart.)

tgravescs (Contributor) commented Jul 12, 2016

Correct, MR throws an exception, but again a lot of that is legacy from Hadoop 1.x when it was the only thing running. The auxiliary services were added just for that originally. You could very well argue it's a bug there too.

So you are seeing users hit the max number of executor failures then? Without blacklisting I can definitely see this as an issue.

The other question is why it is failing to start on the NM. Is it just something bad that happened on one node, or across the cluster?

vanzin (Contributor, Author) commented Jul 12, 2016

I'm trying to get access to more information about the original issue to answer all the questions; I know it was caused because, for some reason, the YARN "local-dirs" were all read-only and the leveldb files could not be created, which could potentially be caused by a bad disk.

vanzin (Contributor, Author) commented Jul 12, 2016

Some more info: you're correct in that, in this case, the application eventually fails because containers keep getting started on the "bad" node.

tgravescs (Contributor) commented

I'd be curious if you find out what was wrong with that node.

If it's the leveldb file not being created, that should be fixed by aab99d3, which is supposed to use the approved recovery path; and if that is bad, I believe the nodemanager and all services won't come up.

But ignoring the actual cause, I think if we put this in we should make it configurable, with the default being to not throw. From a YARN point of view, I don't necessarily want one bad service to take the entire cluster down. For instance, let's say we have a bug in the Spark shuffle service and we try to deploy a 5000-node cluster; this change now causes none of the nodemanagers to come up. But my workload on that cluster is such that Spark is only like 1%. I don't necessarily want that to block the other 99% of jobs on that cluster while I try to fix the Spark shuffle handler or roll it back.

This also should get better once we have the node blacklisting stuff in.

vanzin (Contributor, Author) commented Jul 12, 2016

> If it's the leveldb file not being created, that should be fixed by aab99d3

That's great, and in my view that also means that any failure in the startup of the shuffle service should actually be caused by something wrong with the environment, and not the shuffle service code, so this change shouldn't harm anybody. :-)

> For instance, let's say we have a bug in the Spark shuffle service

To be fair, if you have a bug in another part of the shuffle service that is not in the startup path, it could still take out your whole cluster. That can't be fixed until the NM runs aux services in separate processes.

> This also should get better once we have the node blacklisting stuff in.

Are you talking about SPARK-8425? If you are, I don't think that changes anything here, since the executor isn't even coming up, and that blacklisting is based on tasks failing.

I'm not against adding an option; I just don't really see it as necessary. But if you feel strongly that the Spark shuffle service should never affect NM startup, then I can add it.

tgravescs (Contributor) commented

> To be fair, if you have a bug in another part of the shuffle service that is not in the startup path, it could still take out your whole cluster. That can't be fixed until the NM runs aux services in separate processes.

Not sure what you mean by this; normally, if there's a bug in the shuffle service, it only affects Spark, since Spark is the only one trying to access it. The other routines like initializeApplication and stopApplication called by the NM all catch exceptions as well. Although, looking at the NM code, it doesn't matter, because it catches them itself and just logs. Obviously, if it's really bad such that it causes a segfault or a memory leak, it can still take the NM out, but a normal exception while processing a request from a Spark application shouldn't.

Actually, now that you point it out, the try/catch that you removed should really be around the rest of the code in the init function as well.

But yes, I would like to see a config, because we will run it with the current behavior. If you want to run it without the catch, then we need a config to be able to run in both modes.
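
As a minimal sketch of where this discussion converges (illustration only, not the literal patch; the config key matches the one the patch introduces below, everything else is a placeholder), the try/catch wraps the entire init body and a flag decides whether the failure is rethrown or handled the old way:

```java
import java.util.Properties;

// Sketch of a config-gated init path; not the actual YarnShuffleService.
public class ConfigGatedInitSketch {

  static final String STOP_ON_FAILURE_KEY = "spark.yarn.shuffle.stopOnFailure";
  // Default keeps the current log-and-continue behavior.
  static final boolean DEFAULT_STOP_ON_FAILURE = false;

  /** Init with the try/catch around the *entire* body, as suggested above. */
  static void serviceInit(Properties conf) throws Exception {
    boolean stopOnFailure = Boolean.parseBoolean(
        conf.getProperty(STOP_ON_FAILURE_KEY, String.valueOf(DEFAULT_STOP_ON_FAILURE)));
    try {
      createRecoveryDb();
      startTransportServer();
    } catch (Exception e) {
      if (stopOnFailure) {
        // Propagate so the NM fails fast instead of hosting a broken service.
        throw e;
      }
      // Old behavior: log and keep the NM (and its other aux services) running.
      System.err.println("Failed to initialize external shuffle service: " + e);
    }
  }

  // Placeholders for the real initialization steps.
  static void createRecoveryDb() throws Exception {
    throw new java.io.IOException("cannot create leveldb in a read-only local dir");
  }

  static void startTransportServer() throws Exception { }

  public static void main(String[] args) throws Exception {
    serviceInit(new Properties());                    // key unset: log and continue

    Properties strict = new Properties();
    strict.setProperty(STOP_ON_FAILURE_KEY, "true");  // key set: failure aborts startup
    serviceInit(strict);
  }
}
```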

vanzin (Contributor, Author) commented Jul 13, 2016

> Not sure what you mean by this; normally, if there's a bug in the shuffle service, it only affects Spark, since Spark

Memory leaks, unbounded file descriptor usage, lack of throttling, unexpected calls to System.exit, and other things like that affect the whole NM, not just Spark. Since we're talking about hypothetical bugs...

But sure, I'll add a config.

vanzin changed the title from "[SPARK-16505][yarn] Propagate error during shuffle service startup." to "[SPARK-16505][yarn] Optionally propagate error during shuffle service startup." on Jul 13, 2016

SparkQA commented Jul 13, 2016

Test build #62261 has finished for PR 14162 at commit 23c8131.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


// Whether failure during service initialization should stop the NM.
@VisibleForTesting
static final String STOP_ON_FAILURE_KEY = "spark.yarn.shuffle.stopOnFailure";
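
A brief usage sketch (assumed, not part of the patch as shown) of how a key like this is typically read from the Hadoop Configuration handed to the aux service, with a default chosen so deployments that don't set it keep the current behavior:

```java
import org.apache.hadoop.conf.Configuration;

// Usage sketch only; field names mirror the snippet above, the rest is assumed.
public class StopOnFailureFlagSketch {

  static final String STOP_ON_FAILURE_KEY = "spark.yarn.shuffle.stopOnFailure";
  // Assumed default: false, i.e. an init failure does not stop the NM.
  static final boolean DEFAULT_STOP_ON_FAILURE = false;

  /** Returns whether an initialization failure should abort NM startup. */
  static boolean stopOnFailure(Configuration conf) {
    // getBoolean falls back to the default when operators have not set the key.
    return conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE);
  }
}
```
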
tgravescs (Contributor) commented on the diff:

It would be nice to document this so others know about it. There is a section in the job scheduling doc about the external shuffle service; otherwise maybe just put it in the YARN section. We probably should have a pointer from the YARN section to the job scheduling section, or just a standalone section on the shuffle service, but that is a different issue.

vanzin (Contributor, Author) replied:

I looked around but couldn't find any documentation specific to the YARN shuffle service, but let me look again.

tgravescs (Contributor) commented

Minor suggestion on docs, otherwise looks good.

Also move the YARN-specific setup to a section in the YARN documentation, and just link to it from the job scheduling page.

SparkQA commented Jul 13, 2016

Test build #62281 has finished for PR 14162 at commit 226a165.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

tgravescs (Contributor) commented

+1, thanks for fixing up the docs.
