
Conversation

@LucaCanali (Contributor) commented Aug 24, 2017

What changes were proposed in this pull request?

This feature introduces a mechanism that lets users specify a list of cluster nodes where executors/tasks should not run for a specific job.
The proposed implementation that I tested (see PR) uses the Spark blacklist mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list of user-specified nodes is added to the blacklist when the SparkContext starts, and it never expires.
I have tested this on a YARN cluster on a case taken from the original production problem, and I measured a performance improvement of about 5x for that specific test case. I imagine there are other cases where Spark users may want to blacklist a set of nodes, for example for troubleshooting, including cases where certain nodes/executors are slow for a given workload because of external agents, so the anomaly is not picked up by the cluster manager.
See also SPARK-21829
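
For illustration, a minimal sketch of how a job could set the proposed parameter (the hostnames are placeholders; spark.blacklist.alwaysBlacklistedNodes is the name proposed in this PR):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: enable the blacklist and list the nodes that should never
// receive executors/tasks for this job.
val conf = new SparkConf()
  .setAppName("blacklist-demo")
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.alwaysBlacklistedNodes", "node1.example.com,node2.example.com")
val sc = new SparkContext(conf)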

How was this patch tested?

A test has been added to BlacklistTrackerSuite.
The patch has also been successfully tested manually on a YARN cluster.

@AmplabJenkins

Can one of the admins verify this patch?

@LucaCanali LucaCanali changed the title Add feature to permanently blacklist a user-specified list of nodes, … [SPARK-21519][CORE] Add feature to permanently blacklist a user-specified list of nodes, … Aug 24, 2017
@LucaCanali LucaCanali changed the title [SPARK-21519][CORE] Add feature to permanently blacklist a user-specified list of nodes, … [SPARK-21829][CORE] Add feature to permanently blacklist a user-specified list of nodes, … Aug 24, 2017
@jiangxb1987 (Contributor)

Please change the title to:

[SPARK-21829][CORE] Enable config to permanently blacklist a list of nodes

@LucaCanali LucaCanali changed the title [SPARK-21829][CORE] Add feature to permanently blacklist a user-specified list of nodes, … [SPARK-21829][CORE] Enable config to permanently blacklist a list of nodes Aug 24, 2017
@jiangxb1987 (Contributor) left a comment

TBH I'm not sure this is a useful feature, cc @jerryshao @cloud-fan

* The blacklist timeout is set to a large value, effectively never expiring.
*/
private val permanentlyBlacklistedNodes: Seq[String] =
conf.get("spark.blacklist.alwaysBlacklistedNodes", "").split(',').map(_.trim).filter(_ != "")

We should make this a private function and initialize the val _nodeBlacklist with it.
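
For illustration, a minimal sketch of such a helper, assuming it lives in BlacklistTracker and that the parameter keeps the name proposed in the PR:

// Sketch only: parse the comma-separated node list from the configuration.
private def getAlwaysBlacklistedNodes(conf: SparkConf): Seq[String] =
  conf.get("spark.blacklist.alwaysBlacklistedNodes", "")
    .split(',')
    .map(_.trim)
    .filter(_.nonEmpty)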

* Blacklists permanently the nodes listed in spark.blacklist.alwaysBlacklistedNodes
* The blacklist timeout is set to a large value, effectively never expiring.
*/
private val permanentlyBlacklistedNodes: Seq[String] =
The configuration should go to internal/config.
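
For reference, a rough sketch of what an entry in org.apache.spark.internal.config could look like; the constant name BLACKLIST_ALWAYS_BLACKLISTED_NODES is illustrative, not necessarily what the patch ends up using:

// Sketch only: declare the proposed parameter with ConfigBuilder.
private[spark] val BLACKLIST_ALWAYS_BLACKLISTED_NODES =
  ConfigBuilder("spark.blacklist.alwaysBlacklistedNodes")
    .doc("Comma-separated list of node hostnames that are permanently blacklisted.")
    .stringConf
    .toSequence
    .createWithDefault(Nil)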

@LucaCanali (Contributor, Author)

Thanks @jiangxb1987 for the review. I have tried to address the comments in a new commit, in particular adding the configuration to internal/config and building a private function to handle processing of the node list in spark.blacklist.alwaysBlacklistedNodes. As for setting _nodeBlacklist, I think it makes sense to use _nodeBlacklist.set(nodeIdToBlacklistExpiryTime.keySet.toSet) to keep it consistent with the rest of the code in BlacklistTracker. Also, nodeIdToBlacklistExpiryTime needs to be initialized with the blacklisted nodes.
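
A rough sketch of the initialization described above, assuming BlacklistTracker's existing nodeIdToBlacklistExpiryTime map and _nodeBlacklist reference, with Long.MaxValue standing in for the never-expiring timeout:

// Sketch only: seed the expiry map so the configured nodes effectively never
// leave the blacklist, then publish the set via _nodeBlacklist, consistent with
// the rest of BlacklistTracker.
permanentlyBlacklistedNodes.foreach { node =>
  nodeIdToBlacklistExpiryTime.put(node, Long.MaxValue)
}
_nodeBlacklist.set(nodeIdToBlacklistExpiryTime.keySet.toSet)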

As for the usefulness of the feature, I understand your comment and I have added some comments in SPARK-21829. The need for this feature for me comes from a production issue, which I realize is not very common, but I guess can happen again in my environment and maybe in others'.
We have a shared YARN cluster, and one workload runs slow on a couple of nodes; the nodes are fine for other types of jobs, so we want to keep them in the cluster. The actual problem comes from reading from an external file system, and apparently only for this specific workload (which is only one of many workloads that run on that cluster). As a workaround, what I have done so far to make the job run faster is simply to kill the executors on the two "slow nodes"; the job then finished faster because it avoided the painfully slow long tail of execution on the affected nodes. The proposed patch/feature is an attempt to address this case in a more structured way than logging onto the nodes and killing executors.

@jerryshao (Contributor) commented Aug 25, 2017

The changes you made in BlacklistTracker seem to break the design purpose of the blacklist. The blacklist in Spark, as well as in MR/Tez, assumes bad nodes/executors will be back to normal in several hours, so it always has a timeout for the blacklist.

In your case, the problem is not bad nodes/executors; it is that you don't want to start executors on some nodes (like slow nodes). This is more of a cluster manager problem than a Spark problem. To summarize, you want your Spark application to run only on specific nodes.

To solve your problem, on YARN you could use node labels, and Spark on YARN already supports node labels. You can look up node labels to learn the details.

For standalone mode, simply don't start workers on the nodes you don't want to use.

For Mesos I'm not sure, but I guess it has similar approaches.
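
For reference, the node-label route on YARN can be expressed through existing Spark configuration; a minimal sketch, where "my_label" is a placeholder for a label defined on the YARN side:

import org.apache.spark.SparkConf

// Sketch only: restrict the AM and executors to nodes carrying a given YARN node label.
val conf = new SparkConf()
  .set("spark.yarn.am.nodeLabelExpression", "my_label")
  .set("spark.yarn.executor.nodeLabelExpression", "my_label")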

@jiangxb1987 (Contributor)

@LucaCanali Does the alternative approach suggested by @jerryshao sound good for your case?

@LucaCanali (Contributor, Author)

@jiangxb1987 Indeed a good suggestion by @jerryshao - I have replied on SPARK-21829.

srowen added a commit to srowen/spark that referenced this pull request Sep 12, 2017
@srowen srowen mentioned this pull request Sep 12, 2017
@asfgit asfgit closed this in dd88fa3 Sep 13, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#18522
Closes apache#17722
Closes apache#18879
Closes apache#18891
Closes apache#18806
Closes apache#18948
Closes apache#18949
Closes apache#19070
Closes apache#19039
Closes apache#19142
Closes apache#18515
Closes apache#19154
Closes apache#19162
Closes apache#19187
Closes apache#19091

Author: Sean Owen <[email protected]>

Closes apache#19203 from srowen/CloseStalePRs3.