
Conversation

@LucaCanali (Contributor) commented Aug 24, 2017

What changes were proposed in this pull request?

This feature introduces a mechanism that lets users specify a list of cluster nodes where executors/tasks should not run for a specific job.
The proposed implementation that I tested (see PR) uses the Spark blacklist mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list of user-specified nodes is added to the blacklist when the SparkContext starts, and it never expires.
I have tested this on a YARN cluster on a case taken from the original production problem, and I measured a performance improvement of about 5x for that specific test case. I imagine there are other cases where Spark users may want to blacklist a set of nodes, for example for troubleshooting, including cases where certain nodes/executors are slow for a given workload because of external agents, so the anomaly is not picked up by the cluster manager.
See also SPARK-21829
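
For illustration, a minimal sketch of how a job could set the proposed parameter (the hostnames are placeholders; spark.blacklist.alwaysBlacklistedNodes is the name proposed in this PR):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: enable the blacklist and list the nodes that should never
// receive executors/tasks for this job.
val conf = new SparkConf()
  .setAppName("blacklist-demo")
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.alwaysBlacklistedNodes", "node1.example.com,node2.example.com")
val sc = new SparkContext(conf)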

How was this patch tested?

A test has been added to BlacklistTrackerSuite.
The patch has also been successfully tested manually on a YARN cluster.

@AmplabJenkins

Can one of the admins verify this patch?

@LucaCanali LucaCanali changed the title Add feature to permanently blacklist a user-specified list of nodes, … [SPARK-21519][CORE] Add feature to permanently blacklist a user-specified list of nodes, … Aug 24, 2017
@LucaCanali LucaCanali changed the title [SPARK-21519][CORE] Add feature to permanently blacklist a user-specified list of nodes, … [SPARK-21829][CORE] Add feature to permanently blacklist a user-specified list of nodes, … Aug 24, 2017
@jiangxb1987 (Contributor)

Please change the title to:

[SPARK-21829][CORE] Enable config to permanently blacklist a list of nodes

@LucaCanali LucaCanali changed the title [SPARK-21829][CORE] Add feature to permanently blacklist a user-specified list of nodes, … [SPARK-21829][CORE] Enable config to permanently blacklist a list of nodes Aug 24, 2017
@jiangxb1987 (Contributor) left a comment

TBH I'm not sure this is a useful feature, cc @jerryshao @cloud-fan

* The blacklist timeout is set to a large value, effectively never expiring.
*/
private val permanentlyBlacklistedNodes: Seq[String] =
conf.get("spark.blacklist.alwaysBlacklistedNodes", "").split(',').map(_.trim).filter(_ != "")

We should make this a private function and initialize the val _nodeBlacklist with it.
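
For illustration, a minimal sketch of such a helper, assuming it lives in BlacklistTracker and that the parameter keeps the name proposed in the PR:

// Sketch only: parse the comma-separated node list from the configuration.
private def getAlwaysBlacklistedNodes(conf: SparkConf): Seq[String] =
  conf.get("spark.blacklist.alwaysBlacklistedNodes", "")
    .split(',')
    .map(_.trim)
    .filter(_.nonEmpty)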

* Blacklists permanently the nodes listed in spark.blacklist.alwaysBlacklistedNodes
* The blacklist timeout is set to a large value, effectively never expiring.
*/
private val permanentlyBlacklistedNodes: Seq[String] =
The configuration should go to internal/config.
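
For reference, a rough sketch of what an entry in org.apache.spark.internal.config could look like; the constant name BLACKLIST_ALWAYS_BLACKLISTED_NODES is illustrative, not necessarily what the patch ends up using:

// Sketch only: declare the proposed parameter with ConfigBuilder.
private[spark] val BLACKLIST_ALWAYS_BLACKLISTED_NODES =
  ConfigBuilder("spark.blacklist.alwaysBlacklistedNodes")
    .doc("Comma-separated list of node hostnames that are permanently blacklisted.")
    .stringConf
    .toSequence
    .createWithDefault(Nil)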

@LucaCanali (Contributor, Author)

Thanks @jiangxb1987 for the review. I have tried to address the comments in a new commit, in particular adding the configuration to internal/config and building a private function to handle processing of the node list in spark.blacklist.alwaysBlacklistedNodes. As for setting _nodeBlacklist, I think it makes sense to use _nodeBlacklist.set(nodeIdToBlacklistExpiryTime.keySet.toSet) to keep it consistent with the rest of the code in BlacklistTracker. Also, nodeIdToBlacklistExpiryTime needs to be initialized with the blacklisted nodes.
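
A rough sketch of the initialization described above, assuming BlacklistTracker's existing nodeIdToBlacklistExpiryTime map and _nodeBlacklist reference, with Long.MaxValue standing in for the never-expiring timeout:

// Sketch only: seed the expiry map so the configured nodes effectively never
// leave the blacklist, then publish the set via _nodeBlacklist, consistent with
// the rest of BlacklistTracker.
permanentlyBlacklistedNodes.foreach { node =>
  nodeIdToBlacklistExpiryTime.put(node, Long.MaxValue)
}
_nodeBlacklist.set(nodeIdToBlacklistExpiryTime.keySet.toSet)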

As for the usefulness of the feature, I understand your comment and I have added some comments in SPARK-21829. The need for this feature for me comes from a production issue, which I realize is not very common, but I guess can happen again in my environment and maybe in others'.
We have a shared YARN cluster, and one workload runs slow on a couple of nodes; the nodes are fine for other types of jobs, so we want to keep them in the cluster. The actual problem comes from reading from an external file system, and apparently only for this specific workload (which is only one of many workloads that run on that cluster). As a workaround, what I have done so far to make the job run faster is simply to kill the executors on the two "slow nodes"; the job then finished faster because it avoided the painfully slow long tail of execution on the affected nodes. The proposed patch/feature is an attempt to address this case in a more structured way than logging onto the nodes and killing executors.

@jerryshao (Contributor) commented Aug 25, 2017

The changes you made in BlacklistTracker seem to break the design purpose of the blacklist. The blacklist in Spark, as well as in MR/Tez, assumes bad nodes/executors will be back to normal in several hours, so it always has a timeout for the blacklist.

In your case, the problem is not bad nodes/executors; it is that you don't want to start executors on some nodes (like slow nodes). This is more of a cluster manager problem than a Spark problem. To summarize, you want your Spark application to run only on specific nodes.

To solve your problem, on YARN you could use node labels, and Spark on YARN already supports node labels. You can look up node labels to learn the details.

For standalone mode, simply don't start workers on the nodes you don't want to use.

For Mesos I'm not sure, but I guess it has similar approaches.
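
For reference, the node-label route on YARN can be expressed through existing Spark configuration; a minimal sketch, where "my_label" is a placeholder for a label defined on the YARN side:

import org.apache.spark.SparkConf

// Sketch only: restrict the AM and executors to nodes carrying a given YARN node label.
val conf = new SparkConf()
  .set("spark.yarn.am.nodeLabelExpression", "my_label")
  .set("spark.yarn.executor.nodeLabelExpression", "my_label")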

@jiangxb1987 (Contributor)

@LucaCanali Does the alternative approach suggested by @jerryshao sound good for your case?

@LucaCanali (Contributor, Author)

@jiangxb1987 Indeed a good suggestion by @jerryshao - I have replied on SPARK-21829.

srowen added a commit to srowen/spark that referenced this pull request Sep 12, 2017
@srowen srowen mentioned this pull request Sep 12, 2017
@asfgit asfgit closed this in dd88fa3 Sep 13, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#18522
Closes apache#17722
Closes apache#18879
Closes apache#18891
Closes apache#18806
Closes apache#18948
Closes apache#18949
Closes apache#19070
Closes apache#19039
Closes apache#19142
Closes apache#18515
Closes apache#19154
Closes apache#19162
Closes apache#19187
Closes apache#19091

Author: Sean Owen <[email protected]>

Closes apache#19203 from srowen/CloseStalePRs3.