[SPARK-21829][CORE] Enable config to permanently blacklist a list of nodes #19039
Conversation
Can one of the admins verify this patch?

Please change the title to:
jiangxb1987 left a comment
TBH I'm not sure this is a useful feature, cc @jerryshao @cloud-fan
```scala
 * The blacklist timeout is set to a large value, effectively never expiring.
 */
private val permanentlyBlacklistedNodes: Seq[String] =
  conf.get("spark.blacklist.alwaysBlacklistedNodes", "").split(',').map(_.trim).filter(_ != "")
```
We should make this a private function and initialize the val `_nodeBlacklist` with it.
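A minimal sketch of what that refactor could look like, assuming the field and config names from the diff above; the class name and the `AtomicReference` shape of the node blacklist are assumptions for illustration, not the reviewer's exact intent or the final PR code:

```scala
import java.util.concurrent.atomic.AtomicReference

import org.apache.spark.SparkConf

// Illustrative only: mirrors the review suggestion, not the actual BlacklistTracker code.
class PermanentBlacklistSketch(conf: SparkConf) {

  // Parse the proposed spark.blacklist.alwaysBlacklistedNodes property in a private helper.
  private def permanentlyBlacklistedNodes(): Set[String] =
    conf.get("spark.blacklist.alwaysBlacklistedNodes", "")
      .split(',')
      .map(_.trim)
      .filter(_.nonEmpty)
      .toSet

  // Initialize the published node blacklist with the permanently blacklisted nodes,
  // instead of keeping them in a separate val.
  val nodeBlacklist = new AtomicReference[Set[String]](permanentlyBlacklistedNodes())
}
```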
```scala
 * Blacklists permanently the nodes listed in spark.blacklist.alwaysBlacklistedNodes
 * The blacklist timeout is set to a large value, effectively never expiring.
 */
private val permanentlyBlacklistedNodes: Seq[String] =
```
The configuration should go to internal/config.
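For reference, a sketch of what such an entry could look like; the entry name, doc string, and default below are assumptions, not the final code:

```scala
// Sketch only: in Spark this would sit in org.apache.spark.internal.config's package
// object, next to the other spark.blacklist.* entries, where ConfigBuilder is in scope.
private[spark] val BLACKLIST_ALWAYS_BLACKLISTED_NODES =
  ConfigBuilder("spark.blacklist.alwaysBlacklistedNodes")
    .doc("Comma-separated list of nodes that are permanently added to the blacklist.")
    .stringConf
    .toSequence
    .createWithDefault(Nil)
```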
Thanks @jiangxb1987 for the review. I have tried to address the comments in a new commit, in particular adding the configuration to internal/config and building a private function to handle processing of the node list. As for the usefulness of the feature, I understand your comment and I have added some comments in SPARK-21829. The need for this feature comes from a production issue, which I realize is not very common, but I guess it can happen again in my environment and maybe in others'.
Regarding the changes you made: in your case, the problem is not bad nodes/executors, it is that you don't want to start executors on some nodes (like slow nodes). This is more a cluster manager problem than a Spark problem. To summarize, you want your Spark application to run only on specific nodes. To solve this, on YARN you could use node labels, and Spark on YARN already supports node labels; you can look up YARN node labels for the details. For standalone, simply do not start workers on the nodes you want to avoid. For Mesos I'm not sure, but I guess it has similar approaches.
@LucaCanali Does the alternative approach suggested by @jerryshao sound good for your case?
@jiangxb1987 Indeed a good suggestion by @jerryshao - I have replied on SPARK-21829.
Closes apache#18522 Closes apache#17722 Closes apache#18879 Closes apache#18891 Closes apache#18806 Closes apache#18948 Closes apache#18949 Closes apache#19070 Closes apache#19039 Closes apache#19142 Closes apache#18515 Closes apache#19154 Closes apache#19162 Closes apache#19187 Closes apache#19091 Author: Sean Owen <[email protected]> Closes apache#19203 from srowen/CloseStalePRs3.
What changes were proposed in this pull request?
This feature introduces a mechanism that allows users to specify a list of nodes in the cluster where executors/tasks should not run for a specific job.
The proposed implementation uses the Spark blacklist mechanism: with the parameter spark.blacklist.alwaysBlacklistedNodes, a list of user-specified nodes is added to the blacklist at the start of the SparkContext and never expires.
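For illustration, this is how the proposed parameter would be set when creating the context; the host names are made up, and it is assumed the feature builds on the existing spark.blacklist.enabled switch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical host names; with the proposed change, executors/tasks would never
// be scheduled on these two nodes for this application.
val conf = new SparkConf()
  .setAppName("blacklist-example")
  // Existing switch for the Spark blacklisting mechanism this PR builds on.
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.alwaysBlacklistedNodes", "badnode1.example.com,badnode2.example.com")

val sc = new SparkContext(conf)
```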
I have tested this on a YARN cluster, on a case taken from the original production problem, and I measured a performance improvement of about 5x for that specific test case. I imagine there are other cases where Spark users may want to blacklist a set of nodes. This can be used for troubleshooting, including cases where certain nodes/executors are slow for a given workload because of external agents, so the anomaly is not picked up by the cluster manager.
See also SPARK-21829
How was this patch tested?
A test has been added to BlacklistTrackerSuite.
The patch has also been successfully tested manually on a YARN cluster.
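Not the actual suite test, but an illustration of the kind of check it performs, assuming the parsing behaviour shown in the diff above (entries are trimmed and empty entries dropped); it re-implements the parsing expression locally and uses ScalaTest:

```scala
import org.apache.spark.SparkConf
import org.scalatest.funsuite.AnyFunSuite

// Standalone sketch: copies the parsing expression from the diff so the expected
// behaviour can be checked without touching BlacklistTracker internals.
class AlwaysBlacklistedNodesSketchSuite extends AnyFunSuite {

  private def parseNodes(conf: SparkConf): Seq[String] =
    conf.get("spark.blacklist.alwaysBlacklistedNodes", "")
      .split(',').map(_.trim).filter(_ != "")

  test("alwaysBlacklistedNodes is parsed into a trimmed list without empty entries") {
    val conf = new SparkConf()
      .set("spark.blacklist.alwaysBlacklistedNodes", " host1 ,host2,, ")
    assert(parseNodes(conf) === Seq("host1", "host2"))
  }
}
```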