[SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error #1274
Conversation
Can one of the admins verify this patch?
Scala style thing, you can use flatMap instead of foreach here and return None in the case where directory creation failed and Some(localDir) in the case where it worked.
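The reviewer's suggestion can be sketched as follows. This is an illustrative stand-in, not the actual Spark code: the helper name `createLocalDirs` and the error handling are assumptions, but it shows the `flatMap` pattern of returning `None` on failure and `Some(dir)` on success instead of collecting results by side effect inside a `foreach`.

```scala
import java.io.File

// Hedged sketch of the suggested style: flatMap over the configured root dirs,
// keeping only the directories that could actually be created.
def createLocalDirs(rootDirs: Seq[String]): Array[File] = {
  rootDirs.flatMap { rootDir =>
    try {
      val dir = new File(rootDir)
      // Return Some(dir) when the directory exists or was created, None otherwise.
      if (dir.isDirectory || dir.mkdirs()) Some(dir) else None
    } catch {
      case e: Exception =>
        System.err.println(s"Failed to create local dir in $rootDir, ignoring it: $e")
        None
    }
  }.toArray
}
```

With `flatMap`, bad paths simply drop out of the result, so no mutable buffer or early `System.exit` is needed.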
Jenkins, ok to test.
This change seems reasonable because on large clusters, we occasionally see that a single disk on a single machine has failed, and this may cause the entire application to crash because the executor will keep getting restarted until the Master kills the application. It also allows a more uniform configuration for a heterogeneous cluster with different numbers of disks. The downside of this behavioral change is that a misconfiguration like mistyping one of your local dirs may go unnoticed for a while, but this will hopefully become apparent after a …
Thanks @aarondav. I've modified the code. Please review again.
LGTM. Merging into master.
@aarondav @YanTangZhai I think that this patch introduced a bug in

```scala
/**
 * Get a temporary directory using Spark's spark.local.dir property, if set. This will always
 * return a single directory, even though the spark.local.dir property might be a list of
 * multiple paths.
 */
def getLocalDir(conf: SparkConf): String = {
  conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(',')(0)
}
```

After this patch, the first directory might be missing, which can lead to confusing errors when we try to create files in it. How should we fix this? Maybe have the disk manager be the authoritative source of local directories and update all other code to use it rather than the raw …
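One possible shape of the fix being proposed can be sketched like this. The names `validatedLocalDirs` and this simplified `conf` map are hypothetical stand-ins: the idea is that `getLocalDir` should prefer the disk manager's already-validated directory list over re-splitting the raw configuration value, so it never hands back a directory that failed to be created.

```scala
// Hedged sketch: treat the disk manager's validated list as the authoritative
// source of local directories, falling back to the raw config only if empty.
def getLocalDir(validatedLocalDirs: Seq[String], conf: Map[String, String]): String =
  validatedLocalDirs.headOption.getOrElse(
    conf.getOrElse("spark.local.dir", System.getProperty("java.io.tmpdir")).split(',')(0)
  )
```

Because the validated list excludes any directory whose creation failed, the "first configured path is missing" bug cannot resurface through this entry point.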
[SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error

Author: yantangzhai <[email protected]>

Closes apache#1274 from YanTangZhai/SPARK-2324 and squashes the following commits:

609bf48 [yantangzhai] [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
df08673 [yantangzhai] [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
spark.local.dir is configured as a list of multiple paths, e.g. /data1/sparkenv/local,/data2/sparkenv/local. If the data2 disk of the driver node has an error, the application will exit, since DiskBlockManager exits directly in createLocalDirs. If the data2 disk of a worker node has an error, the executor will exit as well.
DiskBlockManager should not exit directly in createLocalDirs if one of the spark.local.dir paths has an error. Since spark.local.dir has multiple paths, a problem with one of them should not affect the whole application.
I think DiskBlockManager could ignore the bad directory in createLocalDirs.
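The behavior described above can be sketched as follows. The function name `resolveLocalDirs` is hypothetical and this is a simplification of what DiskBlockManager does: skip any unusable path from the comma-separated spark.local.dir value, and only give up when every configured path is bad.

```scala
import java.io.File

// Hedged sketch: tolerate individual bad paths in spark.local.dir, but still
// fail if no usable directory remains at all.
def resolveLocalDirs(sparkLocalDir: String): Seq[File] = {
  val usable = sparkLocalDir.split(",").toSeq.flatMap { path =>
    val dir = new File(path.trim)
    // Keep the path only if it already is, or can be made, a directory.
    if (dir.isDirectory || dir.mkdirs()) Some(dir) else None
  }
  require(usable.nonEmpty, s"Failed to create any local dir from '$sparkLocalDir'")
  usable
}
```

This matches the argument in the description: one failed disk out of several should shrink the set of local directories, not crash the driver or executor.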