Conversation

@YanTangZhai
Contributor

spark.local.dir is configured as a list of multiple paths, for example /data1/sparkenv/local,/data2/sparkenv/local. If the data2 disk of the driver node has an error, the application exits because DiskBlockManager exits directly in createLocalDirs. If the data2 disk of a worker node has an error, the executor exits as well.
DiskBlockManager should not exit directly in createLocalDirs when only one of the spark.local.dir paths has an error. Since spark.local.dir contains multiple paths, a problem with one of them should not affect the whole application.
I think DiskBlockManager could ignore the bad directory in createLocalDirs.

[SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
@AmplabJenkins

Can one of the admins verify this patch?

Contributor


Scala style thing: you can use flatMap instead of foreach here, returning None in the case where directory creation failed and Some(localDir) in the case where it worked.
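
For illustration, a minimal sketch of that flatMap pattern in Scala (the method shape and directory-naming scheme here are assumptions for the example, not the actual DiskBlockManager code):

  import java.io.File
  import java.util.UUID

  // Hypothetical sketch: flatMap keeps only the directories that could be
  // created, instead of aborting on the first failure.
  def createLocalDirs(rootDirs: Seq[String]): Seq[File] = {
    rootDirs.flatMap { rootDir =>
      try {
        val localDir = new File(rootDir, s"spark-local-${UUID.randomUUID()}")
        if (localDir.mkdirs()) Some(localDir) // created successfully, keep it
        else None                             // creation failed, skip this root
      } catch {
        case _: SecurityException => None     // no permission, skip this root
      }
    }
  }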

@aarondav
Contributor

aarondav commented Jul 1, 2014

Jenkins, ok to test.

@aarondav
Contributor

aarondav commented Jul 1, 2014

This change seems reasonable: on large clusters, we occasionally see that a single disk on a single machine has failed, and this can cause the entire application to crash because the executor keeps getting restarted until the Master kills the application.

It also allows a more uniform configuration for a heterogeneous cluster with different numbers of disks.

The downside of this behavioral change is that a misconfiguration, like mistyping one of your local dirs, may go unnoticed for a while, though it should become apparent from a df or a look at any of the executor logs. The fail-fast approach is generally better, but current Spark does not do a good job of communicating why executors crash immediately on startup.

[SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
@YanTangZhai
Contributor Author

Thanks @aarondav. I've updated the code. Please review again.

@aarondav
Contributor

aarondav commented Jul 3, 2014

LGTM. Merging into master.

@JoshRosen
Contributor

@aarondav @YanTangZhai I think that this patch introduced a bug in Utils.getLocalDir(). Currently, that function assumes that all of the directories in spark.local.dir exist and returns the first directory from that list:

  /**
   * Get a temporary directory using Spark's spark.local.dir property, if set. This will always
   * return a single directory, even though the spark.local.dir property might be a list of
   * multiple paths.
   */
  def getLocalDir(conf: SparkConf): String = {
    conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(',')(0)
  }

After this patch, the first directory might be missing, which can lead to confusing errors when we try to create files in it.

How should we fix this? Maybe have the disk manager be the authoritative source of local directories and update all other code to use it rather than the raw spark.local.dir property?
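
For illustration, one possible stopgap, sketched under the assumption that callers just need some directory that exists (this filtering logic is an assumption for the example, not an agreed fix):

  import java.io.File
  import org.apache.spark.SparkConf

  // Hypothetical sketch: return the first configured directory that actually
  // exists on disk, falling back to the first entry if none of them do.
  def getLocalDir(conf: SparkConf): String = {
    val dirs = conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(',')
    dirs.find(dir => new File(dir).isDirectory).getOrElse(dirs(0))
  }

Making the disk manager the authoritative source, as suggested above, would still be the more robust fix, since it already knows which directories were actually created.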

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
[SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error

spark.local.dir is configured as a list of multiple paths, for example /data1/sparkenv/local,/data2/sparkenv/local. If the data2 disk of the driver node has an error, the application exits because DiskBlockManager exits directly in createLocalDirs. If the data2 disk of a worker node has an error, the executor exits as well.
DiskBlockManager should not exit directly in createLocalDirs when only one of the spark.local.dir paths has an error. Since spark.local.dir contains multiple paths, a problem with one of them should not affect the whole application.
I think DiskBlockManager could ignore the bad directory in createLocalDirs.

Author: yantangzhai <[email protected]>

Closes apache#1274 from YanTangZhai/SPARK-2324 and squashes the following commits:

609bf48 [yantangzhai] [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
df08673 [yantangzhai] [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error