
Conversation


@ash211 ash211 commented Feb 10, 2017

Fixes #92

This brings spark.jars and spark.files in line with the behavior of YARN and
other cluster managers.

Specifically, the following schemes are now supported:

- local:// is a container-local file assumed to be present on both driver and
  executor containers
- container:// is a synonym for local://
- file:// is a submitter-local file that's uploaded to the driver
- a no-scheme path is treated as if it had the file:// scheme

Filenames of spark.files are required to be unique, since they are all placed
in the current working directory of the driver and executors. spark.jars does
not have this restriction: jars are given a unique suffix, placed in a folder
separate from the current working directory, and added to the driver
classpath.
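
As a rough illustration of that scheme handling, a minimal sketch (the helper name `isSubmitterLocal` is hypothetical, not code from this PR):

```scala
import org.apache.spark.util.Utils

// Hypothetical helper: decides whether a path must be uploaded from the
// submitter (file:// or no scheme) or is already present in the containers
// (local:// / container://).
def isSubmitterLocal(path: String): Boolean =
  Option(Utils.resolveURI(path).getScheme).getOrElse("file") == "file"
```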
```diff
-val originalFiles = sparkProperties.get("spark.files")
+val nonUploadedFiles = sparkProperties.get("spark.files")
   .map(_.split(","))
   .map(_.filter(Utils.resolveURI(_).getScheme match {
```
This is a common enough pattern that we could think about separating it out into a separate class.

Also, some other places in core Spark have used this paradigm:

```scala
Option(Utils.resolveURI(file).getScheme).getOrElse("file") match ...
```
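
For illustration only, such an extraction might look like this (the object name `SchemeUtils` is an assumption, not part of the PR):

```scala
import org.apache.spark.util.Utils

// Hypothetical helper collecting the repeated scheme-resolution idiom.
object SchemeUtils {
  // Treat a missing scheme as "file", matching SparkSubmit conventions.
  def schemeOf(path: String): String =
    Option(Utils.resolveURI(path).getScheme).getOrElse("file")
}
```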

@mccheah mccheah Feb 11, 2017

Using the above we can shorthand a lot of things and avoid the case-match entirely. For example:

```scala
Option(Utils.resolveURI(file).getScheme).getOrElse("file") != "file"
```
The snippet continues:

```scala
  }))
  .getOrElse(Array.empty[String])
```
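
Putting the suggestion together, the filter might collapse to something like this (a sketch, assuming `nonUploadedFiles` should keep only paths with non-`file` schemes):

```scala
// Hypothetical rewrite using the getOrElse("file") idiom suggested above.
val nonUploadedFiles = sparkProperties.get("spark.files")
  .map(_.split(",").filter { file =>
    // Keep paths whose scheme is not "file" (e.g. local://), i.e. files
    // that are not uploaded from the submitter.
    Option(Utils.resolveURI(file).getScheme).getOrElse("file") != "file"
  })
  .getOrElse(Array.empty[String])
```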
```diff
-val resolvedJars = writtenJars ++ originalJars ++ Array(appResourcePath)
+val resolvedJars = writtenJars ++ nonUploadedJars ++ Array(appResourcePath)
```
Here we're not adding jars with the `local` scheme to the driver's classpath. Note that while we want to add said jars to the classpath using just the raw path, they should still be put in spark.jars with the full URI so that executors pick them up from their local disks, as opposed to having the driver upload them.
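
A minimal sketch of that handling (variable names beyond `resolvedJars` are assumptions):

```scala
// Hypothetical handling of local:// jars per the comment above.
val (containerLocalJars, uploadedJars) = resolvedJars.partition { jar =>
  Option(Utils.resolveURI(jar).getScheme).contains("local")
}
// The driver classpath gets the raw paths...
val driverExtraClasspath = containerLocalJars.map(Utils.resolveURI(_).getPath)
// ...while spark.jars keeps the full URIs so executors resolve them locally.
val sparkJarsValue = (containerLocalJars ++ uploadedJars).mkString(",")
```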

@mccheah mccheah commented Feb 11, 2017

Update docs/running-on-kubernetes.md also!

```diff
 }

-private def compressFiles(maybeFilePaths: Option[String]): Option[TarGzippedData] = {
+private def compressUploadableFiles(maybeFilePaths: Option[String]): Option[TarGzippedData] = {
```

I'm not entirely sure if it's worth keeping TarGzippedData fields as Options anymore. Sure, we save a few bytes of extra data to upload if there aren't any jars, but it's an extra layer of indirection that has to be semantically understood.
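
To illustrate the alternative (field names here are assumptions, not the actual TarGzippedData definition):

```scala
// Hypothetical: Option-wrapped fields force every caller to unwrap...
case class TarGzippedData(dataBase64: Option[String])
// ...whereas a plain field with an empty-archive default avoids the
// extra indirection at the cost of a few bytes on the wire.
case class TarGzippedDataPlain(dataBase64: String = "")
```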

```scala
  serverSparkVersion = SPARK_VERSION
}

object AppResource {
```

Mark the object as private[spark]
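
That is:

```scala
private[spark] object AppResource {
  // ...
}
```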

@mccheah mccheah commented Feb 11, 2017

I don't think we should support both local and container, simply because the latter is specific to Kubernetes but has the exact same semantics as the former. We should just use local.

@ash211 ash211 (Author) commented Feb 16, 2017

After spending some time comparing this implementation with the one in #107, I've decided that one is better: it covers several things that this one did not, including docs, SparkSubmit option updates, and remapping of paths in those two options.

Closing this PR and continuing work on the other. Sorry for the double-review folks!

@ash211 ash211 closed this Feb 16, 2017
@ash211 ash211 deleted the remove-nonstandard-upload-jars branch February 16, 2017 01:29
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request on Feb 25, 2019: "Bring first version of apache-spark-on-k8s into Palantir Spark" (…nto-palantir-spark)
