[WIP] Use spark.jars and spark.files instead of k8s-specific config #104
Conversation
This brings spark.jars and spark.files in line with the behavior of YARN and other cluster managers. Specifically, the following schemes are now supported:

- local:// is a container-local file assumed to be present on both driver and executor containers
- container:// is a synonym for local://
- file:// is a submitter-local file that is uploaded to the driver
- a no-scheme path is treated as if it had the file:// scheme

Filenames of spark.files are required to be unique, since they are all placed in the current working directory of the driver and executors. spark.jars does not have this restriction: each jar is given a unique suffix, placed in a separate folder outside the current working directory, and added to the driver classpath.
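As a rough illustration of that classification (not the PR's actual code; the helper name isSubmitterLocal is invented), the scheme dispatch could look like:

```scala
import org.apache.spark.util.Utils

// Decide whether a spark.jars/spark.files entry must be uploaded by the submitter.
// file:// or no scheme    -> submitter-local, uploaded to the driver
// local:// or container:// -> already present inside the containers
def isSubmitterLocal(path: String): Boolean =
  Option(Utils.resolveURI(path).getScheme).getOrElse("file") match {
    case "file"                => true
    case "local" | "container" => false
    case _                     => false // other schemes (hdfs://, http://) handled elsewhere
  }
```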
-    val originalFiles = sparkProperties.get("spark.files")
+    val nonUploadedFiles = sparkProperties.get("spark.files")
+      .map(_.split(","))
+      .map(_.filter(Utils.resolveURI(_).getScheme match {
This is a common enough pattern that we could think about factoring it out into a separate class.
Also, some other places in core Spark have used this paradigm:
Option(Utils.resolveURI(file).getScheme).getOrElse("file") match ...
Using the above we can shorthand a lot of things and avoid the case-match entirely. For example:
Option(Utils.resolveURI(file).getScheme).getOrElse("file") != "file"
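Applied to the filter in the diff above, that shorthand might read (a sketch reusing the names from the diff):

```scala
val nonUploadedFiles = sparkProperties.get("spark.files")
  .map(_.split(","))
  .map(_.filter { file =>
    // keep everything the driver should NOT upload, i.e. anything not submitter-local
    Option(Utils.resolveURI(file).getScheme).getOrElse("file") != "file"
  })
  .getOrElse(Array.empty[String])
```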
+      }))
+      .getOrElse(Array.empty[String])
-    val resolvedJars = writtenJars ++ originalJars ++ Array(appResourcePath)
+    val resolvedJars = writtenJars ++ nonUploadedJars ++ Array(appResourcePath)
This misses adding jars with the local scheme to the driver's classpath. Note that while we want to add those jars to the classpath using just the raw path, they should still appear in spark.jars with the full URI so that executors pick them up from their local disks, as opposed to having the driver upload them.
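A sketch of what that suggestion might look like (only writtenJars, nonUploadedJars, and appResourcePath come from the diff; the other names are invented for illustration):

```scala
// Jars baked into the container images (local:// scheme).
val localSchemeJars = nonUploadedJars.filter { jar =>
  Option(Utils.resolveURI(jar).getScheme).getOrElse("file") == "local"
}

// The driver classpath wants raw filesystem paths...
val driverExtraClasspath = localSchemeJars.map(jar => Utils.resolveURI(jar).getPath)

// ...while spark.jars keeps the full URIs so executors also resolve them from
// their own disks instead of downloading them from the driver.
val resolvedJars = writtenJars ++ nonUploadedJars ++ Array(appResourcePath)
```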
     }

-  private def compressFiles(maybeFilePaths: Option[String]): Option[TarGzippedData] = {
+  private def compressUploadableFiles(maybeFilePaths: Option[String]): Option[TarGzippedData] = {
I'm not entirely sure if it's worth keeping TarGzippedData fields as Options anymore. Sure, we save a few bytes of extra data to upload if there aren't any jars, but it's an extra layer of indirection that has to be semantically understood.
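To make the tradeoff concrete, the two shapes under discussion might look roughly like this (field names are invented for illustration; the real class differs):

```scala
// Current: each field is optional, so an empty upload saves a few bytes on the wire...
case class TarGzippedDataCurrent(dataBase64: Option[String])

// ...versus always sending the field, where an empty string simply means "no data"
// and readers no longer need to handle the None case.
case class TarGzippedDataProposed(dataBase64: String)
```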
     serverSparkVersion = SPARK_VERSION
   }

 object AppResource {
Mark the object as private[spark]
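For example:

```scala
private[spark] object AppResource {
  // ...
}
```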
I don't think we should support both local:// and container://.
After spending some time comparing this implementation with the one in #107, I've decided that one is better (it covers several things that this one did not, including docs, SparkSubmit option updates, and remapping paths in those two options). Closing this PR and continuing work on the other. Sorry for the double review, folks!
Fixes #92