This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

@mccheah mccheah commented Feb 1, 2017

Just as jars can be uploaded from the client to the driver, arbitrary files should also be allowed to be mounted. This mirrors how Spark supports sending files via spark.files as well as jars.

Note that the files are added to the working directory of the driver because the user needs to be able to locate these files from their application, and this matches the semantics of spark.files. However, if we de-duplicated files with identical names, it would be difficult for the user to find the resolved files. Thus, unlike the case with jars, we explicitly throw an error if the user provides two files with the same name.
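Because the uploaded files land in the driver's working directory, application code can open them by their bare file names. A minimal sketch of the user-facing side, assuming a file submitted via --upload-files /local/path/data.txt (the file name is illustrative):

    import scala.io.Source

    // The uploaded file is placed in the driver's working directory,
    // so the bare name is enough to locate it.
    val firstLines = Source.fromFile("data.txt").getLines().take(10).toList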

mccheah commented Feb 1, 2017

Forgot documentation - will add that shortly


@ash211 ash211 left a comment


Looks good! Happy to see the test verifying that it's placed on the resulting driver pod too.

Some minor nitpicky stuff but I saw nothing major on this diff

.doc("""
| Comma-separated list of files to send to the driver and
| all executors when submitting the application in cluster
| mode.

note that these are placed in the CWD of the driver and executors once running

   */
-  def createTarGzip(paths: Iterable[String]): TarGzippedData = {
+  def createTarGzip(paths: Iterable[String]):
+      TarGzippedData = {

format-only change -- is this for style?

-   | Comma-separated list of jars to sent to the driver and
+   | Comma-separated list of jars to send to the driver and
    | all executors when submitting the application in cluster
    | mode.

and stored in a jars subdirectory in each process's CWD

Author

The jars are actually stored in temp space, which makes it basically impossible for the user to find them.

val driverClasspath = driverExtraClasspath ++
resolvedJars ++
sparkJars ++
Array(appResourcePath)

why is this line removed? should it not have been there before?

Author

sparkJars shouldn't be removed - that was a bad merge. appResourcePath, however, is already added above.

Author

Actually never mind, misread the diff. appResourcePath is not necessary here.

.map(_.split(","))
.getOrElse(Array.empty[String])
val resolvedFiles = originalFiles ++ writtenFiles
resolvedSparkProperties("spark.files") = resolvedFiles.mkString

does this need (",") as a parameter to the mkString?
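For reference, Scala's mkString with no arguments joins the elements with no separator, so the comma variant is almost certainly what is wanted here (the paths below are made up):

    val resolvedFiles = Seq("/opt/data/a.txt", "/opt/data/b.txt")

    resolvedFiles.mkString      // "/opt/data/a.txt/opt/data/b.txt" -- paths run together
    resolvedFiles.mkString(",") // "/opt/data/a.txt,/opt/data/b.txt" -- a valid comma-separated list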

"--upload-jars", HELPER_JAR,
"--upload-files", TEST_EXISTENCE_FILE.getAbsolutePath,
"--class", FILE_EXISTENCE_MAIN_CLASS,
"--conf", "spark.ui.enabled=true",

if you don't need the ui, disabling saves a decent amount of startup time
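That is, assuming the test does not exercise the UI, the submission argument could simply flip the flag:

    "--conf", "spark.ui.enabled=false",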


// Because uploaded files should be added to the working directory of the driver, they
// need to not have duplicate file names. They are added to the working directory so the
// user can reliably locate them in their application.

note that this is the same pattern as YARN? I believe it is

private def validateNoDuplicateUploadFileNames(): Unit = {
uploadedFiles.foreach { unsplitPaths =>
val splitPaths = unsplitPaths.split(",")
val allPathsByFileName = splitPaths.groupBy(new File(_).getName)

If you're only using splitPaths and allPathsByFileName on the way to creating pathsWithDuplicateNames, I think it makes sense not to give them identifiers and instead just chain straight through.
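A rough sketch of the chained form being suggested, assuming it sits in the same class where uploadedFiles is defined and that the method throws on duplicates as the PR summary describes (the error message is illustrative):

    import java.io.File

    private def validateNoDuplicateUploadFileNames(): Unit = {
      uploadedFiles.foreach { unsplitPaths =>
        // Group the comma-separated paths by bare file name and keep only the
        // duplicated names, without binding the intermediate results.
        val pathsWithDuplicateNames = unsplitPaths
          .split(",")
          .groupBy(new File(_).getName)
          .filter(_._2.length > 1)
        if (pathsWithDuplicateNames.nonEmpty) {
          throw new IllegalArgumentException(
            "Cannot upload files with duplicate names: " +
              pathsWithDuplicateNames.keys.mkString(","))
        }
      }
    }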

-  private val uploadedJars = sparkConf.get(KUBERNETES_DRIVER_UPLOAD_JARS)
+  private val uploadedJars = sparkConf.get(KUBERNETES_DRIVER_UPLOAD_JARS).filter(_.nonEmpty)
+  private val uploadedFiles = sparkConf.get(KUBERNETES_DRIVER_UPLOAD_FILES).filter(_.nonEmpty)
+  validateNoDuplicateUploadFileNames()

this seems a little weird -- should we have that method take an input and return an output instead of relying on the state of the instance vars being set already?
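A sketch of that alternative, where the validation takes the raw config value and returns the offending names, leaving the caller to decide how to fail (the method name is hypothetical):

    import java.io.File

    private def findDuplicateUploadFileNames(uploadedFiles: Option[String]): Set[String] =
      uploadedFiles
        .map(_.split(",").toSeq)
        .getOrElse(Seq.empty)
        .groupBy(new File(_).getName)
        .filter(_._2.size > 1)
        .keySet

    // Possible call site:
    // require(findDuplicateUploadFileNames(uploadedFiles).isEmpty,
    //   "Cannot upload files with duplicate names")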

mccheah commented Feb 2, 2017

Documentation is added; addressed @aash's comments.


@ash211 ash211 left a comment


This looks great to me! @iyanuobidele / @foxish any last comments before merging? I'd like to get this in mid-afternoon if possible.


foxish commented Feb 2, 2017

LGTM

@iyanuobidele

LGTM

@ash211 ash211 merged commit 42819f7 into k8s-support-alternate-incremental Feb 2, 2017
@ash211 ash211 deleted the submit-arbitrary-files branch February 2, 2017 20:22

ash211 commented Feb 2, 2017

Cheers, merged!

ash211 pushed a commit that referenced this pull request Feb 8, 2017
* Allow adding arbitrary files

* Address comments and add documentation
ash211 pushed a commit that referenced this pull request Mar 8, 2017
* Allow adding arbitrary files

* Address comments and add documentation
foxish pushed a commit that referenced this pull request Jul 24, 2017
* Allow adding arbitrary files

* Address comments and add documentation
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 25, 2019
* Allow adding arbitrary files

* Address comments and add documentation
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
* Allow adding arbitrary files

* Address comments and add documentation