Allow adding arbitrary files #71
Conversation
Forgot documentation - will add that shortly
ash211
left a comment
Looks good! Happy to see the test verifying that it's placed on the resulting driver pod too.
Some minor nitpicky stuff but I saw nothing major on this diff
    .doc("""
      | Comma-separated list of files to send to the driver and
      | all executors when submitting the application in cluster
      | mode.
note that these are placed in the CWD of the driver and executors once running
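Because uploads land in the working directory, application code can open them by bare file name. A minimal sketch of that lookup (the helper and its usage are illustrative, not part of the patch):

```scala
import java.io.File
import scala.io.Source

// Reads a file that was uploaded into the process's current working
// directory. A relative path resolves against the CWD, which is where
// the submitted files are placed.
def readUploadedFile(name: String): Option[String] = {
  val f = new File(name)
  if (f.exists()) Some(Source.fromFile(f).mkString) else None
}
```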
      */
    - def createTarGzip(paths: Iterable[String]): TarGzippedData = {
    + def createTarGzip(paths: Iterable[String]):
    +     TarGzippedData = {
format-only change -- is this for style?
    - | Comma-separated list of jars to sent to the driver and
    + | Comma-separated list of jars to send to the driver and
        | all executors when submitting the application in cluster
        | mode.
and stored in a jars subdirectory in each process's CWD
The jars are actually stored in temp space, which makes it basically impossible to find by the user.
    val driverClasspath = driverExtraClasspath ++
      resolvedJars ++
      sparkJars ++
      Array(appResourcePath)
why is this line removed? should it not have been there before?
sparkJars shouldn't be removed - that was a bad merge. appResourcePath, however, is already added above.
Actually never mind, misread the diff. appResourcePath is not necessary here.
    .map(_.split(","))
    .getOrElse(Array.empty[String])
    val resolvedFiles = originalFiles ++ writtenFiles
    resolvedSparkProperties("spark.files") = resolvedFiles.mkString
does this need (",") as a parameter to the mkString ?
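The reviewer is right to flag this: with no arguments, `mkString` concatenates elements with no separator, so the value could not be split back into a file list. A quick illustration:

```scala
val resolvedFiles = Array("a.txt", "b.txt", "c.txt")

// No separator: the elements are simply concatenated.
val joinedBare = resolvedFiles.mkString       // "a.txtb.txtc.txt"

// With a comma separator, the value round-trips through split(",").
val joinedCsv = resolvedFiles.mkString(",")   // "a.txt,b.txt,c.txt"
```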
    "--upload-jars", HELPER_JAR,
    "--upload-files", TEST_EXISTENCE_FILE.getAbsolutePath,
    "--class", FILE_EXISTENCE_MAIN_CLASS,
    "--conf", "spark.ui.enabled=true",
if you don't need the ui, disabling saves a decent amount of startup time
    // Because uploaded files should be added to the working directory of the driver, they
    // need to not have duplicate file names. They are added to the working directory so the
    // user can reliably locate them in their application.
note that this is the same pattern as YARN? I believe it is
    private def validateNoDuplicateUploadFileNames(): Unit = {
      uploadedFiles.foreach { unsplitPaths =>
        val splitPaths = unsplitPaths.split(",")
        val allPathsByFileName = splitPaths.groupBy(new File(_).getName)
if you're not using splitPaths and allPathsByFileName except on the way to creating pathsWithDuplicateNames I think it makes sense to not give them identifiers, and instead just chain straight through
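The chained form the reviewer has in mind might look like the following sketch (the function name and return shape are illustrative, not the patch's actual code):

```scala
import java.io.File

// Hypothetical chained rewrite of the duplicate-name check: split the
// comma-separated paths, group by base file name, and keep only the
// groups with more than one entry - no intermediate identifiers.
def duplicateFileNames(unsplitPaths: String): Iterable[Array[String]] =
  unsplitPaths
    .split(",")
    .groupBy(new File(_).getName)
    .values
    .filter(_.length > 1)
```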
    - private val uploadedJars = sparkConf.get(KUBERNETES_DRIVER_UPLOAD_JARS)
    + private val uploadedJars = sparkConf.get(KUBERNETES_DRIVER_UPLOAD_JARS).filter(_.nonEmpty)
    + private val uploadedFiles = sparkConf.get(KUBERNETES_DRIVER_UPLOAD_FILES).filter(_.nonEmpty)
      validateNoDuplicateUploadFileNames()
this seems a little weird -- should we have that method take an input and return an output instead of relying on the state of the instance vars being set already?
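A stateless version along the lines suggested here would take the setting as a parameter instead of reading an instance `val`. A hedged sketch (the real method signature may differ):

```scala
import java.io.File

// Stateless variant: the raw comma-separated setting is passed in
// explicitly, so the check does not depend on construction order of
// instance fields. Fails fast if any base file name repeats.
def validateNoDuplicateUploadFileNames(uploadedFiles: Option[String]): Unit =
  uploadedFiles.foreach { unsplitPaths =>
    val duplicates = unsplitPaths
      .split(",")
      .groupBy(new File(_).getName)
      .collect { case (name, paths) if paths.length > 1 => name }
    require(duplicates.isEmpty,
      s"Cannot upload files with duplicate names: ${duplicates.mkString(", ")}")
  }
```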
Documentation is added, addressed @aash's comments
ash211
left a comment
This looks great to me! @iyanuobidele / @foxish any last comments before merging? I'd like to get this in mid-afternoon if possible.
LGTM

1 similar comment

LGTM
Cheers, merged!
* Allow adding arbitrary files
* Address comments and add documentation
Similarly to how jars can be uploaded from the client to the driver, arbitrary files should also be allowed to be mounted. This is similar to how Spark supports sending files via spark.files as well as jars. Note that the files are added to the working directory of the driver because the user needs to be able to find these files in their application, and this matches the semantics of spark.files. However, if we de-duplicated files with identical names, it would be difficult for the user to find the resolved files. Thus, unlike the case with jars, we explicitly throw an error if the user gives two files with the same name.
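The error is a direct consequence of the flat working directory: two uploads that share a base name would map to the same destination, with the second silently overwriting the first. A small illustration (the paths are hypothetical):

```scala
import java.io.File

// Two different source paths with the same base name resolve to the
// same relative destination in the driver's working directory.
val uploads = Seq("/data/east/input.csv", "/data/west/input.csv")
val destinations = uploads.map(p => new File(new File(p).getName).getPath)

// Fewer distinct destinations than uploads means a name collision,
// which is why the submit client rejects such input outright.
val collides = destinations.distinct.size < destinations.size
```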