Added files should be in the working directories. #294
Conversation
ash211 left a comment
Looks good. It seems like we should be able to have the init container write files directly into the working directory of the driver, as configured via an environment variable on the init container. The contract then is that the init container must download the jars and place them in the dir specified by the envvar.
Do you feel that's an opaque contract?
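The contract being proposed can be sketched roughly as follows. This is an illustrative sketch only: the directory path and the file name are made up, and `SPARK_MOUNTED_FILES_DIR` is used here simply as the agreed-upon environment variable between the two containers.

```shell
# Both containers agree on a directory via an environment variable.
# (Path and file name below are illustrative, not the real Spark layout.)
FILES_DIR="${SPARK_MOUNTED_FILES_DIR:-/tmp/spark-files}"

# Init-container side: download dependencies into the agreed directory.
mkdir -p "$FILES_DIR"
echo "example jar contents" > "$FILES_DIR/app.jar"

# Driver side: pick the files up from the same directory into its own
# working directory, without needing to know where they came from.
cp -R "$FILES_DIR/." .
ls app.jar
```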
    if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && \
    if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && \
    if ! [ -z ${SPARK_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && \
    if ! [ -z ${SPARK_MOUNTED_FILES_DIR} ]; then cp -R "$SPARK_MOUNTED_FILES_DIR/." .; fi && \
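For context on the lines above: the `${VAR+x}` idiom expands to `x` only when `VAR` is set (even if set to the empty string), so `! [ -z ${VAR+x} ]` means "VAR is set". A small standalone demonstration, with made-up classpath values:

```shell
# Made-up values for illustration; the globs are kept literal, not expanded.
SPARK_CLASSPATH="/opt/spark/jars/*"
SPARK_MOUNTED_CLASSPATH="/var/spark-data/jars/*"

# SPARK_MOUNTED_CLASSPATH is set, so it gets prepended.
if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} ]; then
  SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"
fi

# SPARK_EXTRA_CLASSPATH is unset here, so this branch is skipped.
if ! [ -z ${SPARK_EXTRA_CLASSPATH+x} ]; then
  SPARK_CLASSPATH="$SPARK_EXTRA_CLASSPATH:$SPARK_CLASSPATH"
fi

echo "$SPARK_CLASSPATH"
```

Note that the last Dockerfile line uses `${SPARK_MOUNTED_FILES_DIR}` without `+x`, so it treats "unset" and "set but empty" the same, unlike the classpath checks above it.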
maybe can use ln instead of cp to reduce disk churn?
We don't want to symlink the directory itself, but we could create a link to each file in the directory?
Right, could do some unix-foo to make the files appear in the other directory. Something like:
    if ! [ -z ${SPARK_MOUNTED_FILES_DIR} ]; then find "$SPARK_MOUNTED_FILES_DIR/" -type f -exec ln -s {} . \; ; fi && \
This is something we can do after this PR though -- would you rather revisit later if profiling reveals the copy is slow? It should only affect large files. If we push, I can file an issue to remind us to come back to this
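For the record, the two approaches discussed here behave like this (directory and file names below are made up for illustration):

```shell
# Set up a stand-in for the mounted files volume.
mkdir -p mounted_files workdir_cp workdir_ln
echo "data" > mounted_files/file1.txt

# Copy approach (current PR): duplicates the bytes on disk.
( cd workdir_cp && cp -R "$OLDPWD/mounted_files/." . )

# Symlink approach (suggested above): one link per file, so large files
# are not rewritten to disk; $OLDPWD makes the link targets absolute.
( cd workdir_ln && find "$OLDPWD/mounted_files/" -type f -exec ln -s {} . \; )

cat workdir_cp/file1.txt   # a regular copied file
cat workdir_ln/file1.txt   # reads through the symlink
```

One caveat with the symlink variant: the links dangle if the mounted volume is removed while the driver is still running, which the copy avoids.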
Yeah - spark.files is meant to be for small things, usually.
> The problem is the submission client doesn't know what the working directory of the driver is. In our current model we've set the working directory in the Dockerfile, and the submission client does not know many details of it.

Ah, I see. Given that, I think this is the right approach.

Planning to merge when the integration test passes.
Added files should be in the working directories. (#294)
* Added files should be in the working directories.
* Revert unintentional changes
* Fix test
Closes #292.
The implementation here unfortunately requires more logic in the Dockerfile's command, which further increases the surface area of the API between the Docker image and the submission client. Nevertheless, this seems like the best of the sub-optimal alternatives. For example, we could have made the init-container copy the files into its own working directory, but this introduces an opaque contract where the init-container must have the same working directory as the driver; since working directories are defined by the Docker images, custom driver images may not be able to satisfy this guarantee.
The approach taken here at least ensures that however files end up in the given directory, the driver consistently copies them into its own working directory, regardless of the state of the component that mounted those files in the first place.