This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Allow "spark.files" to be shipped through secrets or configmaps #393

@mccheah

Description

Many applications will include all of their binaries in Docker images, but will still require configurations to be set dynamically at submission time. There is a separate discussion to be had about allowing arbitrary secret and configmap mounts to be pushed onto the driver and executor pods. However, application deployment strategies that are being ported over from YARN, Mesos, or Standalone mode will expect these files to be easily provided through spark.files.

Currently, Spark applications need to submit their local files through the resource staging server. Given the use case described above, however, it would be more convenient if application submitters did not need to run a resource staging server just to ship their configuration files. This aligns with the general impression in the Spark community that data shipped through spark.files is intended to be small.

The proposed scheme to consider all of these factors is as follows:

  1. If a resource staging server URI is provided, ship all dependencies through the resource staging server.
  2. If a resource staging server is not provided, then prevent local jars from being submitted. We make this simplification because we assume that jars are generally larger than files, and that more jars than files are typically sent per submission.
  3. If a resource staging server is not provided, the size of each local file sent in spark.files is examined. We provide a configuration option called spark.kubernetes.files.maxSize (there's probably a better name for this to denote that we're submitting through a Kubernetes secret). If any file exceeds the max size then we fail the submission. The maximum size has a reasonable default that ensures that users do not accidentally overload etcd, but it can be adjusted if the submitter is aware of the potential consequences.
  4. Assuming every file passes the check described in (3), the files are pushed into a Kubernetes secret where every secret key-value pair corresponds to a single submitted file. We use secrets instead of ConfigMaps here in order to support binary data and not just textual files, although the files need not be sensitive data by any means.
  5. The main container for the driver and executor, prior to starting the JVM, copies the files from the secret mount point into the working directory. This is to satisfy the contract that files sent through spark.files are in the working directory of the driver and executors, but it has to be through a copy since secret mounts cannot be in the working directory of the container itself.
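Steps (3) and (4) above could be sketched as follows. The default cap and the key-naming convention here are assumptions for illustration, not values from this proposal; a real default for spark.kubernetes.files.maxSize would be chosen to keep the resulting secret well under etcd's object size limit.

```python
import base64
from pathlib import Path

# Hypothetical default cap (assumption, not from the proposal).
DEFAULT_MAX_FILE_SIZE_BYTES = 10 * 1024


def build_files_secret_data(file_paths, max_size=DEFAULT_MAX_FILE_SIZE_BYTES):
    """Validate each spark.files entry and build the secret's data map.

    Returns a dict of secret key -> base64-encoded contents, one key per
    submitted file (step 4), failing the submission fast if any file
    exceeds the configured maximum size (step 3).
    """
    data = {}
    for raw in file_paths:
        path = Path(raw)
        size = path.stat().st_size
        if size > max_size:
            raise ValueError(
                f"File {path} is {size} bytes, exceeding the {max_size}-byte "
                "limit; raise spark.kubernetes.files.maxSize only if you are "
                "aware of the impact on etcd.")
        # Secrets carry base64-encoded bytes, so binary files work too,
        # which is why a secret is used here rather than a ConfigMap.
        data[path.name] = base64.b64encode(path.read_bytes()).decode("ascii")
    return data
```

Using one secret key per file keeps the mapping back to spark.files entries obvious when the secret is later mounted into the pods.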

Again - this is strictly independent of the discussion on how custom volume mounts can be provided for the containers. This is a simpler scheme that basically makes spark.files easier to manage in the absence of a resource staging server. More complex use cases that require arbitrary mount points for arbitrary volume types should use something like a pod preset.
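The copy in step (5) could be sketched as below; the mount path and working directory are illustrative assumptions, since secret volumes are read-only and mounted at a fixed path rather than at the container's working directory.

```python
import shutil
from pathlib import Path


def copy_spark_files_into_cwd(secret_mount="/etc/spark-submitted-files",
                              workdir="."):
    """Copy each file out of the (read-only) secret mount into the working
    directory before the JVM starts, preserving the contract that files
    sent through spark.files appear in the driver's and executors'
    working directory. Both default paths are hypothetical.
    """
    for entry in Path(secret_mount).iterdir():
        if entry.is_file():
            shutil.copy(entry, Path(workdir) / entry.name)
```

In practice this would run in the container entrypoint, before exec'ing the driver or executor JVM.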
