This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Allow "spark.files" to be shipped through secrets or configmaps #393

@mccheah

Description

Many applications will include all of their binaries in Docker images, but will still require configurations to be set dynamically at submission time. There is a separate discussion to be had about allowing arbitrary secret and configmap mounts to be pushed onto the driver and executor pods. However, application deployment strategies that are being ported over from YARN, Mesos, or Standalone mode will expect these files to be easily provided through spark.files.

Currently, Spark applications need to submit their local files through the resource staging server. Given the use case described above, however, it would be more convenient if application submitters did not need to run a resource staging server just to ship their configuration files. This aligns with the general impression in the Spark community that data shipped through spark.files is intended to be small.

The proposed scheme to consider all of these factors is as follows:

  1. If a resource staging server URI is provided, ship all dependencies through the resource staging server.
  2. If a resource staging server is not provided, then prevent local jars from being submitted. We make this simplification because we assume that jars are generally larger than files, and that more jars than files are typically sent per submission.
  3. If a resource staging server is not provided, the size of each local file sent in spark.files is examined. We provide a configuration option called spark.kubernetes.files.maxSize (there's probably a better name for this to denote that we're submitting through a Kubernetes secret). If any file exceeds the max size then we fail the submission. The maximum size has a reasonable default that ensures that users do not accidentally overload etcd, but it can be adjusted if the submitter is aware of the potential consequences.
  4. Assuming every file passes the check described in (3), the files are pushed into a Kubernetes secret where every secret key-value pair corresponds to a single submitted file. We use secrets instead of ConfigMaps here in order to support binary data and not just textual files, although the files need not be sensitive data by any means.
  5. The main container for the driver and executor, prior to starting the JVM, copies the files from the secret mount point into the working directory. This is to satisfy the contract that files sent through spark.files are in the working directory of the driver and executors, but it has to be through a copy since secret mounts cannot be in the working directory of the container itself.
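Steps (3) and (4) above could be sketched as follows. The default cap and the key-naming convention here are assumptions for illustration, not values from this proposal; a real default for spark.kubernetes.files.maxSize would be chosen to keep the resulting secret well under etcd's object size limit.

```python
import base64
from pathlib import Path

# Hypothetical default cap (assumption, not from the proposal).
DEFAULT_MAX_FILE_SIZE_BYTES = 10 * 1024


def build_files_secret_data(file_paths, max_size=DEFAULT_MAX_FILE_SIZE_BYTES):
    """Validate each spark.files entry and build the secret's data map.

    Returns a dict of secret key -> base64-encoded contents, one key per
    submitted file (step 4), failing the submission fast if any file
    exceeds the configured maximum size (step 3).
    """
    data = {}
    for raw in file_paths:
        path = Path(raw)
        size = path.stat().st_size
        if size > max_size:
            raise ValueError(
                f"File {path} is {size} bytes, exceeding the {max_size}-byte "
                "limit; raise spark.kubernetes.files.maxSize only if you are "
                "aware of the impact on etcd.")
        # Secrets carry base64-encoded bytes, so binary files work too,
        # which is why a secret is used here rather than a ConfigMap.
        data[path.name] = base64.b64encode(path.read_bytes()).decode("ascii")
    return data
```

Using one secret key per file keeps the mapping back to spark.files entries obvious when the secret is later mounted into the pods.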

Again - this is strictly independent of the discussion on how custom volume mounts can be provided for the containers. This is a simpler scheme that basically makes spark.files easier to manage in the absence of a resource staging server. More complex use cases that require arbitrary mount points for arbitrary volume types should use something like a pod preset.
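The copy in step (5) could be sketched as below; the mount path and working directory are illustrative assumptions, since secret volumes are read-only and mounted at a fixed path rather than at the container's working directory.

```python
import shutil
from pathlib import Path


def copy_spark_files_into_cwd(secret_mount="/etc/spark-submitted-files",
                              workdir="."):
    """Copy each file out of the (read-only) secret mount into the working
    directory before the JVM starts, preserving the contract that files
    sent through spark.files appear in the driver's and executors'
    working directory. Both default paths are hypothetical.
    """
    for entry in Path(secret_mount).iterdir():
        if entry.is_file():
            shutil.copy(entry, Path(workdir) / entry.name)
```

In practice this would run in the container entrypoint, before exec'ing the driver or executor JVM.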
