This repository was archived by the owner on Jan 9, 2020. It is now read-only.
Merged
1 change: 1 addition & 0 deletions docs/_layouts/global.html
@@ -99,6 +99,7 @@
<li><a href="spark-standalone.html">Spark Standalone</a></li>
<li><a href="running-on-mesos.html">Mesos</a></li>
<li><a href="running-on-yarn.html">YARN</a></li>
<li><a href="running-on-kubernetes.html">Kubernetes</a></li>
</ul>
</li>

1 change: 1 addition & 0 deletions docs/index.md
@@ -113,6 +113,7 @@ options for deployment:
* [Mesos](running-on-mesos.html): deploy a private cluster using
[Apache Mesos](http://mesos.apache.org)
* [YARN](running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)
* [Kubernetes](running-on-kubernetes.html): deploy Spark on top of Kubernetes

**Other Documents:**

224 changes: 224 additions & 0 deletions docs/running-on-kubernetes.md
@@ -0,0 +1,224 @@
---
layout: global
title: Running Spark on Kubernetes
---

Support for running on [Kubernetes](https://kubernetes.io/) is available in experimental status. The feature set is
currently limited and not well-tested. This should not be used in production environments.

## Setting Up Docker Images

Kubernetes requires users to supply images that can be deployed into containers within pods. The images are built to
be run in a container runtime environment that Kubernetes supports. Docker is a container runtime environment that is
frequently used with Kubernetes, so Spark provides some support for working with Docker to get started quickly.

To use Spark on Kubernetes with Docker, images for the driver and the executors need to be built and published to an
accessible Docker registry. Spark distributions include the Docker files for the driver and the executor at
`dockerfiles/driver/Dockerfile` and `dockerfiles/executor/Dockerfile`, respectively. Use these Docker files to build the
Docker images, and then tag them with the registry that the images should be sent to. Finally, push the images to the
registry.

For example, if the registry host is `registry-host` and the registry is listening on port 5000:

    cd $SPARK_HOME

cd $SPARK_HOME/dist ?

Author

This is documentation under the assumption that Spark was unpacked from a tarball, such as when it is downloaded from the Spark website.

Author

We should also record dev-workflow docs somewhere, these aren't included in this PR just yet.

I totally agree.

    docker build -t registry-host:5000/spark-driver:latest -f dockerfiles/driver/Dockerfile .
    docker build -t registry-host:5000/spark-executor:latest -f dockerfiles/executor/Dockerfile .
    docker push registry-host:5000/spark-driver:latest
    docker push registry-host:5000/spark-executor:latest

## Submitting Applications to Kubernetes

Spark applications can be submitted to Kubernetes via `spark-submit`. For example, to compute the value of pi, assuming
the images are set up as described above:

    bin/spark-submit \
      --deploy-mode cluster \
      --class org.apache.spark.examples.SparkPi \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \

Btw is the https necessary?

Author

It is for now, later we want to remove it and make https the default, but optionally the user can fill in http.

Right now these are supported:

  • k8s://https://<host>:<port>
  • k8s://http://<host>:<port>

In #19 we want to additionally make this supported:

  • k8s://<host>:<port> which would be equivalent to k8s://https://<host>:<port>

      --kubernetes-namespace default \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      examples/jars/spark_2.11-2.2.0.jar

<!-- TODO master should default to https if no scheme is specified -->
The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url>`. Prefixing the
master string with `k8s://` will cause the Spark application to launch on the Kubernetes cluster, with the API server
being contacted at `api_server_url`. The URL scheme of the API server (`http://` or `https://`) must also be specified.

Note that applications can currently only be executed in cluster mode, where the driver and its executors are running on
the cluster.
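
The master URL format described above can be checked before submitting. The following is only a minimal sketch: the
function name `k8s_api_server_url` is hypothetical and not part of Spark, and it assumes that only the two forms
mentioned in this page (`k8s://http://...` and `k8s://https://...`) are accepted.

```shell
# Hypothetical helper (not part of Spark): extract the API server URL from a
# k8s:// master string and require an explicit http:// or https:// scheme.
k8s_api_server_url() {
  master="$1"
  case "$master" in
    k8s://http://*|k8s://https://*)
      # Strip the leading "k8s://" prefix; the remainder is the API server URL.
      printf '%s\n' "${master#k8s://}"
      ;;
    *)
      echo "unsupported master URL: $master" >&2
      return 1
      ;;
  esac
}
```

For instance, `k8s_api_server_url k8s://https://192.168.99.100` prints `https://192.168.99.100`, while a master string
without a scheme after `k8s://` is rejected with a non-zero exit status.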

### Adding Other JARs

Spark allows users to provide dependencies that are bundled into the driver's Docker image, or that are on the local
disk of the submitter's machine. These two types of dependencies are specified via different configuration options to
`spark-submit`:

* Local jars provided by specifying the `--jars` command line argument to `spark-submit`, or by setting `spark.jars` in
the application's configuration, will be treated as jars that are located on the *disk of the driver Docker
container*. This only applies to jar paths that do not specify a scheme or that have the scheme `file://`. Paths with
other schemes are fetched from their appropriate locations.
* Local jars provided by specifying the `--upload-jars` command line argument to `spark-submit`, or by setting
`spark.kubernetes.driver.uploads.jars` in the application's configuration, will be treated as jars that are located on
the *disk of the submitting machine*. These jars are uploaded to the driver docker container before executing the
application.

"and are placed on the driver's classpath"

<!-- TODO support main resource bundled in the Docker image -->
* A main application resource path that does not have a scheme or that has the scheme `file://` is assumed to be on the

Author

It would be nice to be able to specify a main application resource on the container's disk as well. The main trouble here is how to specify that: do we create a new custom file scheme, like docker:// to denote that the file is in the docker image?

for sake of usability, I believe we should support the docker:// scheme, since 'no scheme' == file://

Author

Yep - it's not particularly great that the API is to expect a magical prefix but I don't see a better option.

Member

Docker isn't the only runtime that is supported (although it is the most common), so, we could opt for something neutral like container://, or pod://.

Good point @foxish -- I like container:// better, since in my understanding a k8s pod can have multiple containers each with independent filesystems so pod:// isn't precise enough

Author

If we were to include container:// urls, we could also use this kind of URL scheme for the uploaded jars, and remove spark.kubernetes.driver.uploads.jars. I kind of like having the partitioning into two settings in this case though since spark.jars has preconceived expectations in all of the cluster managers, and there is some dissonance in making Kubernetes handle spark.jars in a special way. However this then makes specifying the main resource jar inconsistent with specifying the other uploaded jars.

*disk of the submitting machine*. This resource is uploaded to the driver docker container before executing the
application. A remote path can still be specified and the resource will be fetched from the appropriate location.
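
The scheme rule for `--jars` entries in the list above can be sketched as a small shell function. This is an
illustration only: the name `jar_location` and its two output labels are hypothetical, and the function reflects the
rule as documented here rather than Spark's actual resolution code.

```shell
# Hypothetical helper (not part of Spark): report where a --jars entry is
# expected to live, following the rule documented in the list above.
jar_location() {
  case "$1" in
    # An explicit file:// scheme refers to the driver container's disk.
    file://*) echo "driver-container-disk" ;;
    # Any other scheme is fetched from its own location.
    *://*)    echo "remote" ;;
    # No scheme at all is treated the same as file://.
    *)        echo "driver-container-disk" ;;
  esac
}
```

Under this rule, `/opt/spark-plugins/app-plugin.jar` and `file:///opt/spark-plugins/app-plugin.jar` are both read from
the driver container's disk, while an `http://` path is fetched remotely.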

In all of these cases, the jars are placed on the driver's classpath, and are also sent to the executors. Below are some
examples of providing application dependencies.

To submit an application with both the main resource and two other jars living on the submitting user's machine:

    bin/spark-submit \
      --deploy-mode cluster \
      --class com.example.applications.SampleApplication \
      --master k8s://https://192.168.99.100 \
      --kubernetes-namespace default \
      --upload-jars /home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      /home/exampleuser/exampleapplication/main.jar

Note that since passing the jars through the `--upload-jars` command line argument is equivalent to setting the
`spark.kubernetes.driver.uploads.jars` Spark property, the above will behave identically to this command:

    bin/spark-submit \
      --deploy-mode cluster \
      --class com.example.applications.SampleApplication \
      --master k8s://https://192.168.99.100 \
      --kubernetes-namespace default \
      --conf spark.kubernetes.driver.uploads.jars=/home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      /home/exampleuser/exampleapplication/main.jar

new paragraph:

note that --upload-jars is equivalent to --conf spark.kubernetes.driver.uploads.jars

If we can bold those lines in both indentations that would be nice too

Author

Tricky to do both preformatted text and bolding in the same line; we'd have to invoke HTML for that. I think this is fine as is really.

To specify a main application resource that can be downloaded from an HTTP service, when a plugin for that application
is located in the jar `/opt/spark-plugins/app-plugin.jar` on the Docker image's disk:

    bin/spark-submit \
      --deploy-mode cluster \
      --class com.example.applications.PluggableApplication \
      --master k8s://https://192.168.99.100 \
      --kubernetes-namespace default \
      --jars /opt/spark-plugins/app-plugin.jar \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver-custom:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      http://example.com:8080/applications/sparkpluggable/app.jar

Note that since passing the jars through the `--jars` command line argument is equivalent to setting the `spark.jars`
Spark property, the above will behave identically to this command:

    bin/spark-submit \
      --deploy-mode cluster \
      --class com.example.applications.PluggableApplication \
      --master k8s://https://192.168.99.100 \
      --kubernetes-namespace default \
      --conf spark.jars=file:///opt/spark-plugins/app-plugin.jar \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver-custom:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      http://example.com:8080/applications/sparkpluggable/app.jar

same callout paragraph afterwards:

--jars is the same as --conf spark.jars and bold that line

### Spark Properties

Below are some other common properties that are specific to Kubernetes. Most of the other configurations are the same
as in the other deployment modes. See the [configuration page](configuration.html) for more information on those.

<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.kubernetes.namespace</code></td>
<!-- TODO set default to "default" -->
<td>(none)</td>

do the other docs use (none) for required docs too? I'd be inclined to put (required) here instead

Author

We should probably just default to default however so that won't matter much in the longer term.

Member

+1 We should default to the default namespace.

<td>
The namespace that will be used for running the driver and executor pods. Must be specified. When using
<code>spark-submit</code> in cluster mode, this can also be passed to <code>spark-submit</code> via the
<code>--kubernetes-namespace</code> command line argument.
</td>
</tr>
<tr>
<td><code>spark.kubernetes.driver.docker.image</code></td>
<td><code>spark-driver:2.2.0</code></td>

let's change this from 2.2.0 to something like 2.2.0-latest for now in both image versions. I want to make sure we never accidentally publish a release that we'd later try to replace with something else in the same name. -latest at least makes it clear the image can change

Author

Currently the code refers to SPARK_VERSION to get the version here... which may actually be 2.2.0-SNAPSHOT in this case. Not sure how to keep these docs in sync with the specific version string, but 2.2.0-SNAPSHOT seems like the actual correct default here.

Author

I'm ok with 2.2.0 for now but would be happy to hear suggestions in general. I think this will strictly depend on what we end up doing for publishing Docker images, if we publish at all. If we don't publish at all though then spark-driver:latest and spark-executor:latest seems reasonable for defaults, but defaults would rarely be used in practice because everyone would tag their own images slightly differently (spark-driver:latest, spark_driver:2.2.0, sparkDriver:2.2.0, spark-driver:2.2.0, spark-driver:2.2, spark-app:2.2.0...), not to mention sometimes needing to specify the Docker registry host in the tag.

<td>
Docker image to use for the driver. Specify this using the standard
<a href="https://docs.docker.com/engine/reference/commandline/tag/">Docker tag</a> format.
</td>
</tr>
<tr>
<td><code>spark.kubernetes.executor.docker.image</code></td>
<td><code>spark-executor:2.2.0</code></td>
<td>
Docker image to use for the executors. Specify this using the standard
<a href="https://docs.docker.com/engine/reference/commandline/tag/">Docker tag</a> format.
</td>
</tr>
<tr>
<td><code>spark.kubernetes.submit.caCertFile</code></td>
<td>(none)</td>
<td>
CA cert file for connecting to Kubernetes over SSL. This file should be located on the submitting machine's disk.
</td>
</tr>
<tr>
<td><code>spark.kubernetes.submit.clientKeyFile</code></td>
<td>(none)</td>
<td>
Client key file for authenticating against the Kubernetes API server. This file should be located on the submitting
machine's disk.
</td>
</tr>
<tr>
<td><code>spark.kubernetes.submit.clientCertFile</code></td>
<td>(none)</td>
<td>
Client cert file for authenticating against the Kubernetes API server. This file should be located on the submitting
machine's disk.
</td>
</tr>
<tr>
<td><code>spark.kubernetes.submit.serviceAccountName</code></td>
<td><code>default</code></td>
<td>
Service account that is used when running the driver pod. The driver pod uses this service account when requesting
executor pods from the API server.
</td>
</tr>
<tr>
<td><code>spark.kubernetes.driver.uploads.jars</code></td>
<td>(none)</td>
<td>
Comma-separated list of jars to be sent to the driver and all executors when submitting the application in cluster
mode. Refer to <a href="running-on-kubernetes.html#adding-other-jars">adding other jars</a> for more information.
</td>
</tr>
<tr>
<!-- TODO remove this functionality -->
<td><code>spark.kubernetes.driver.uploads.driverExtraClasspath</code></td>

Author

We probably shouldn't have this, I don't know how common using this will be.

I believe this is fine.

<td>(none)</td>
<td>
Comma-separated list of jars to be sent to the driver only when submitting the application in cluster mode.

is that bit about only when submitting the application in cluster mode relevant? k8s only supports cluster mode now

Author

In client mode when we do end up supporting it, this won't really apply there. I think this is fine even if redundant as we don't have to change this part of the docs down the road.

</td>
</tr>
<tr>
<td><code>spark.kubernetes.executor.memoryOverhead</code></td>
<td>executorMemory * 0.10, with a minimum of 384</td>
<td>
The amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things
like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size
(typically 6-10%).
</td>
</tr>
</table>
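
As a worked example of the `spark.kubernetes.executor.memoryOverhead` default listed in the table, the sketch below
computes `executorMemory * 0.10` with a floor of 384 megabytes. The function name `executor_memory_overhead_mb` is
hypothetical, and the integer truncation of the 10% figure is an assumption of this sketch; the table does not say how
Spark rounds.

```shell
# Hypothetical helper (not part of Spark): default executor memory overhead in
# megabytes, computed as 10% of executor memory with a minimum of 384.
executor_memory_overhead_mb() {
  executor_memory_mb="$1"
  # Integer division truncates the result; rounding is an assumption here.
  overhead=$(( executor_memory_mb / 10 ))
  if [ "$overhead" -lt 384 ]; then
    overhead=384
  fi
  echo "$overhead"
}
```

With these assumptions, an executor with 1024 MB of memory gets the 384 MB floor, and the 10% term only takes over once
executor memory exceeds 3840 MB.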

Put an example here where we try to simplify the spark-submit command as much as possible. Something like:


For example, the first example above can be rewritten from:

   bin/spark-submit
      --deploy-mode cluster
      --class com.example.applications.SampleApplication
      --master k8s://https://192.168.99.100
      --kubernetes-namespace default
      --upload-jars /home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest
      /home/exampleuser/exampleapplication/main.jar

to

   bin/spark-submit
      --class com.example.applications.SampleApplication
      --upload-jars /home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar
      /home/exampleuser/exampleapplication/main.jar

with these contents in spark-defaults.conf:

spark.master k8s://https://192.168.99.100
spark.submit.deployMode cluster
spark.kubernetes.namespace default
spark.kubernetes.driver.docker.image registry-host:5000/spark-driver:latest
spark.kubernetes.executor.docker.image registry-host:5000/spark-executor:latest

Author

Hm, I'm not sure what this extra example gets us that would be helpful for usage that we don't have already - can you elaborate here what the goal is? Having to include spark-defaults.conf as well only makes it seem even more confusing. I mainly modeled the examples off of here: https://spark.apache.org/docs/latest/submitting-applications.html

## Current Limitations

Running Spark on Kubernetes is currently an experimental feature. Some restrictions on the current implementation that
should be lifted in the future include:

* Applications can only use a fixed number of executors. Dynamic allocation is not supported.
* Applications can only run in cluster mode.
* Only Scala and Java applications can be run.