
Commit 73c4242

mccheah authored and ash211 committed
Documentation for the current state of the world (#16)
* Documentation for the current state of the world.
* Adding navigation links from other pages
* Address comments, add TODO for things that should be fixed
* Address comments, mostly making images section clearer
* Virtual runtime -> container runtime
1 parent 323e0d7 commit 73c4242


3 files changed: +226 -0 lines changed


docs/_layouts/global.html

Lines changed: 1 addition & 0 deletions
@@ -99,6 +99,7 @@
  <li><a href="spark-standalone.html">Spark Standalone</a></li>
  <li><a href="running-on-mesos.html">Mesos</a></li>
  <li><a href="running-on-yarn.html">YARN</a></li>
+ <li><a href="running-on-kubernetes.html">Kubernetes</a></li>
  </ul>
  </li>

docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -113,6 +113,7 @@ options for deployment:
  * [Mesos](running-on-mesos.html): deploy a private cluster using
    [Apache Mesos](http://mesos.apache.org)
  * [YARN](running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)
+ * [Kubernetes](running-on-kubernetes.html): deploy Spark on top of Kubernetes

  **Other Documents:**

docs/running-on-kubernetes.md

Lines changed: 224 additions & 0 deletions
@@ -0,0 +1,224 @@
---
layout: global
title: Running Spark on Kubernetes
---

Support for running on [Kubernetes](https://kubernetes.io/) is available in experimental status. The feature set is
currently limited and not well-tested. This should not be used in production environments.

## Setting Up Docker Images

Kubernetes requires users to supply images that can be deployed into containers within pods. The images are built to
be run in a container runtime environment that Kubernetes supports. Docker is a container runtime environment that is
frequently used with Kubernetes, so Spark provides some support for working with Docker to get started quickly.

To use Spark on Kubernetes with Docker, images for the driver and the executors need to be built and published to an
accessible Docker registry. Spark distributions include the Docker files for the driver and the executor at
`dockerfiles/driver/Dockerfile` and `dockerfiles/executor/Dockerfile`, respectively. Use these Docker files to build the
Docker images, and then tag them with the registry that the images should be sent to. Finally, push the images to the
registry.

For example, if the registry host is `registry-host` and the registry is listening on port 5000:

    cd $SPARK_HOME
    docker build -t registry-host:5000/spark-driver:latest -f dockerfiles/driver/Dockerfile .
    docker build -t registry-host:5000/spark-executor:latest -f dockerfiles/executor/Dockerfile .
    docker push registry-host:5000/spark-driver:latest
    docker push registry-host:5000/spark-executor:latest

## Submitting Applications to Kubernetes

Kubernetes applications can be executed via `spark-submit`. For example, to compute the value of pi, assuming the images
are set up as described above:

    bin/spark-submit \
      --deploy-mode cluster \
      --class org.apache.spark.examples.SparkPi \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --kubernetes-namespace default \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      examples/jars/spark_2.11-2.2.0.jar

<!-- TODO master should default to https if no scheme is specified -->
The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url>`. Prefixing the
master string with `k8s://` will cause the Spark application to launch on the Kubernetes cluster, with the API server
being contacted at `api_server_url`. The HTTP scheme (`http://` or `https://`) must also be included in the URL.

Note that applications can currently only be executed in cluster mode, where the driver and its executors are running on
the cluster.
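
For illustration, the two ways of specifying the master described above are interchangeable. Assuming an API server at
`192.168.99.100`, as in the examples further below, either of the following has the same effect (the
`conf/spark-defaults.conf` line is just one place the configuration could live):

    # On the spark-submit command line:
    bin/spark-submit --master k8s://https://192.168.99.100 ...

    # Or as configuration, e.g. in conf/spark-defaults.conf:
    spark.master    k8s://https://192.168.99.100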

### Adding Other JARs

Spark allows users to provide dependencies that are bundled into the driver's Docker image, or that are on the local
disk of the submitter's machine. These two types of dependencies are specified via different configuration options to
`spark-submit`:

* Local jars provided by specifying the `--jars` command line argument to `spark-submit`, or by setting `spark.jars` in
the application's configuration, will be treated as jars that are located on the *disk of the driver Docker
container*. This only applies to jar paths that do not specify a scheme or that have the scheme `file://`. Paths with
other schemes are fetched from their appropriate locations.
* Local jars provided by specifying the `--upload-jars` command line argument to `spark-submit`, or by setting
`spark.kubernetes.driver.uploads.jars` in the application's configuration, will be treated as jars that are located on
the *disk of the submitting machine*. These jars are uploaded to the driver Docker container before executing the
application.
<!-- TODO support main resource bundled in the Docker image -->
* A main application resource path that does not have a scheme or that has the scheme `file://` is assumed to be on the
*disk of the submitting machine*. This resource is uploaded to the driver Docker container before executing the
application. A remote path can still be specified and the resource will be fetched from the appropriate location.

In all of these cases, the jars are placed on the driver's classpath, and are also sent to the executors. Below are some
examples of providing application dependencies.

To submit an application with both the main resource and two other jars living on the submitting user's machine:

    bin/spark-submit \
      --deploy-mode cluster \
      --class com.example.applications.SampleApplication \
      --master k8s://https://192.168.99.100 \
      --kubernetes-namespace default \
      --upload-jars /home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      /home/exampleuser/exampleapplication/main.jar

Note that since passing the jars through the `--upload-jars` command line argument is equivalent to setting the
`spark.kubernetes.driver.uploads.jars` Spark property, the above will behave identically to this command:

    bin/spark-submit \
      --deploy-mode cluster \
      --class com.example.applications.SampleApplication \
      --master k8s://https://192.168.99.100 \
      --kubernetes-namespace default \
      --conf spark.kubernetes.driver.uploads.jars=/home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      /home/exampleuser/exampleapplication/main.jar

To specify a main application resource that can be downloaded from an HTTP service, and if a plugin for that application
is located in the jar `/opt/spark-plugins/app-plugin.jar` on the Docker image's disk:

    bin/spark-submit \
      --deploy-mode cluster \
      --class com.example.applications.PluggableApplication \
      --master k8s://https://192.168.99.100 \
      --kubernetes-namespace default \
      --jars /opt/spark-plugins/app-plugin.jar \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver-custom:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      http://example.com:8080/applications/sparkpluggable/app.jar

Note that since passing the jars through the `--jars` command line argument is equivalent to setting the `spark.jars`
Spark property, the above will behave identically to this command:

    bin/spark-submit \
      --deploy-mode cluster \
      --class com.example.applications.PluggableApplication \
      --master k8s://https://192.168.99.100 \
      --kubernetes-namespace default \
      --conf spark.jars=file:///opt/spark-plugins/app-plugin.jar \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver-custom:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      http://example.com:8080/applications/sparkpluggable/app.jar

### Spark Properties

Below are some other common properties that are specific to Kubernetes. Most of the other configurations are the same
as for the other deployment modes. See the [configuration page](configuration.html) for more information on those.

<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
  <td><code>spark.kubernetes.namespace</code></td>
  <!-- TODO set default to "default" -->
  <td>(none)</td>
  <td>
    The namespace that will be used for running the driver and executor pods. Must be specified. When using
    <code>spark-submit</code> in cluster mode, this can also be passed to <code>spark-submit</code> via the
    <code>--kubernetes-namespace</code> command line argument.
  </td>
</tr>
<tr>
  <td><code>spark.kubernetes.driver.docker.image</code></td>
  <td><code>spark-driver:2.2.0</code></td>
  <td>
    Docker image to use for the driver. Specify this using the standard
    <a href="https://docs.docker.com/engine/reference/commandline/tag/">Docker tag</a> format.
  </td>
</tr>
<tr>
  <td><code>spark.kubernetes.executor.docker.image</code></td>
  <td><code>spark-executor:2.2.0</code></td>
  <td>
    Docker image to use for the executors. Specify this using the standard
    <a href="https://docs.docker.com/engine/reference/commandline/tag/">Docker tag</a> format.
  </td>
</tr>
<tr>
  <td><code>spark.kubernetes.submit.caCertFile</code></td>
  <td>(none)</td>
  <td>
    CA cert file for connecting to Kubernetes over SSL. This file should be located on the submitting machine's disk.
  </td>
</tr>
<tr>
  <td><code>spark.kubernetes.submit.clientKeyFile</code></td>
  <td>(none)</td>
  <td>
    Client key file for authenticating against the Kubernetes API server. This file should be located on the submitting
    machine's disk.
  </td>
</tr>
<tr>
  <td><code>spark.kubernetes.submit.clientCertFile</code></td>
  <td>(none)</td>
  <td>
    Client cert file for authenticating against the Kubernetes API server. This file should be located on the submitting
    machine's disk.
  </td>
</tr>
<tr>
  <td><code>spark.kubernetes.submit.serviceAccountName</code></td>
  <td><code>default</code></td>
  <td>
    Service account that is used when running the driver pod. The driver pod uses this service account when requesting
    executor pods from the API server.
  </td>
</tr>
<tr>
  <td><code>spark.kubernetes.driver.uploads.jars</code></td>
  <td>(none)</td>
  <td>
    Comma-separated list of jars to be sent to the driver and all executors when submitting the application in cluster
    mode. Refer to <a href="running-on-kubernetes.html#adding-other-jars">adding other jars</a> for more information.
  </td>
</tr>
<tr>
  <!-- TODO remove this functionality -->
  <td><code>spark.kubernetes.driver.uploads.driverExtraClasspath</code></td>
  <td>(none)</td>
  <td>
    Comma-separated list of jars to be sent to the driver only when submitting the application in cluster mode.
  </td>
</tr>
<tr>
  <td><code>spark.kubernetes.executor.memoryOverhead</code></td>
  <td>executorMemory * 0.10, with minimum of 384</td>
  <td>
    The amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things
    like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size
    (typically 6-10%).
  </td>
</tr>
</table>
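
For illustration, several of the submission properties above can be combined on a single `spark-submit` invocation; the
certificate paths, service account name, and overhead value below are placeholders rather than defaults:

    bin/spark-submit \
      --deploy-mode cluster \
      --class com.example.applications.SampleApplication \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --kubernetes-namespace default \
      --conf spark.kubernetes.submit.caCertFile=/path/to/ca.pem \
      --conf spark.kubernetes.submit.clientKeyFile=/path/to/client.key \
      --conf spark.kubernetes.submit.clientCertFile=/path/to/client.pem \
      --conf spark.kubernetes.submit.serviceAccountName=spark \
      --conf spark.kubernetes.executor.memoryOverhead=1024 \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      /home/exampleuser/exampleapplication/main.jar

Per the default formula in the table, an executor with 8g of memory would otherwise receive max(8192 * 0.10, 384),
about 819 MB, of overhead; setting `spark.kubernetes.executor.memoryOverhead` overrides that value.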

## Current Limitations

Running Spark on Kubernetes is currently an experimental feature. Some restrictions on the current implementation that
should be lifted in the future include:

* Applications can only use a fixed number of executors. Dynamic allocation is not supported.
* Applications can only run in cluster mode.
* Only Scala and Java applications can be run.
