1 change: 1 addition & 0 deletions src/jekyll/contribute.md
@@ -11,6 +11,7 @@
This is a collaborative effort completely in the open source community, involving
different companies who are interested in seeing this feature successfully completed and merged.
Companies currently active in this project include (alphabetically):

* Bloomberg
* Google
* Haiwen
* Hyperpilot
70 changes: 69 additions & 1 deletion src/jekyll/running-on-kubernetes.md
@@ -115,6 +115,74 @@
Finally, notice that in the above example we specify a jar with a specific URI with a scheme of `local://`. This is
the location of the example jar that is already in the Docker image. Using dependencies that are on your machine's local
disk is discussed below.

## Python Support

With the ever-growing adoption of Python among data scientists, we now support the submission of PySpark applications.
These applications follow the general syntax that you would expect from other cluster managers: submitting a PySpark
job is similar to submitting a Java/Scala application, except that you do not supply a class, as is expected for Python.
Here is how you would execute a Spark-Pi example:


    bin/spark-submit \
      --deploy-mode cluster \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --kubernetes-namespace <k8s-namespace> \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=kubespark/driver-py:v2.1.0-kubernetes-0.3.0 \
      --conf spark.kubernetes.executor.docker.image=kubespark/executor-py:v2.1.0-kubernetes-0.3.0 \
      --conf spark.kubernetes.initcontainer.docker.image=kubespark/spark-init:v2.1.0-kubernetes-0.3.0 \
      --jars local:///opt/spark/examples/jars/spark-examples_2.11-2.1.0-k8s-0.3.0-SNAPSHOT.jar \
      local:///opt/spark/examples/src/main/python/pi.py 10
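
The driver runs as a pod in the specified namespace, so you can monitor the job with standard Kubernetes tooling.
A minimal sketch, assuming `kubectl` is configured against the same cluster (the driver pod name below is
illustrative; it is generated from `spark.app.name`):

    # List the pods created for the application in the submission namespace
    kubectl get pods --namespace <k8s-namespace>

    # Stream the driver's logs (substitute the generated driver pod name)
    kubectl logs -f <spark-pi-driver-pod> --namespace <k8s-namespace>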

As is standard for PySpark, `.egg`, `.zip`, and `.py` libraries can be distributed to executors via the `--py-files`
option. We support this as well, as the following example shows:

    bin/spark-submit \
      --deploy-mode cluster \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --kubernetes-namespace <k8s-namespace> \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=kubespark/driver-py:v2.1.0-kubernetes-0.3.0 \
      --conf spark.kubernetes.executor.docker.image=kubespark/executor-py:v2.1.0-kubernetes-0.3.0 \
      --conf spark.kubernetes.initcontainer.docker.image=kubespark/spark-init:v2.1.0-kubernetes-0.3.0 \
      --jars local:///opt/spark/examples/jars/spark-examples_2.11-2.1.0-k8s-0.3.0-SNAPSHOT.jar \
      --py-files local:///opt/spark/examples/src/main/python/sort.py \
      local:///opt/spark/examples/src/main/python/pi.py 10
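
Since `--py-files` takes a comma-separated list, archives can be shipped alongside individual modules in a single
submission. A sketch, where `deps.zip` is a hypothetical archive already baked into the Docker image:

    bin/spark-submit \
      ... \
      --py-files local:///opt/deps.zip,local:///opt/spark/examples/src/main/python/sort.py \
      local:///opt/spark/examples/src/main/python/pi.py 10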


You may also customize your Docker images to install whichever `pip` packages suit your use case. As you can see in
the current `driver-py` Docker image, pip-based package installation is commented out; uncomment those lines to
enable it:

    ...
    ADD examples /opt/spark/examples
    ADD python /opt/spark/python

    RUN apk add --no-cache python && \
        python -m ensurepip && \
        rm -r /usr/lib/python*/ensurepip && \
        pip install --upgrade pip setuptools && \
        rm -r /root/.cache
    # UNCOMMENT THE FOLLOWING TO START PIP INSTALLING PYTHON PACKAGES
    # RUN apk add --update alpine-sdk python-dev
    # RUN pip install numpy
    ...
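
Alternatively, rather than editing the bundled Dockerfile, you can layer extra packages on top of the published
image. A minimal sketch, following the commented-out lines above (the image tag and package list are illustrative):

    FROM kubespark/driver-py:v2.1.0-kubernetes-0.3.0

    # Build tooling required to compile native extensions on Alpine
    RUN apk add --update alpine-sdk python-dev

    # Install whichever pip packages your application needs
    RUN pip install numpy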

You can also bake into your Docker image whichever PySpark files you wish to include by simply appending your file
(i.e. `MY_PYSPARK_FILE`) to the following exec command:

    ...
    CMD SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && \
        if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && \
        if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && \
        if ! [ -z ${SPARK_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && \
        if ! [ -z ${SPARK_MOUNTED_FILES_DIR} ]; then cp -R "$SPARK_MOUNTED_FILES_DIR/." .; fi && \
        exec /sbin/tini -- ${JAVA_HOME}/bin/java $SPARK_DRIVER_JAVA_OPTS -cp $SPARK_CLASSPATH \
          -Xms$SPARK_DRIVER_MEMORY -Xmx$SPARK_DRIVER_MEMORY \
          $SPARK_DRIVER_CLASS $PYSPARK_PRIMARY MY_PYSPARK_FILE,$PYSPARK_FILES $SPARK_DRIVER_ARGS
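
For example, to bake a hypothetical helper module `my_library.py` into the image, add it in the Dockerfile and
substitute its in-container path for `MY_PYSPARK_FILE` in the exec command above. A sketch:

    # In the Dockerfile: copy the module into the image
    ADD my_library.py /opt/spark/my_library.py

    # In the exec command: the module is prepended to the comma-separated file list
    ... $SPARK_DRIVER_CLASS $PYSPARK_PRIMARY /opt/spark/my_library.py,$PYSPARK_FILES $SPARK_DRIVER_ARGS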


## Dependency Management

Application dependencies that are being submitted from your machine need to be sent to a **resource staging server**
@@ -686,4 +754,4 @@
from the other deployment modes. See the [configuration page](configuration.html).
Running Spark on Kubernetes is currently an experimental feature. Some restrictions on the current implementation that
should be lifted in the future include:
* Applications can only run in cluster mode.
- * Only Scala and Java applications can be run.
+ * Only Scala, Java, and Python applications can be run.