Description
I extended the Docker image from the recent spark-2.2.0-k8s-0.4.0-bin-2.7.3 release to add the GCS (Google Cloud Storage) connector.
Observed:
It works great for Scala jobs / jars with a gs://<bucket>/ prefix: I can see that it creates the init-container and populates the spark-files from what was already in GCS. However, when I try to submit a Python job (or use --py-files), the spark-submit client does not allow the gs:// prefix and refuses the job:
Error: Only local python files are supported: gs://<my_bucket_name>/pi.py
Run with --help for usage help or --verbose for debug output
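
For reference, the failing submission looks roughly like this (apiserver address, registry, and bucket are placeholders for my setup; flag and property names follow the 0.4.0 release docs as best I recall):

bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://https://<k8s-apiserver>:<port> \
  --kubernetes-namespace default \
  --conf spark.app.name=spark-pi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.driver.docker.image=<registry>/spark-driver-py:v2.2.0-kubernetes-0.4.0 \
  --conf spark.kubernetes.executor.docker.image=<registry>/spark-executor-py:v2.2.0-kubernetes-0.4.0 \
  --conf spark.kubernetes.initcontainer.docker.image=<registry>/spark-init:v2.2.0-kubernetes-0.4.0 \
  gs://<my_bucket_name>/pi.py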
Expected:
The job should be accepted by spark-submit, the relevant files staged by the init-container, and then made available for spark-driver-py and spark-executor-py to use successfully.
(FYI: to add the GCS connector, I added these lines to the spark-base Dockerfile:)
ENV hadoop_ver 2.7.4

# Add the Hadoop 2.x distribution (for its native libs)
ADD http://www.us.apache.org/dist/hadoop/common/hadoop-${hadoop_ver}/hadoop-${hadoop_ver}.tar.gz /opt/
RUN cd /opt/ && \
    tar xf hadoop-${hadoop_ver}.tar.gz && \
    ln -s hadoop-${hadoop_ver} hadoop

# Add the GCS connector jar to Spark's classpath
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar ${SPARK_HOME}/jars/
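
(Also FYI, in case the GCS wiring matters for the Python path: the submission carries the standard gcs-connector Hadoop settings, roughly as below. The keyfile path is a placeholder, and the exact auth mechanism may differ per cluster.)

# Register the gs:// filesystem and authenticate with a service account
--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
--conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
--conf spark.hadoop.google.cloud.auth.service.account.enable=true \
--conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/keyfile.json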