
Conversation

@HyukjinKwon (Member) commented Nov 24, 2020

What changes were proposed in this pull request?

TL;DR:

  • This PR completes the support of archives in Spark itself instead of YARN only
    • It makes the --archives option work in other cluster modes too and adds the spark.archives configuration.
  • After this PR, PySpark users can leverage Conda to ship Python packages together as below:
    conda create -y -n pyspark_env -c conda-forge pyarrow==2.0.0 pandas==1.1.4 conda-pack==0.5.0
    conda activate pyspark_env
    conda pack -f -o pyspark_env.tar.gz
    PYSPARK_DRIVER_PYTHON=python PYSPARK_PYTHON=./environment/bin/python pyspark --archives pyspark_env.tar.gz#environment
  • Issues a warning that the undocumented and hidden behavior of partial archive handling in spark.files / SparkContext.addFile will be deprecated; users can use spark.archives and SparkContext.addArchive instead (see the sketch right after this list).
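
As a rough sketch (not code from this PR; the archive path and app name below are made up), the new spark.archives configuration can also be set programmatically. The fragment after # names the directory the archive is unpacked into in each executor's working directory, mirroring the --archives example above:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch, assuming a pre-built Conda archive at /tmp/pyspark_env.tar.gz.
    // The fragment after '#' names the directory the archive is unpacked into
    // in each executor's working directory.
    val conf = new SparkConf()
      .setAppName("archives-example") // hypothetical app name
      .set("spark.archives", "/tmp/pyspark_env.tar.gz#environment")
    val sc = new SparkContext(conf)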

This PR proposes to add Spark's native --archives option to spark-submit, and the spark.archives configuration. Currently, both are supported only in YARN mode:

./bin/spark-submit --help
Options:
...
 Spark on YARN only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.

This archives feature is often useful when you have to ship a directory and unpack it into executors. One example is native libraries used via e.g. JNI. Another example is shipping Python packages together as a Conda environment.

Especially for Conda, PySpark currently does not have a nice way to ship a package that works in general; please see also https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment (a demo of the new PySpark documentation for 3.1.0).

The neatest way is arguably to ship a zipped Conda environment, but that currently depends on this archive feature. NOTE that we are able to use spark.files by relying on its undocumented behaviour of untarring tar.gz files, but I don't think we should document such workarounds and encourage people to rely on them.

Also, note that this PR does not target feature parity with spark.files.overwrite, spark.files.useFetchCache, etc. yet. I documented this as an experimental feature as well.

Why are the changes needed?

To complete the feature parity, and to provide better support for shipping Python libraries together with a Conda environment.

Does this PR introduce any user-facing change?

Yes, this makes --archives work in Spark itself instead of YARN only, and adds a new configuration, spark.archives.

How was this patch tested?

I added unit tests. I also manually tested in standalone cluster, local-cluster, and local modes.

@HyukjinKwon (Member Author):

Here we cannot rely on new Path(path).toUri because it makes the fragment (#) in the URI part of the path. Utils.resolveURI is used for spark.yarn.dist.archives as well.
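
A small illustration of the difference (not code from the PR; the path below is hypothetical): Hadoop's Path escapes the #, so the fragment ends up inside the path, whereas building the URI directly keeps the fragment separate, which is what Spark needs to know the unpack directory name:

    import java.net.URI
    import org.apache.hadoop.fs.Path

    val archive = "/tmp/pyspark_env.tar.gz#environment" // hypothetical path

    // Hadoop's Path escapes '#', so the fragment becomes part of the path itself.
    val fromPath = new Path(archive).toUri
    println(fromPath.getPath)     // /tmp/pyspark_env.tar.gz#environment
    println(fromPath.getFragment) // null

    // A URI built directly keeps '#environment' as the fragment.
    val fromUri = new URI("file", null, "/tmp/pyspark_env.tar.gz", "environment")
    println(fromUri.getPath)      // /tmp/pyspark_env.tar.gz
    println(fromUri.getFragment)  // environment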

@HyukjinKwon (Member Author):

An archive is not supposed to be a directory.

@HyukjinKwon (Member Author):

For the same reason (to keep the fragment), it uses a URI when it's an archive.

@HyukjinKwon (Member Author):

@tgravescs, @mridulm, @Ngone51, can you take a look when you guys find some time?

@HyukjinKwon (Member Author):

cc @zero323 and @fhoering too FYI. This is related to the docs and shipping 3rd party Python packages in PySpark apps.


@HyukjinKwon (Member Author):

Our spark.files and SparkContext.addFile have a sort of undocumented and hidden behaviour: only on the executor side, they untar files ending in .tar.gz or .tgz. I think it makes sense to deprecate this behaviour and encourage users to use explicit archive handling.

Also, I believe it's a good practice to avoid relying on external programs anyway.
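
As a hedged sketch of that explicit handling (not code from this PR; the archive path and the trivial job are made up), SparkContext.addArchive ships and unpacks the archive for executors instead of relying on addFile's implicit untarring:

    import org.apache.spark.{SparkContext, SparkFiles}

    // Minimal sketch, assuming a local archive /tmp/native_libs.tar.gz exists.
    def shipArchive(sc: SparkContext): Unit = {
      // The '#native_libs' fragment names the directory the archive is unpacked
      // into under each executor's SparkFiles root directory.
      sc.addArchive("/tmp/native_libs.tar.gz#native_libs")

      sc.parallelize(1 to 2).foreach { _ =>
        // On executors, resolve the unpacked directory via SparkFiles.
        val unpackedDir = SparkFiles.get("native_libs")
        println(s"Unpacked archive at: $unpackedDir")
      }
    }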


@dongjoon-hyun (Member) left a comment:

Thank you for generalizing this, @HyukjinKwon. It's great. I left a few comments.

One more question: can we remove spark.yarn.dist.archives by making it an alternative to spark.archives?

@HyukjinKwon (Member Author):

I think it's fine not to remove spark.yarn.dist.archives yet - maybe we could think about removing it once this feature becomes stable (?). YARN also has spark.yarn.dist.files, and it can work together with spark.files as far as I know.



@maropu (Member) left a comment:

Nice feature! I left minor comments.



@HyukjinKwon (Member Author):

Thanks @maropu and @dongjoon-hyun. I believe I addressed the comments.

@HyukjinKwon (Member Author):

cc @mcg1969 too, FYI, regarding conda-pack. With this change, users can use conda-pack in other cluster modes, not only YARN.

@dongjoon-hyun (Member):

+1, LGTM (Pending CI)

@HyukjinKwon (Member Author):

Thank you @dongjoon-hyun!


@dongjoon-hyun (Member):

Retest this please.

@dongjoon-hyun (Member):

The R failure is a flaky one.

@HyukjinKwon (Member Author):

I pushed some more changes to fix some nits, which are all virtually non-code changes (5b1d1c3).



@mridulm (Contributor) commented Nov 25, 2020

Thanks for working on this, @HyukjinKwon!
I have not taken a very detailed look, but I wanted to understand the interaction with the use of the distributed cache in YARN.
How does this coexist with that? Or will it end up causing issues? (Archives coming in via YARN and via the Spark executor?)

@HyukjinKwon (Member Author) commented Nov 26, 2020

It will be exactly the same as spark.files and spark.yarn.dist.files since I am reusing the spark.files code path (likewise for spark.jars vs spark.yarn.dist.jars). To be honest, I am not exactly sure how they resolve conflicts with each other, but both can work together as far as I know.

@HyukjinKwon (Member Author):

@mridulm does it make sense? I'll go ahead if there are no other comments :-).

@mridulm (Contributor) commented Nov 28, 2020

Thanks for the details, @HyukjinKwon!

So my understanding is that the PR takes the archives functionality that already exists in YARN and turns it into general support for other resource managers too - with Spark on YARN continuing to rely on the distributed cache (and there should be no functionality change in YARN mode). That sounds fine to me - but it would be good if @tgravescs could also take a look; I am a bit rusty on some of these now, and IIRC there are a bunch of corner cases in this.

@HyukjinKwon (Member Author) commented Nov 28, 2020

Yeah, that's correct. One thing though: if there's anything wrong in terms of conflicts with the YARN distributed cache (spark.yarn.dist.* vs spark.* like spark.files), I would say that is a separate issue to handle, since I am reusing the existing code path.

@Ngone51 (Member) left a comment:

LGTM after taking another look.

@SparkQA commented Nov 30, 2020

Test build #131961 has finished for PR 30486 at commit e35e94c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) left a comment:

Yeah, the main thing is that YARN still uses the distributed cache and not the Spark deploy mechanism, just like it does for files now. From a quick look, I think this is fine.

Did you manually test it on K8S and YARN at all? That would be nice to make sure nothing unexpected happens.

val file4 = File.createTempFile("someprefix4", "somesuffix4", dir)

val jarFile = new File(dir, "test!@$jar.jar")
val zipFile = new File(dir, "test-zip.zip")
@tgravescs (Contributor):

Are we relying on an existing test for tar.gz and tar?

@HyukjinKwon (Member Author):

I manually tested the tar.gz and tgz cases, and the cases without extensions too. The unpack method is a copy from Hadoop, so I did not test it exhaustively, but I can add the case if that looks better.

@HyukjinKwon (Member Author):

I haven't tested it on K8S yet; it would take me a while. I plan to add an integration test though.

I hope I can proceed with it separately, given that the code freeze is coming and I would like to get this in for Spark 3.1.0.

@tgravescs (Contributor):

Yeah, it's not a blocker.

@HyukjinKwon (Member Author):

Thanks all, @dongjoon-hyun, @maropu, @Ngone51, @mridulm and @tgravescs. Let me merge this in.
I will try to find some time to prepare an integration test with K8S, which hopefully will be added before the Spark 3.1.0 release.

Merged to master.

@HyukjinKwon (Member Author) commented Dec 1, 2020

Oh, maybe I will use tar.gz and tgz in the integration test. That will address #30486 (comment) as well.

I filed a JIRA: SPARK-33615.

@HyukjinKwon deleted the native-archive branch on December 7, 2020.
sarutak pushed a commit that referenced this pull request on Feb 22, 2022:
### What changes were proposed in this pull request?

This PR proposes to add `SparkContext.addArchive` on the PySpark side, matching what was added on the Scala side in #30486.

### Why are the changes needed?

To have API parity with the Scala side.

### Does this PR introduce _any_ user-facing change?

Yes, this PR exposes an API (`SparkContext.addArchive`) that already exists on the Scala side.

### How was this patch tested?

A doctest was added.

Closes #35603 from HyukjinKwon/python-addArchive.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Kousuke Saruta <[email protected]>