
Conversation

@HyukjinKwon (Member) commented Nov 24, 2020

What changes were proposed in this pull request?

TL;DR:

  • This PR completes the support of archives in Spark itself instead of YARN only
    • It makes the --archives option work in other cluster modes too and adds the spark.archives configuration.
  • After this PR, PySpark users can leverage Conda to ship Python packages together as below:
    conda create -y -n pyspark_env -c conda-forge pyarrow==2.0.0 pandas==1.1.4 conda-pack==0.5.0
    conda activate pyspark_env
    conda pack -f -o pyspark_env.tar.gz
    PYSPARK_DRIVER_PYTHON=python PYSPARK_PYTHON=./environment/bin/python pyspark --archives pyspark_env.tar.gz#environment
  • Issues a warning that the undocumented and hidden behavior of partial archive handling in spark.files / SparkContext.addFile will be deprecated; users can use spark.archives and SparkContext.addArchive instead (see the sketch right after this list).
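
As a rough sketch (not code from this PR; the archive path and app name below are made up), the new spark.archives configuration can also be set programmatically. The fragment after # names the directory the archive is unpacked into in each executor's working directory, mirroring the --archives example above:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch, assuming a pre-built Conda archive at /tmp/pyspark_env.tar.gz.
    // The fragment after '#' names the directory the archive is unpacked into
    // in each executor's working directory.
    val conf = new SparkConf()
      .setAppName("archives-example") // hypothetical app name
      .set("spark.archives", "/tmp/pyspark_env.tar.gz#environment")
    val sc = new SparkContext(conf)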

This PR proposes to add Spark's native --archives option to spark-submit, and the spark.archives configuration. Currently, both are supported only in YARN mode:

./bin/spark-submit --help
Options:
...
 Spark on YARN only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.

This archives feature is often useful when you have to ship a directory and unpack it into executors. One example is native libraries used via e.g. JNI. Another example is shipping Python packages together as a Conda environment.

Especially for Conda, PySpark currently does not have a nice way to ship a package that works in general; please see also https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment (a demo of the new PySpark documentation for 3.1.0).

The neatest way is arguably to ship a zipped Conda environment, but that currently depends on this archive feature. NOTE that we are able to use spark.files by relying on its undocumented behaviour of untarring tar.gz files, but I don't think we should document such workarounds and encourage people to rely on them.

Also, note that this PR does not target feature parity with spark.files.overwrite, spark.files.useFetchCache, etc. yet. I documented this as an experimental feature as well.

Why are the changes needed?

To complete the feature parity, and to provide better support for shipping Python libraries together with a Conda environment.

Does this PR introduce any user-facing change?

Yes, this makes --archives work in Spark itself instead of YARN only, and adds a new configuration, spark.archives.

How was this patch tested?

I added unit tests. I also manually tested in standalone cluster, local-cluster, and local modes.

@HyukjinKwon (Member Author):

Here we cannot rely on new Path(path).toUri because it makes the fragment (#) in the URI part of the path. Utils.resolveURI is used for spark.yarn.dist.archives as well.
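
A small illustration of the difference (not code from the PR; the path below is hypothetical): Hadoop's Path escapes the #, so the fragment ends up inside the path, whereas building the URI directly keeps the fragment separate, which is what Spark needs to know the unpack directory name:

    import java.net.URI
    import org.apache.hadoop.fs.Path

    val archive = "/tmp/pyspark_env.tar.gz#environment" // hypothetical path

    // Hadoop's Path escapes '#', so the fragment becomes part of the path itself.
    val fromPath = new Path(archive).toUri
    println(fromPath.getPath)     // /tmp/pyspark_env.tar.gz#environment
    println(fromPath.getFragment) // null

    // A URI built directly keeps '#environment' as the fragment.
    val fromUri = new URI("file", null, "/tmp/pyspark_env.tar.gz", "environment")
    println(fromUri.getPath)      // /tmp/pyspark_env.tar.gz
    println(fromUri.getFragment)  // environment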

@HyukjinKwon (Member Author):

An archive is not supposed to be a directory.

@HyukjinKwon (Member Author):

For the same reason (to keep the fragment), it uses a URI when it's an archive.

@HyukjinKwon (Member Author):

@tgravescs, @mridulm, @Ngone51, can you take a look when you guys find some time?

@HyukjinKwon (Member Author):

cc @zero323 and @fhoering too FYI. This is related to the docs and shipping 3rd party Python packages in PySpark apps.


@HyukjinKwon (Member Author):

Our spark.files and SparkContext.addFile have a sort of undocumented and hidden behaviour: only on the executor side, they untar files ending in .tar.gz or .tgz. I think it makes sense to deprecate this behaviour and encourage users to use explicit archive handling.

Also, I believe it's a good practice to avoid relying on external programs anyway.
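
As a hedged sketch of that explicit handling (not code from this PR; the archive path and the trivial job are made up), SparkContext.addArchive ships and unpacks the archive for executors instead of relying on addFile's implicit untarring:

    import org.apache.spark.{SparkContext, SparkFiles}

    // Minimal sketch, assuming a local archive /tmp/native_libs.tar.gz exists.
    def shipArchive(sc: SparkContext): Unit = {
      // The '#native_libs' fragment names the directory the archive is unpacked
      // into under each executor's SparkFiles root directory.
      sc.addArchive("/tmp/native_libs.tar.gz#native_libs")

      sc.parallelize(1 to 2).foreach { _ =>
        // On executors, resolve the unpacked directory via SparkFiles.
        val unpackedDir = SparkFiles.get("native_libs")
        println(s"Unpacked archive at: $unpackedDir")
      }
    }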


@dongjoon-hyun (Member) left a comment:

Thank you for generalizing this, @HyukjinKwon. It's great. I left a few comments.

One more question: can we remove spark.yarn.dist.archives by making it an alternative to spark.archives?

@HyukjinKwon (Member Author):

I think it's fine not to remove spark.yarn.dist.archives yet - maybe we could think about removing it once this feature becomes stable (?). YARN also has spark.yarn.dist.files, and it can work together with spark.files as far as I know.



@maropu (Member) left a comment:

Nice feature! I left minor comments.



@HyukjinKwon (Member Author):

Thanks @maropu and @dongjoon-hyun. I believe I addressed the comments.

@HyukjinKwon (Member Author):

cc @mcg1969 too, FYI, regarding conda-pack. With this change, users can use conda-pack in other cluster modes, not only YARN.

@dongjoon-hyun (Member):

+1, LGTM (Pending CI)

@HyukjinKwon (Member Author):

Thank you @dongjoon-hyun!


@dongjoon-hyun (Member):

Retest this please.

@dongjoon-hyun (Member):

The R failure is a flaky one.

@HyukjinKwon (Member Author):

I pushed some more changes to fix some nits, which are all virtually non-code changes (5b1d1c3).



@mridulm (Contributor) commented Nov 25, 2020

Thanks for working on this, @HyukjinKwon!
I have not taken a very detailed look, but I wanted to understand the interaction with the use of the distributed cache in YARN.
How does this coexist with that? Or will it end up causing issues? (Archives coming in via YARN and via the Spark executor?)

@HyukjinKwon (Member Author) commented Nov 26, 2020

It will be exactly the same as spark.files and spark.yarn.dist.files since I am reusing the spark.files code path (likewise for spark.jars vs spark.yarn.dist.jars). To be honest, I am not exactly sure how they resolve conflicts with each other, but both can work together as far as I know.

@HyukjinKwon (Member Author):

@mridulm does it make sense? I'll go ahead if there are no other comments :-).

@mridulm (Contributor) commented Nov 28, 2020

Thanks for the details, @HyukjinKwon!

So my understanding is that the PR takes the archives functionality that already exists in YARN and turns it into general support for other resource managers too - with Spark on YARN continuing to rely on the distributed cache (and there should be no functionality change in YARN mode). That sounds fine to me - but it would be good if @tgravescs could also take a look; I am a bit rusty on some of these now, and IIRC there are a bunch of corner cases in this.

@HyukjinKwon (Member Author) commented Nov 28, 2020

Yeah, that's correct. One thing though: if there's anything wrong in terms of conflicts with the YARN distributed cache (spark.yarn.dist.* vs spark.* like spark.files), I would say that is a separate issue to handle, since I am reusing the existing code path.

@Ngone51 (Member) left a comment:

LGTM after taking another look.

@SparkQA commented Nov 30, 2020

Test build #131961 has finished for PR 30486 at commit e35e94c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) left a comment:

Yeah, the main thing is that YARN still uses the distributed cache and not the Spark deploy mechanism, just like it does for files now. From a quick look, I think this is fine.

Did you manually test it on K8S and YARN at all? That would be nice to make sure nothing unexpected happens.

val file4 = File.createTempFile("someprefix4", "somesuffix4", dir)

val jarFile = new File(dir, "test!@$jar.jar")
val zipFile = new File(dir, "test-zip.zip")
@tgravescs (Contributor):

Are we relying on an existing test for tar.gz and tar?

@HyukjinKwon (Member Author):

I manually tested the tar.gz and tgz cases, and the cases without extensions too. The unpack method is a copy from Hadoop, so I did not test it exhaustively, but I can add the case if that looks better.

@HyukjinKwon (Member Author):

I haven't tested it on K8S yet; it would take me a while. I plan to add an integration test though.

I hope I can proceed with it separately, given that the code freeze is coming and I would like to get this in for Spark 3.1.0.

@tgravescs (Contributor):

Yeah, it's not a blocker.

@HyukjinKwon (Member Author):

Thanks all, @dongjoon-hyun, @maropu, @Ngone51, @mridulm and @tgravescs. Let me merge this in.
I will try to find some time to prepare an integration test with K8S, which hopefully will be added before the Spark 3.1.0 release.

Merged to master.

@HyukjinKwon (Member Author) commented Dec 1, 2020

Oh, maybe I will use tar.gz and tgz in the integration test. That will address #30486 (comment) as well.

I filed a JIRA: SPARK-33615.

@HyukjinKwon deleted the native-archive branch on December 7, 2020.
sarutak pushed a commit that referenced this pull request on Feb 22, 2022:
### What changes were proposed in this pull request?

This PR proposes to add `SparkContext.addArchive` on the PySpark side, matching what was added on the Scala side in #30486.

### Why are the changes needed?

To have API parity with the Scala side.

### Does this PR introduce _any_ user-facing change?

Yes, this PR exposes an API (`SparkContext.addArchive`) that already exists on the Scala side.

### How was this patch tested?

A doctest was added.

Closes #35603 from HyukjinKwon/python-addArchive.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Kousuke Saruta <[email protected]>