[SPARK-33530][CORE] Support --archives and spark.archives option natively #30486
Conversation
Here we cannot rely on `new Path(path).toUri`; it makes the fragment (`#`) in the URI part of the path. `Utils.resolveURI` is used for `spark.yarn.dist.archives` as well.
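To illustrate the difference, here is a minimal sketch (my example; the `hdfs://` path and `#environment` fragment are made up, and `Utils.resolveURI` is package-private to Spark, so this is meant to be read and run inside the Spark codebase):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.util.Utils

// A URI whose fragment is meant to rename the unpacked archive on executors.
val archive = "hdfs:///tmp/pyspark_conda_env.tar.gz#environment"

// Hadoop's Path folds the '#...' into the path component, so the fragment is lost.
println(new Path(archive).toUri.getFragment) // null
println(new Path(archive).toUri.getPath)     // /tmp/pyspark_conda_env.tar.gz#environment

// Utils.resolveURI keeps the fragment, matching how spark.yarn.dist.archives is handled.
println(Utils.resolveURI(archive).getFragment) // environment
println(Utils.resolveURI(archive).getPath)     // /tmp/pyspark_conda_env.tar.gz
```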
Archive is not supposed to be a directory.
For the same reason of keeping the fragment, it uses a URI when it's an archive.
@tgravescs, @mridulm, @Ngone51, can you take a look when you guys find some time?
Our `spark.files` and `SparkContext.addFile` have a sort of undocumented, hidden behaviour: only on the executor side, files are untarred if they are `.tar.gz` or `.tgz`. I think it makes sense to deprecate this behaviour and encourage users to use explicit archive handling.
Also, I believe it's a good practice to avoid relying on external programs anyway.
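As a rough sketch of the explicit handling this encourages (my illustration; the archive path and the `#env` fragment are made up, and the unpacked-location lookup via `SparkFiles` follows the `addArchive` scaladoc):

```scala
import java.io.File
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("archive-example"))

// Explicitly ship an archive; the '#env' fragment names the directory it is unpacked into.
sc.addArchive("hdfs:///tmp/dependencies.tar.gz#env")

// On executors, look up the unpacked directory instead of relying on
// addFile's hidden untar behaviour.
sc.parallelize(1 to 2).foreach { _ =>
  val unpackedDir = new File(SparkFiles.get("env"))
  println(unpackedDir.listFiles().map(_.getName).mkString(", "))
}
```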
dongjoon-hyun left a comment
Thank you for generalizing this, @HyukjinKwon. It's great. I left a few comments.
One more question: can we remove `spark.yarn.dist.archives` by making it an alternative of `spark.archives`?
I think it's fine to avoid removing `spark.yarn.dist.archives`.
maropu left a comment
Nice feature! I left minor comments.
Thanks @maropu and @dongjoon-hyun. I believe I addressed the comments.
FYI @mcg1969 too, regarding conda-pack. With this change, users can use conda-pack in other cluster modes, not only Yarn.
+1, LGTM (Pending CI)
Thank you @dongjoon-hyun!
Retest this please.
The R failure is a flaky one.
I pushed some more changes to fix some nits, which are virtually all non-code changes (5b1d1c3).
Thanks for working on this, @HyukjinKwon!
It will be exactly the same as
@mridulm does it make sense? I'll go ahead if there are no other comments :-).
Thanks for the details @HyukjinKwon! So my understanding is that the PR takes the archives functionality that already exists in Yarn and turns it into general support for other resource managers too - with Spark on Yarn continuing to rely on the distributed cache (and there should be no functionality change in Yarn mode). That sounds fine to me - but it would be good if @tgravescs can also take a look; I am a bit rusty on some of these now, and IIRC there are a bunch of corner cases in this.
Yeah, that's correct. One thing, though: if there's anything wrong in terms of a conflict between the Yarn distributed cache (`spark.yarn.dist.*` vs `spark.*` like `spark.files`), I would say that is a separate issue to handle, since I am reusing the existing code path.
Ngone51 left a comment
LGTM after taking another look.
Test build #131961 has finished for PR 30486 at commit
tgravescs left a comment
Yeah, the main thing is that Yarn still uses the distributed cache and not the Spark deploy mechanism, just like it does for files now. Doing a quick look, I think this is fine.
Did you manually test it on K8S and Yarn at all? That would be nice, to make sure nothing unexpected happens.
val file4 = File.createTempFile("someprefix4", "somesuffix4", dir)
...
val jarFile = new File(dir, "test!@$jar.jar")
val zipFile = new File(dir, "test-zip.zip")
Are we relying on existing tests for tar.gz and tar?
I manually tested the tar.gz and tgz cases, and the cases without extensions too. The unpack method is a copy from Hadoop, so I did not test it exhaustively, but I can add a case if that looks better.
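For reference, a simplified sketch of what extension-based unpacking on top of Hadoop's utilities can look like (illustrative only; the actual helper added in this PR may dispatch differently):

```scala
import java.io.File
import java.util.Locale
import org.apache.hadoop.fs.FileUtil

// Dispatch on the archive extension and delegate to Hadoop's own helpers.
// Jars are zip files, so they are handled by unZip here.
def unpack(source: File, dest: File): Unit = {
  val lowerName = source.getName.toLowerCase(Locale.ROOT)
  if (lowerName.endsWith(".zip") || lowerName.endsWith(".jar")) {
    FileUtil.unZip(source, dest)
  } else if (lowerName.endsWith(".tar.gz") || lowerName.endsWith(".tgz") ||
      lowerName.endsWith(".tar")) {
    FileUtil.unTar(source, dest)
  } else {
    throw new IllegalArgumentException(s"Unsupported archive: ${source.getName}")
  }
}
```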
I haven't tested it on K8S yet; that would take me a while. I plan to add an integration test though. I hope I can proceed with it separately, given that the code freeze is coming and I would like to get this in for Spark 3.1.0.
Yeah, it's not a blocker.
Thanks all @dongjoon-hyun @maropu @Ngone51 @mridulm and @tgravescs. Let me merge this in. Merged to master.
Oh, maybe I will use tar.gz and tgz in the integration test. That will address #30486 (comment) together. I filed a JIRA at SPARK-33615.
### What changes were proposed in this pull request?
This PR proposes to add `SparkContext.addArchive` in the PySpark side that's added in #30486.
### Why are the changes needed?
To have the same API parity with the Scala side.
### Does this PR introduce _any_ user-facing change?
Yes, this PR exposes an API (`SparkContext.addArchive`) that exists in the Scala side.
### How was this patch tested?
Doctest was added.
Closes #35603 from HyukjinKwon/python-addArchive.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Kousuke Saruta <[email protected]>
What changes were proposed in this pull request?
TL;DR:
- It makes the `--archives` option work in other cluster modes too and adds a `spark.archives` configuration.
- The hidden archive handling in `spark.files` / `SparkContext.addFile` will be deprecated, and users can use `spark.archives` and `SparkContext.addArchive` instead.

This PR proposes to add Spark's native `--archives` option in Spark submit, and a `spark.archives` configuration. Currently, both are supported only in Yarn mode.

This `archives` feature is often useful when you have to ship a directory and unpack it into executors. One example is native libraries to use, e.g. via JNI. Another example is shipping Python packages together as a Conda environment.

Especially for Conda, PySpark currently does not have a nice way to ship a package that works in general; please see also https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment (PySpark new documentation demo for 3.1.0).
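For example, a minimal sketch (hypothetical path and names, not taken from the PR description) of pointing the new configuration at a conda-pack'ed environment; the `#environment` fragment names the directory the archive is unpacked into on each executor:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ship a packed Conda environment as an archive through the new configuration.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("conda-archive-example")
  .set("spark.archives", "hdfs:///tmp/pyspark_conda_env.tar.gz#environment")

// Each executor downloads the archive and unpacks it into a directory
// named "environment" before running tasks.
val sc = new SparkContext(conf)
```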
The neatest way is arguably to ship a zipped Conda environment, but that currently depends on this archive feature. Note that we are able to use `spark.files` by relying on its undocumented behaviour of untarring `tar.gz` files, but I don't think we should document such a workaround and promote people to rely on it more.

Also, note that this PR does not yet target feature parity with `spark.files.overwrite`, `spark.files.useFetchCache`, etc. I documented this as an experimental feature as well.

Why are the changes needed?
To complete the feature parity, and to provide a better support of shipping Python libraries together with Conda env.
Does this PR introduce any user-facing change?
Yes, this makes `--archives` work in Spark itself instead of Yarn only, and adds a new configuration, `spark.archives`.

How was this patch tested?
I added unit tests. I also manually tested in standalone cluster, local-cluster, and local modes.