[SPARK-1267][PYSPARK] Adds pip installer for pyspark #8318
Conversation
Can one of the admins verify this patch?
Is there a way to source this from some existing place? That way we don't have to update the version string in multiple places. I forget where, but there should already be a central place where the version is set.
I'm not seeing any version that's specific to pyspark, only a version for spark as a whole. I agree that we don't want to set a version in multiple places, but I think the one I introduced is the only version unique to pyspark.
An alternative, but trickier, idea would be to have mvn's pom.xml version be the authoritative one, and during the build process have it add or modify that file to match the version (maybe using mvn resource filtering?). This would break being able to just "pip install -e python" in development mode, since people would have to remember to run the mvn command to sync the file over, but at least there would be no risk of the two going out of sync in the build.
I'm not sure I entirely follow. Are you suggesting that when Spark is built, Maven creates this pyspark_version file as a part of the build process? If so, how does this affect a user who installs from PyPI?
We still need to build a sdist and wheel, so we can just make sure that whatever process we use adds that file in. Not sure if it's really worth the complexity at this moment, but my team does something internally such that our python and java code both get semantic versions based off of the latest tag and the git hash.
I think it's error-prone to have multiple copies of the version in different places; if someone forgets to update one, PySpark will break (even within the repo).
I'd vote for generating the version while generating the PyPI package. If PySpark comes bundled with Spark, we don't need this check (at least it shouldn't fail or be slow).
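A minimal sketch of what generating the version at packaging time could look like (the target file name and the SPARK_VERSION environment variable are assumptions about what the release tooling would provide, not part of this PR):

```python
# Hypothetical packaging-time step: write pyspark/version.py from a single
# authoritative source (here an environment variable the release tooling is
# assumed to export), so the Python package never carries a hand-edited copy.
import os


def write_version_file(version, target="pyspark/version.py"):
    with open(target, "w") as f:
        f.write('__version__ = "%s"\n' % version)


if __name__ == "__main__":
    write_version_file(os.environ["SPARK_VERSION"])  # assumed build variable
```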
So we remove the version checks entirely in the bundled version, and include them for the package uploaded to PyPI? I agree that this reduces the chance for maintainer error, but I'm worried about users upgrading versions of Spark. A user could install a bundled version of pyspark, and then later point their SPARK_HOME at a newer version of Spark. There would then be a version mismatch that wouldn't be detected.
Maybe a middle ground could be to include the version checks in both bundled and pip installations, but to include a check during PyPI package generation that the version has been properly set.
How is the version number specified for the Scala side now?
I'm not sure. Could someone with more experience with that side of the project chime in?
I am in favor of pyspark packaging the corresponding version of Spark. As a user experience, this is cleaner, requires fewer steps, and is more natural and in line with other pip-installable libraries. I have experience packaging jars with python libraries in platform-independent ways and would be happy to help if wanted.
@justinuang and @nchammas thanks for the feedback; I've made the suggested changes.
python/setup.py
This is maybe asking for too much, but in Sparkling Pandas we install our own assembly jar*; would it maybe make sense to do that as part of this process?
(*and getting it working has been painful, but doable).
I'm not familiar with assembly jars, so please correct me if I'm wrong, but I think that we shouldn't need one for pyspark as it is entirely python code. Wouldn't we only need an assembly jar if we were also looking to package scala or java code?
So by assembly JAR in this case I'd be referring to the Spark assembly jar (which we would want to package as an artifact along with submit scripts if we wanted to put this on PyPI, but that might not be an immediate goal).
So if SPARK_HOME was set, it would use that Spark installation, and default to the packaged JAR otherwise? Depending on the size of the assembly JAR, I'd be in favor of this, as it makes installation very easy for those who only want to interact with Spark through pyspark, but the discussion on the mailing list seemed to intentionally shy away from too large of a PyPI package. I'll bring up your suggestion to see if there's wider support, and I encourage you to join the discussion here: http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-on-PyPi-td12626.html
As was discussed on the list, I think it makes sense to hold off on the jar at first. It's definitely worth revisiting down the line though.
@holdenk, thanks for working on this! Do we have plans to set up PyPI publishing?
python/pyspark/__init__.py
What about starting to add some of the logic from findspark to autodetect SPARK_HOME?
+1
findspark repo: https://github.com/minrk/findspark
Currently, the only autodetection logic that findspark has is to check where Homebrew installs Spark on OS X. I think that for pyspark this is overly specific and brittle, and it can lead to confusion if the user wants to use pyspark with a different version than the one installed by Homebrew. Having the user set SPARK_HOME themselves makes the process unambiguous.
I agree with @alope107. In addition, if people are using spark-submit, then this isn't necessary, right? spark-submit sets up SPARK_HOME automatically.
Are people launching python apps frequently without using spark-submit?
One of the common use cases for pyspark without spark-submit is running from a notebook environment. I think there is a decent number of people that do this, and that more will once it's easier to do so.
That's possible via the following:
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook' spark-1.4.0-bin-hadoop2.4/bin/pyspark
Not completely discoverable, but it works =)
I don't really understand this part of the code, so it would be nice to get some core devs to chime in, but it looks like ./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala does a lot of logic, especially when deploying against YARN, that seems important.
Just to clarify, it's been fairly easy to get the pyspark launcher to launch via a notebook, either doing what @justinuang said, or just setting IPYTHON=1 (which basically does the same).
But there are scenarios where it's useful to forgo the pyspark launcher entirely, i.e. launch with ipython and then do all the Spark-related stuff. One key use case is in containerized notebook deployments (like tmpnb) where we want a way to launch/deploy notebooks in a generic way (e.g. with ipython), but still let someone import and launch a SparkContext after the fact.
This PR is a great step, and we could get closer to that goal by adding more autodetection / path-setting logic (as pointed out by @Carreau). But I agree with @alope107 that it might be too brittle, and it would definitely be some work to support all the config / deployment modes that SparkSubmit handles now (which themselves change across releases). I suspect that's why the core devs have tried to force the language APIs to go through a common launcher, but it would be good to get more input.
if os.environ.get("SPARK_HOME") is None:
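For context, a minimal sketch of the kind of guard this line could belong to (the use of ImportError and the wording of the message are illustrative, not the PR's actual code):

```python
import os

spark_home = os.environ.get("SPARK_HOME")
if spark_home is None:
    # Fail fast with an actionable message instead of a confusing import error later.
    raise ImportError(
        "SPARK_HOME is not set. Point it at a Spark installation "
        "(e.g. export SPARK_HOME=/path/to/spark) before importing pyspark.")
```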
👍 Very excited about this. I'm assuming that for deployments one will pin both the pyspark version from PyPI and the Spark version they're using?
@rgbkrk Yes, as the pyspark and Spark versions must match each other exactly, it makes sense for deployments to pin both.
@davies, what is this blocked on?
@justinuang We can work on this now, will review it this week.
Thanks! Sorry for being demanding; I was just hoping to get this into 1.6.0!
python/pyspark/__init__.py
There is no pom file inside the released bin package; I think we should look for another way to find out the Spark version.
Good catch; I hadn't noticed that. The only other standard way I'm seeing to get the version is by instantiating a Java Spark Context and querying its version. Is this acceptable, or is there a more lightweight solution?
There should be a version number in the path of the assembly jar; could we use that?
Looks like @alope107 added fixes for the last few reviews. How does this stand now?
Added a check for the version number in the assembly jar if pom.xml is not present. @davies, is this what you had in mind?
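A rough sketch of what deriving the version from the assembly jar name might look like (the lib/ location and file-name pattern are assumptions about the binary distribution layout, not the PR's actual code):

```python
import glob
import os
import re


def spark_version_from_assembly(spark_home):
    # Binary distributions of this era ship something like
    # lib/spark-assembly-<version>-hadoop<...>.jar; pull the version out of the name.
    jars = glob.glob(os.path.join(spark_home, "lib", "spark-assembly-*.jar"))
    if not jars:
        return None
    match = re.search(r"spark-assembly-(\d+\.\d+\.\d+)", os.path.basename(jars[0]))
    return match.group(1) if match else None
```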
@davies Could we get a review on this? I'd like to know what is still blocking it.
python/setup.py
So this is now out of date (we're up to 0.9 :))
@alope107, could you update the py4j dependency? I would really like to see this merged as well =)
Since we will always need this branch, can we remove the other one (and always find the version from the assembly jar)?
@alope107 Can we go ahead and find the version only from the spark-assembly jar?
@alope107 , would you mind updating this PR to remove the pom_xml_file_path branch? Thanks!
@davies - Does this PR have a realistic chance of making it in for Spark 2.0? If not, are we held up by implementation details or by a more fundamental problem with the idea of packaging PySpark? FYI, someone earlier here mentioned using findspark as a workaround for not being able to pip install PySpark. I wonder if anyone's used it and how helpful it's been for their use cases.
I do not have the bandwidth to work on this (and I don't think it's high priority); someone could take this over to move it forward.
No worries. I just wanted to make sure that the idea was sound since there were concerns early on about whether we should even try to package PySpark independently.
Pyspark being pip installable would be useful to many users. I've packaged [...]
Making this change is going to create a new step in the build and [...]
Is Spark supported/expected to work on Windows? I'm confident that the [...]
@jhlch - I think this will be a tough feature to get in, honestly, but if you want to take a fresh stab at it then I'm interested in helping with review and testing. First, I think for the record we need someone to lay out the proposed benefits clearly since the JIRA for this PR, SPARK-1267, doesn't do that, and committers will want to weigh the proposed benefits against any ongoing maintenance cost they're going to be asked to bear.
This sounds good to me.
This, I believe, will be the toughest part of getting a feature like this in: Committer buy-in. Packaging Spark for PyPI means committers have extra work to do for every release, and more things that could go wrong that they have to worry about. Any Python packaging proposal will have to add little to no committer overhead (like requiring them to update version strings in more places), and perhaps include some tests to guarantee that the packaging won't silently break. Next, we will have to figure out the details of who will own the PyPI account and coordinate with the ASF if they need to be in the picture. We will also likely need to reach out to the PyPI admins for a special limit increase on the size of the package we will be allowed to upload, or instrument some machinery to get the pip installation to automatically download large artifacts from somewhere else. As for who the relevant committers might be, I think they would be @davies and @JoshRosen for Python, and @rxin and @srowen for packaging: Hey committers, are there any circumstances under which Python-specific packaging could become part of the regular Spark release process? If so, are there any prerequisites we haven't brought up here that you want to see met? I'm just trying to gauge whether this has a realistic chance of ever making it in, or whether we just don't want to do this.
Yes, Spark is supported on Windows. (Though now that you mention it, this isn't spelled out clearly anywhere in the official docs.)
I think it's a good thing to do if we enable "pip install spark" for local modes. As you said, minimizing overhead would be great.
Is there committer interest in seeing something like this move forward now that we are past 2.0?
cc @mateiz are you interested in seeing something like this move forward?
Something like this would be great IMO. A few questions though: [...]
BTW the other change now is that we don't make an assembly JAR by default anymore, though we could build one for this. We just need a build script for this that's solid, produces a release-policy-compliant artifact, and can be tested automatically (or else it will bit rot).
They have to deal with normal Python packaging semantics. Right now, not making it pip installable and importable actually makes it harder for us. We then rely on findspark to resolve the package (plus some amount of ritual to start the JVM...) In case you're wondering, yes I use Spark against a real live large cluster and so do users I support.
You can. However, it's easier to give access rights to each individual on PyPI.
Yes, you can GPG sign them.
Cool, good to know that there's another ASF project that does it. We should go for it then.
I've got a branch that has a solid first pass at making pyspark pip installable. A few questions are: [...]
I've got too much on my plate to see this to the finish line in the next few months, but I do want to see this happen. Is someone else willing to take it from here? If not, I'll come back to it in Dec/Jan.
I'd be happy to take it from where @jhlch is at. I've got some bandwidth available to work on additional PySpark stuff, and it seems like the interest on the committer side is here now, so I'd love to help make this happen :)
Yes, it would be great to get this done. Just make sure that we have a good way to test it. Can you also document how a user is supposed to switch to a different pyspark (if they do have Spark installed locally somewhere)?
@mateiz: When you say switch to a different PySpark, do you mean they have different versions in different virtual envs, or different traditionally downloaded PySparks, or something else?
Probably switching from the PySpark in PyPI to a version you installed locally by downloading Spark.
Ah, that makes more sense. Sort of the default way of going about this would make it so that pip installing PySpark puts pyspark on the user's path, but if they explicitly call a different ./bin/pyspark or ./bin/spark-submit, we prepend our own Python path before the existing ones so that it takes precedence over the pip-installed one. It seems like this is the behaviour most people would expect (and it's also probably the easiest to implement) :)
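To make the precedence idea concrete, a hedged illustration of how prepending a local Spark download's python directories would shadow a pip-installed pyspark (the directory layout and the py4j zip glob are assumptions, not the launcher's actual logic):

```python
import glob
import os
import sys

# Assumes SPARK_HOME points at a locally downloaded Spark distribution.
spark_home = os.environ["SPARK_HOME"]

# Prepend the distribution's python sources (and its bundled py4j) so they take
# precedence over any pip-installed pyspark that sits further down sys.path.
py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
sys.path[:0] = [os.path.join(spark_home, "python")] + py4j_zips

import pyspark  # now resolved from SPARK_HOME/python rather than site-packages
```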
Closes apache#14537. Closes apache#16181. Closes apache#8318. Closes apache#6848. Closes apache#7265. Closes apache#9543.
Adds a setup.py so that pyspark can be installed and packaged for pip. This allows for easier setup, and declaration of dependencies. Please see this discussion for more of the rationale behind this PR:
http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-on-PyPi-td12626.html
It is enforced at runtime that a valid SPARK_HOME must be set, and that the versions of pyspark and Spark match exactly.
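A minimal sketch of what that runtime enforcement might look like (the function name and messages are illustrative, and the Spark version is passed in rather than detected, so this is not the PR's actual code):

```python
import os


def check_environment(pyspark_version, spark_version, spark_home):
    # Illustrative enforcement: require a usable SPARK_HOME and an exact match
    # between the installed pyspark package and the Spark it will drive.
    if not spark_home or not os.path.isdir(spark_home):
        raise RuntimeError("SPARK_HOME must point at a valid Spark installation")
    if spark_version != pyspark_version:
        raise RuntimeError("pyspark %s does not match Spark %s under %s"
                           % (pyspark_version, spark_version, spark_home))
```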
To be used with pip, the package will need to be registered and uploaded to PyPI; see:
https://docs.python.org/2/distutils/packageindex.html
This code is based on a PR by @prabinb that I've updated due to renewed interest, see:
#464