
Conversation

@alope107

Adds a setup.py so that pyspark can be installed and packaged for pip. This allows for easier setup and declaration of dependencies. Please see this discussion for more of the rationale behind this PR:
http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-on-PyPi-td12626.html

At runtime it is enforced that a valid SPARK_HOME is set and that the versions of pyspark and Spark match exactly.
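For illustration only, the kind of runtime guard being described might look roughly like this (the function and message wording are placeholders, not the exact code in this PR):

import os

def ensure_matching_spark(pyspark_version, spark_version):
    # A valid SPARK_HOME must be set and point at a real directory...
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home is None or not os.path.isdir(spark_home):
        raise RuntimeError("SPARK_HOME must be set to a valid Spark installation")
    # ...and the pyspark and Spark versions must match exactly (no version ranges).
    if pyspark_version != spark_version:
        raise RuntimeError("pyspark %s cannot be used with Spark %s"
                           % (pyspark_version, spark_version))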

To be used with pip, the package will need to be registered and uploaded to PyPI; see:
https://docs.python.org/2/distutils/packageindex.html

This code is based on a PR by @prabinb that I've updated due to renewed interest, see:
#464

@AmplabJenkins

Can one of the admins verify this patch?

Contributor

Is there a way to source this from some existing place? That way we don't have to update the version string in multiple places. I forget where, but there should already be a central place where the version is set.

Author

I'm not seeing any version that's specific to pyspark, only a version for spark as a whole. I agree that we don't want to set a version in multiple places, but I think the one I introduced is the only version unique to pyspark.

An alternative, but trickier, idea would be to make the version in mvn's pom.xml the authoritative one, and have the build process add or modify that file to match it (maybe using mvn resource filtering?). This would break being able to just "pip install -e python" in development mode, since people would have to remember to run the mvn command to sync the file over, but at least there would be no risk of the versions going out of sync in the build.
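As a purely illustrative sketch (the file name pyspark/version.py and its contents are assumptions, not anything in this PR), setup.py could read whatever version file the build writes instead of hard-coding a string:

# setup.py fragment: pick up the version the Maven build wrote, rather than
# repeating it here. Assumes the build produced pyspark/version.py containing
# a line such as: __version__ = "1.4.1"
import os
from setuptools import setup

version_ns = {}
with open(os.path.join("pyspark", "version.py")) as f:
    exec(f.read(), version_ns)

setup(
    name="pyspark",
    version=version_ns["__version__"],
    packages=["pyspark"],
)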

Author

I'm not sure I entirely follow. Are you suggesting that when Spark is built, Maven creates this pyspark_version file as a part of the build process? If so, how does this affect a user who installs from PyPI?

We still need to build an sdist and wheel, so we can just make sure that whatever process we use adds that file in. Not sure if it's really worth the complexity at this moment, but my team does something internally such that our Python and Java code both get semantic versions based off the latest tag and the git hash.

Contributor

I think it's error-prone to have multiple copies of the version in different places; if someone forgets to update one of them, PySpark will break (even within the repo).

I'd vote for generating the version while generating the PyPI package. If PySpark comes bundled with Spark, we don't need this check (or at least it shouldn't fail or slow things down).

Author

So we remove the version checks entirely in the bundled version, and include them for the package uploaded to PyPI? I agree that this reduces the chance for maintainer error, but I'm worried about users upgrading versions of Spark. A user could install a bundled version of pyspark, and then later point their SPARK_HOME at a newer version of Spark. There would then be a version mismatch that wouldn't be detected.
Maybe a middle ground could be to include the version checks in both the bundled and pip installations, and to add a check during PyPI package generation that the version has been properly set.
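That generation-time check could be something as small as the following sketch (the placeholder convention and regex are assumptions):

import re

def check_release_version(version):
    # Refuse to build a PyPI package unless the version is a concrete X.Y.Z
    # release string (i.e. not empty and not an unsubstituted placeholder).
    if not re.match(r"^\d+\.\d+\.\d+$", version or ""):
        raise SystemExit("pyspark version %r has not been set properly; "
                         "refusing to build the PyPI package" % version)

check_release_version("1.4.1")                # passes
# check_release_version("${pyspark.version}") # would abort the build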

How is the version number specified for the scala side now?

Author

I'm not sure. Could someone with more experience with that side of the project chime in?

I am in favor of pyspark packaging the corresponding version of Spark. As a user experience, this is cleaner, requires fewer steps, and is more natural/in line with other pip-installable libraries. I have experience packaging jars with Python libraries in platform-independent ways and would be happy to help if wanted.

@alope107
Author

@justinuang and @nchammas thanks for the feedback; I've made the suggested changes.

python/setup.py Outdated
Contributor

This is maybe asking for too much, but in Sparkling Pandas we install our own assembly jar*; would it maybe make sense to do that as part of this process?

(*and getting it working has been painful, but doable).

Author

I'm not familiar with assembly jars, so please correct me if I'm wrong, but I think that we shouldn't need one for pyspark as it is entirely python code. Wouldn't we only need an assembly jar if we were also looking to package scala or java code?

Contributor

So by assembly JAR in this case I'd be referring to the Spark assembly jar (which we would want to package as an artifact along with the submit scripts if we wanted to put this on PyPI, but that might not be an immediate goal).

Author

So if SPARK_HOME were set, it would use that Spark installation, and default to the packaged JAR otherwise? Depending on the size of the assembly JAR, I'd be in favor of this, as it makes installation very easy for those who only want to interact with Spark through pyspark, but the discussion on the mailing list seemed to intentionally shy away from too large a PyPI package. I'll bring up your suggestion to see if there's wider support, and I encourage you to join the discussion here: http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-on-PyPi-td12626.html

Author

As was discussed on the list, I think it makes sense to hold off on the jar at first. It's definitely worth revisiting down the line though.

@justinuang

@holdenk , thanks for working on this! Do we have plans to set up PyPI publishing?

What about starting to add some of the logic from findspark to autodetect SPARK_HOME?

+1

Author

Currently, the only autodetection logic that findspark has is to check where Homebrew installs Spark on OS X. I think that for pyspark this is overly specific and brittle, and it can lead to confusion if the user wants to use pyspark with a different version than the one installed by Homebrew. Having the user set SPARK_HOME themselves makes the process unambiguous.
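For context, the findspark workflow being discussed looks roughly like this (the path is illustrative; if no argument is given, findspark falls back to SPARK_HOME or its built-in guesses such as the Homebrew location):

import findspark

# Make pyspark importable from a given Spark install (or rely on SPARK_HOME).
findspark.init("/path/to/spark-1.4.1-bin-hadoop2.6")

import pyspark
sc = pyspark.SparkContext(master="local[2]", appName="findspark-demo")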

I agree with @alope107. In addition, if people are using spark-submit, then this isn't necessary, right? spark-submit sets up SPARK_HOME automatically.

Are people launching python apps frequently without using spark-submit?

Author

One of the common use cases for pyspark without spark-submit is running from a notebook environment. I think there is a decent number of people that do this, and that more will once it's easier to do so.

That's possible via the following:

PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook'  spark-1.4.0-bin-hadoop2.4/bin/pyspark

Not completely discoverable, but it works =)

I don't really understand this part of the code, so it would be nice to get some core devs to chime in, but it looks like

./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

does a lot of logic, especially when deploying against YARN, that seems important.

Contributor

Just to clarify, it's been fairly easy to get the pyspark launcher to launch via a notebook, either by doing what @justinuang said or just by setting IPYTHON=1 (which basically does the same thing).

But there are scenarios where it's useful to forgo the pyspark launcher entirely, i.e. launch with ipython and then do all the Spark-related stuff. One key use case is containerized notebook deployments (like tmpnb) where we want a way to launch/deploy notebooks in a generic way (e.g. with ipython), but still let someone import and launch a SparkContext after the fact.

This PR is a great step; we could get closer to that goal by adding more autodetection / path-setting logic (as pointed out by @Carreau). But I agree with @alope107 that it might be too brittle, and it would definitely be some work to support all the config / deployment modes that SparkSubmit handles now (which themselves change across releases). I suspect that's why the core devs have tried to force the language APIs to go through a common launcher, but it would be good to get more input.
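For reference, the bootstrap such a bare ipython/notebook session needs today is roughly the following (paths and the py4j version are illustrative and change between releases):

import os
import sys

# Wire up pyspark inside a plain IPython/notebook process, without bin/pyspark.
spark_home = os.environ["SPARK_HOME"]  # still has to be set by the deployment
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"))

from pyspark import SparkContext
sc = SparkContext(master="local[*]", appName="notebook")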

Contributor

if os.environ.get("SPARK_HOME") is None:

@rgbkrk
Contributor

rgbkrk commented Aug 20, 2015

👍 very excited about this

I'm assuming for deployments one will pin both the pyspark version from PyPI as well as the Spark version they're using?

@alope107
Author

@rgbkrk Yes, as the pyspark and Spark versions must match each other exactly, it makes sense for deployments to pin both.
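For example, a downstream project might pin both sides like this (the project name and versions are purely illustrative):

# setup.py of a hypothetical downstream project: the pyspark pin must equal the
# Spark version deployed on the cluster, since the two are required to match exactly.
from setuptools import setup

setup(
    name="my-spark-job",
    version="0.1.0",
    install_requires=["pyspark==1.4.1"],
)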

@justinuang

@davies, what is this blocking on?

@davies
Contributor

davies commented Sep 16, 2015

@justinuang We can work on this now, will review it this week.

@justinuang

Thanks! Sorry for being demanding, was just hoping to get this into 1.6.0!

Contributor

There is no pom file inside the released bin package; I think we should look for another way to find out the Spark version.

Author

Good catch; I hadn't noticed that. The only other standard way I'm seeing to get the version is by instantiating a Java Spark Context and querying its version. Is this acceptable, or is there a more lightweight solution?

Contributor

There should be a version number in the path of the assembly jar; could we use that?
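Something along these lines, for instance (the lib/ layout and jar naming follow the released bin packages of that era; the regex is an assumption):

import glob
import os
import re

def spark_version_from_assembly(spark_home):
    # Infer the Spark version from the assembly jar name, e.g.
    # lib/spark-assembly-1.4.1-hadoop2.6.0.jar -> "1.4.1".
    jars = glob.glob(os.path.join(spark_home, "lib", "spark-assembly-*.jar"))
    if not jars:
        raise RuntimeError("no spark-assembly jar found under %s/lib" % spark_home)
    match = re.search(r"spark-assembly-(\d+\.\d+\.\d+)", os.path.basename(jars[0]))
    if match is None:
        raise RuntimeError("could not parse a Spark version from %s" % jars[0])
    return match.group(1)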

@rgbkrk
Contributor

rgbkrk commented Oct 8, 2015

Looks like @alope107 added fixes for the last few reviews. How does this stand now?

@alope107
Author

Added a check for the version number in the assembly jar if pom.xml is not present. @davies, is this what you had in mind?

@mnazbro
Contributor

mnazbro commented Oct 28, 2015

@davies Could we get a review on this? I'd like to know what is still blocking it.

python/setup.py Outdated
Contributor

So this py4j version is now out of date (we're up to 0.9 :))

@gracew

gracew commented Nov 10, 2015

@alope107 , could you update the py4j dependency? I would really like to see this merged as well =)

@alope107
Author

@holdenk @gracew
Thanks, I bumped the py4j version.

Contributor

Since we will always need this branch, can we remove the other one (and always find the version from the assembly jar)?

Contributor

@alope107 Can we go ahead and find the version only from the spark-assembly jar?

@alope107 , would you mind updating this PR to remove the pom_xml_file_path branch? Thanks!

@nchammas
Contributor

@davies - Does this PR have a realistic chance of making it in for Spark 2.0? If not, are we held up by implementation details or by a more fundamental problem with the idea of packaging PySpark?

FYI, someone earlier here mentioned using findspark as a workaround for not being able to pip install PySpark. I wonder if anyone's used it and how helpful it's been for their use cases.

@davies
Contributor

davies commented Apr 12, 2016

I don't have the bandwidth to work on this (and I don't think it's high priority); someone else could take it over and move it forward.

@nchammas
Contributor

No worries. I just wanted to make sure that the idea was sound since there were concerns early on about whether we should even try to package PySpark independently.

@hougs

hougs commented Apr 14, 2016

PySpark being pip installable would be useful to many users. I've packaged jars in pip-installable Python modules before, and will take a stab at it here. I'm going to take a different approach than the original author: it seems to me that relying on an already-installed Spark jar and validating version compatibility will be error-prone and will likely cause trouble for users. I'm intending to package the Spark jar as part of the module. This will mean that there is an order of operations for the deployment to PyPI to be successful: I am going to assume that mvn clean package/install has already happened in the build and that the requisite jar is in target/.

Making this change is going to create a new step in the build and publishing process. Who is responsible for that for Spark?

Is Spark supported/expected to work on Windows? I'm confident that the packaging I'm planning to do will work on Unix-like systems, but I have never tested it on Windows, and cross-platform compatibility for this packaging step can be tricky/not guaranteed.

@nchammas
Contributor

@jhlch - I think this will be a tough feature to get in, honestly, but if you want to take a fresh stab at it then I'm interested in helping with review and testing.

First, I think for the record we need someone to lay out the proposed benefits clearly, since the JIRA for this PR, SPARK-1267, doesn't do that; committers will want to weigh those benefits against any ongoing maintenance cost they're going to be asked to bear.

I'm intending to package the Spark jar as part of the module. This will mean that there is an order of operations for the deployment to PyPI to be successful.

This sounds good to me.

Making this change is going to create a new step in the build and publishing process. Who is responsible for that for Spark?

This, I believe, will be the toughest part of getting a feature like this in: Committer buy-in.

Packaging Spark for PyPI means committers have extra work to do for every release, and more things that could go wrong that they have to worry about. Any Python packaging proposal will have to add little to no committer overhead (for example, it should not require them to update version strings in more places), and should perhaps include some tests to guarantee that the packaging won't silently break.
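One possible shape for such a guard, sketched under the assumption of a Python 3 CI job (none of this exists in the repo today): build the sdist, install it into a throwaway virtualenv, and check that pyspark imports.

import os
import subprocess
import sys
import tempfile

def test_pyspark_pip_installable(python_dir="python"):
    # Build an sdist from python/, install it into a scratch venv, then import it.
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.check_call([sys.executable, "setup.py", "sdist", "--dist-dir", tmp],
                              cwd=python_dir)
        sdists = [os.path.join(tmp, f) for f in os.listdir(tmp) if f.endswith(".tar.gz")]
        assert sdists, "sdist was not produced"
        env_dir = os.path.join(tmp, "venv")
        subprocess.check_call([sys.executable, "-m", "venv", env_dir])
        env_python = os.path.join(env_dir, "bin", "python")
        subprocess.check_call([env_python, "-m", "pip", "install", sdists[0]])
        subprocess.check_call([env_python, "-c", "import pyspark"])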

Next, we will have to figure out the details of who will own the PyPI account and coordinate with the ASF if they need to be in the picture. We will also likely need to reach out to the PyPI admins for a special limit increase on the size of the package we will be allowed to upload, or instrument some machinery to get the pip installation to automatically download large artifacts from somewhere else.

As for who the relevant committers might be, I think they would be @davies and @JoshRosen for Python, and @rxin and @srowen for packaging: Hey committers, are there any circumstances under which Python-specific packaging could become part of the regular Spark release process? If so, are there any prerequisites we haven't brought up here that you want to see met? I'm just trying to gauge whether this has a realistic chance of ever making it in, or whether we just don't want to do this.

Is Spark supported/expected to work on Windows?

Yes, Spark is supported on Windows. (Though now that you mention it, this isn't spelled out clearly anywhere in the official docs.)

@rxin
Contributor

rxin commented Apr 15, 2016

I think it's a good thing to do if we enable "pip install spark" for local modes. As you said, minimizing overhead would be great.

@holdenk
Contributor

holdenk commented Aug 3, 2016

Is there committer interest in seeing something like this move forward now that we are past 2.0?

@holdenk
Contributor

holdenk commented Oct 7, 2016

cc @mateiz are you interested in seeing something like this move forward?

@mateiz
Contributor

mateiz commented Oct 7, 2016

Something like this would be great IMO. A few questions though:

  • How will it work if users want to run a different version of PySpark from a different version of Spark (maybe something they installed locally)? How can they easily swap that out? We don't want this making it harder to use Spark against a real cluster because the version you got from pip is wrong.
  • What are the mechanics of publishing to PyPI? Can we make an account that's shared by all the committers somehow? Can we sign releases? Note that there is a release policy at the ASF that we need to make sure this follows. In particular, does anyone have examples of other ASF projects that publish to PyPI?
  • What features will and won't work out of the box in the current implementation -- e.g. can you use it to access existing Hadoop clusters or S3, or is it just for local mode?
  • How do we automatically test this?

@mateiz
Contributor

mateiz commented Oct 7, 2016

BTW the other change now is that we don't make an assembly JAR by default anymore, though we could build one for this. We just need a build script for this that's solid, produces a release-policy-compliant artifact, and can be tested automatically (or else it will bit rot).

@rgbkrk
Contributor

rgbkrk commented Oct 8, 2016

How will it work if users want to run a different version of PySpark from a different version of Spark (maybe something they installed locally)? How can they easily swap that out? We don't want this making it harder to use Spark against a real cluster because the version you got from pip is wrong.

They have to deal with normal Python packaging semantics. Right now, not making it pip installable and importable actually makes it harder for us; we then rely on findspark to resolve the package (plus some amount of ritual to start the JVM...). In case you're wondering, yes, I use Spark against a real, live, large cluster, and so do the users I support.

Can we make an account that's shared by all the committers somehow?

You can. However, it's easier to give access rights to each individual on PyPI.

Can we sign releases?

Yes, you can GPG sign them.

In particular, does anyone have examples of other ASF projects that publish to PyPI?

libcloud

@mateiz
Contributor

mateiz commented Oct 9, 2016

Cool, good to know that there's another ASF project that does it. We should go for it then.

@hougs

hougs commented Oct 12, 2016

I've got a branch that has a solid first pass at making pyspark pip installable. A few open questions:

  • How does this integrate with the typical build? Once the jar is built it needs to be put in a location pointed to by setup.py and MANIFEST.in (see the sketch after this list).
  • What version requirements are there for numpy and pandas? I'm not confident that the ones I list are correct or as specific as they could be.
  • Set up automated testing:
    • run-tests and run-tests.py should use environments where pyspark has been pip installed, and the 'find jars' logic they currently use should be removed.
    • testpypi exists and could be useful in CI to make sure packaging and distribution never break. CI Python envs could be initialized using pip install --extra-index-url https://testpypi.python.org/pypi pyspark
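A rough sketch of the packaging side of the first bullet (directory and file names are assumptions; the point is only that setup.py has to know where the built jar ends up):

# setup.py fragment: ship the pre-built Spark assembly jar inside the sdist/wheel.
# Assumes the build has already copied it into pyspark/jars/ (which also needs an
# __init__.py so setuptools treats it as a package). MANIFEST.in would add:
#   recursive-include pyspark/jars *.jar
from setuptools import setup

setup(
    name="pyspark",
    version="2.0.0",                          # illustrative
    packages=["pyspark", "pyspark.jars"],
    package_data={"pyspark.jars": ["*.jar"]},
    include_package_data=True,
    install_requires=["py4j==0.10.3"],        # illustrative pin
)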

I've got too much on my plate to see this to the finish line in the next few months, but I do want to see this happen. Is someone else willing to take it from here? If not, I'll come back to it in Dec/Jan.

@holdenk
Contributor

holdenk commented Oct 12, 2016

I'd be happy to take it from where @jhlch is at - I've got some bandwidth available to work on additional PySpark stuff, and it seems like the interest on the committer side is here now, so I'd love to help make this happen :)

@mateiz
Contributor

mateiz commented Oct 17, 2016

Yes, it would be great to get this done. Just make sure that we have a good way to test it. Can you also document how a user is supposed to switch to a different pyspark (if they do have Spark installed locally somewhere)?

@holdenk
Contributor

holdenk commented Oct 18, 2016

@mateiz: When you say switch to a different PySpark, do you mean they have different versions in different virtualenvs, different traditionally downloaded PySparks, or something else?

@mateiz
Contributor

mateiz commented Oct 18, 2016

Probably switching from the PySpark in PyPI to a version you installed locally by downloading Spark.

@holdenk
Contributor

holdenk commented Oct 19, 2016

Ah, that makes more sense. The default way of going about this would be that pip installing PySpark puts pyspark on the user's path, but if they explicitly call a different ./bin/pyspark or ./bin/spark-submit, we prepend our own Python path ahead of the existing ones so that it takes precedence over the pip-installed one. That seems like the behaviour most people would expect (and it's also probably the easiest to implement) :)
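Roughly the precedence idea, sketched in Python (in practice this would live in the bin/pyspark and bin/spark-submit shell scripts; the helper name is made up):

import os

def prepend_spark_python(spark_home):
    # Put the chosen Spark distribution's python/ directory ahead of anything
    # already on PYTHONPATH, so an explicitly invoked ./bin/pyspark or
    # ./bin/spark-submit wins over a pip-installed pyspark.
    entries = [os.path.join(spark_home, "python")]
    existing = os.environ.get("PYTHONPATH")
    if existing:
        entries.append(existing)
    os.environ["PYTHONPATH"] = os.pathsep.join(entries)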

@holdenk
Contributor

holdenk commented Nov 1, 2016

For those following along, I've made a pip-installable PR over at #15659 and I'd appreciate review from a committer (cc @mateiz / @davies). That version also copies the JARs so it can be self-contained.

@holdenk
Contributor

holdenk commented Nov 29, 2016

Since #15659 got merged, would you be ok with closing this @alope107? Thanks for your work on this :)

@asfgit asfgit closed this in 08d6441 Dec 7, 2016
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025