
Conversation

@alope107

Adds a setup.py so that pyspark can be installed and packaged for pip. This allows for easier setup and declaration of dependencies. Please see this discussion for more of the rationale behind this PR:
http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-on-PyPi-td12626.html

At runtime it is enforced that a valid SPARK_HOME is set and that the versions of pyspark and Spark match exactly.
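For illustration only, the kind of runtime guard being described might look roughly like this (the function and message wording are placeholders, not the exact code in this PR):

import os

def ensure_matching_spark(pyspark_version, spark_version):
    # A valid SPARK_HOME must be set and point at a real directory...
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home is None or not os.path.isdir(spark_home):
        raise RuntimeError("SPARK_HOME must be set to a valid Spark installation")
    # ...and the pyspark and Spark versions must match exactly (no version ranges).
    if pyspark_version != spark_version:
        raise RuntimeError("pyspark %s cannot be used with Spark %s"
                           % (pyspark_version, spark_version))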

To be used with pip, the package will need to be registered and uploaded to PyPI; see:
https://docs.python.org/2/distutils/packageindex.html

This code is based on a PR by @prabinb that I've updated due to renewed interest, see:
#464

@AmplabJenkins

Can one of the admins verify this patch?

Contributor

Is there a way to source this from some existing place? That way we don't have to update the version string in multiple places. I forget where, but there should already be a central place where the version is set.

Author

I'm not seeing any version that's specific to pyspark, only a version for spark as a whole. I agree that we don't want to set a version in multiple places, but I think the one I introduced is the only version unique to pyspark.

An alternative, but trickier, idea would be to make the version in mvn's pom.xml the authoritative one, and have the build process add or modify that file to match it (maybe using mvn resource filtering?). This would break being able to just "pip install -e python" in development mode, since people would have to remember to run the mvn command to sync the file over, but at least there would be no risk of the versions going out of sync in the build.
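As a purely illustrative sketch (the file name pyspark/version.py and its contents are assumptions, not anything in this PR), setup.py could read whatever version file the build writes instead of hard-coding a string:

# setup.py fragment: pick up the version the Maven build wrote, rather than
# repeating it here. Assumes the build produced pyspark/version.py containing
# a line such as: __version__ = "1.4.1"
import os
from setuptools import setup

version_ns = {}
with open(os.path.join("pyspark", "version.py")) as f:
    exec(f.read(), version_ns)

setup(
    name="pyspark",
    version=version_ns["__version__"],
    packages=["pyspark"],
)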

Author

I'm not sure I entirely follow. Are you suggesting that when Spark is built, Maven creates this pyspark_version file as a part of the build process? If so, how does this affect a user who installs from PyPI?

We still need to build an sdist and wheel, so we can just make sure that whatever process we use adds that file in. Not sure if it's really worth the complexity at this moment, but my team does something internally such that our Python and Java code both get semantic versions based off the latest tag and the git hash.

Contributor

I think it's error-prone to have multiple copies of the version in different places; if someone forgets to update one of them, PySpark will break (even within the repo).

I'd vote for generating the version while generating the PyPI package. If PySpark comes bundled with Spark, we don't need this check (or at least it shouldn't fail or slow things down).

Author

So we remove the version checks entirely in the bundled version, and include them for the package uploaded to PyPI? I agree that this reduces the chance for maintainer error, but I'm worried about users upgrading versions of Spark. A user could install a bundled version of pyspark, and then later point their SPARK_HOME at a newer version of Spark. There would then be a version mismatch that wouldn't be detected.
Maybe a middle ground could be to include the version checks in both the bundled and pip installations, and to add a check during PyPI package generation that the version has been properly set.
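That generation-time check could be something as small as the following sketch (the placeholder convention and regex are assumptions):

import re

def check_release_version(version):
    # Refuse to build a PyPI package unless the version is a concrete X.Y.Z
    # release string (i.e. not empty and not an unsubstituted placeholder).
    if not re.match(r"^\d+\.\d+\.\d+$", version or ""):
        raise SystemExit("pyspark version %r has not been set properly; "
                         "refusing to build the PyPI package" % version)

check_release_version("1.4.1")                # passes
# check_release_version("${pyspark.version}") # would abort the build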

How is the version number specified for the scala side now?

Author

I'm not sure. Could someone with more experience with that side of the project chime in?

I am in favor of pyspark packaging the corresponding version of Spark. As a user experience, this is cleaner, requires fewer steps, and is more natural/in line with other pip-installable libraries. I have experience packaging jars with Python libraries in platform-independent ways and would be happy to help if wanted.

@alope107
Author

@justinuang and @nchammas thanks for the feedback; I've made the suggested changes.

python/setup.py Outdated
Contributor

This is maybe asking for too much, but in Sparkling Pandas we install our own assembly jar*; would it maybe make sense to do that as part of this process?

(*and getting it working has been painful, but doable).

Author

I'm not familiar with assembly jars, so please correct me if I'm wrong, but I think that we shouldn't need one for pyspark as it is entirely python code. Wouldn't we only need an assembly jar if we were also looking to package scala or java code?

Contributor

So by assembly JAR in this case I'd be referring to the Spark assembly jar (which we would want to package as an artifact along with the submit scripts if we wanted to put this on PyPI, but that might not be an immediate goal).

Author

So if SPARK_HOME were set, it would use that Spark installation, and default to the packaged JAR otherwise? Depending on the size of the assembly JAR, I'd be in favor of this, as it makes installation very easy for those who only want to interact with Spark through pyspark, but the discussion on the mailing list seemed to intentionally shy away from too large a PyPI package. I'll bring up your suggestion to see if there's wider support, and I encourage you to join the discussion here: http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-on-PyPi-td12626.html

Author

As was discussed on the list, I think it makes sense to hold off on the jar at first. It's definitely worth revisiting down the line though.

@justinuang

@holdenk , thanks for working on this! Do we have plans to set up PyPI publishing?

What about starting to add some of the logic from findspark to autodetect SPARK_HOME?

+1

Author

Currently, the only autodetection logic that findspark has is to check where Homebrew installs Spark on OS X. I think that for pyspark this is overly specific and brittle, and it can lead to confusion if the user wants to use pyspark with a different version than the one installed by Homebrew. Having the user set SPARK_HOME themselves makes the process unambiguous.
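For context, the findspark workflow being discussed looks roughly like this (the path is illustrative; if no argument is given, findspark falls back to SPARK_HOME or its built-in guesses such as the Homebrew location):

import findspark

# Make pyspark importable from a given Spark install (or rely on SPARK_HOME).
findspark.init("/path/to/spark-1.4.1-bin-hadoop2.6")

import pyspark
sc = pyspark.SparkContext(master="local[2]", appName="findspark-demo")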

I agree with @alope107. In addition, if people are using spark-submit, then this isn't necessary, right? spark-submit sets up SPARK_HOME automatically.

Are people launching python apps frequently without using spark-submit?

Author

One of the common use cases for pyspark without spark-submit is running from a notebook environment. I think there is a decent number of people that do this, and that more will once it's easier to do so.

That's possible via the following:

PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook'  spark-1.4.0-bin-hadoop2.4/bin/pyspark

Not completely discoverable, but it works =)

I don't really understand this part of the code, so it would be nice to get some core devs to chime in, but it looks like

./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

does a lot of logic, especially when deploying against YARN, that seems important.

Contributor

Just to clarify, it's been fairly easy to get the pyspark launcher to launch via a notebook, either by doing what @justinuang said or just by setting IPYTHON=1 (which basically does the same thing).

But there are scenarios where it's useful to forgo the pyspark launcher entirely, i.e. launch with ipython and then do all the Spark-related stuff. One key use case is containerized notebook deployments (like tmpnb) where we want a way to launch/deploy notebooks in a generic way (e.g. with ipython), but still let someone import and launch a SparkContext after the fact.

This PR is a great step; we could get closer to that goal by adding more autodetection / path-setting logic (as pointed out by @Carreau). But I agree with @alope107 that it might be too brittle, and it would definitely be some work to support all the config / deployment modes that SparkSubmit handles now (which themselves change across releases). I suspect that's why the core devs have tried to force the language APIs to go through a common launcher, but it would be good to get more input.
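For reference, the bootstrap such a bare ipython/notebook session needs today is roughly the following (paths and the py4j version are illustrative and change between releases):

import os
import sys

# Wire up pyspark inside a plain IPython/notebook process, without bin/pyspark.
spark_home = os.environ["SPARK_HOME"]  # still has to be set by the deployment
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"))

from pyspark import SparkContext
sc = SparkContext(master="local[*]", appName="notebook")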

Contributor

if os.environ.get("SPARK_HOME") is None:

@rgbkrk
Contributor

rgbkrk commented Aug 20, 2015

👍 very excited about this

I'm assuming for deployments one will pin both the pyspark version from PyPI as well as the Spark version they're using?

@alope107
Author

@rgbkrk Yes, as the pyspark and Spark versions must match each other exactly, it makes sense for deployments to pin both.
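For example, a downstream project might pin both sides like this (the project name and versions are purely illustrative):

# setup.py of a hypothetical downstream project: the pyspark pin must equal the
# Spark version deployed on the cluster, since the two are required to match exactly.
from setuptools import setup

setup(
    name="my-spark-job",
    version="0.1.0",
    install_requires=["pyspark==1.4.1"],
)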

@justinuang

@davies, what is this blocking on?

@davies
Contributor

davies commented Sep 16, 2015

@justinuang We can work on this now, will review it this week.

@justinuang

Thanks! Sorry for being demanding, was just hoping to get this into 1.6.0!

Contributor

There is no pom file inside the released bin package; I think we should look for another way to find out the Spark version.

Author

Good catch; I hadn't noticed that. The only other standard way I'm seeing to get the version is by instantiating a Java Spark Context and querying its version. Is this acceptable, or is there a more lightweight solution?

Contributor

There should be a version number in the path of the assembly jar; could we use that?
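Something along these lines, for instance (the lib/ layout and jar naming follow the released bin packages of that era; the regex is an assumption):

import glob
import os
import re

def spark_version_from_assembly(spark_home):
    # Infer the Spark version from the assembly jar name, e.g.
    # lib/spark-assembly-1.4.1-hadoop2.6.0.jar -> "1.4.1".
    jars = glob.glob(os.path.join(spark_home, "lib", "spark-assembly-*.jar"))
    if not jars:
        raise RuntimeError("no spark-assembly jar found under %s/lib" % spark_home)
    match = re.search(r"spark-assembly-(\d+\.\d+\.\d+)", os.path.basename(jars[0]))
    if match is None:
        raise RuntimeError("could not parse a Spark version from %s" % jars[0])
    return match.group(1)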

@rgbkrk
Contributor

rgbkrk commented Oct 8, 2015

Looks like @alope107 added fixes for the last few reviews. How does this stand now?

@alope107
Author

Added a check for the version number in the assembly jar if pom.xml is not present. @davies, is this what you had in mind?

@mnazbro
Contributor

mnazbro commented Oct 28, 2015

@davies Could we get a review on this? I'd like to know what is still blocking it.

python/setup.py Outdated
Contributor

So this py4j version is now out of date (we're up to 0.9 :))

@gracew

gracew commented Nov 10, 2015

@alope107 , could you update the py4j dependency? I would really like to see this merged as well =)

@alope107
Author

@holdenk @gracew
Thanks, I bumped the py4j version.

Contributor

Since we will always need this branch, can we remove the other one (and always find the version from the assembly jar)?

Contributor

@alope107 Can we go ahead and find the version only from the spark-assembly jar?

@alope107 , would you mind updating this PR to remove the pom_xml_file_path branch? Thanks!

@nchammas
Contributor

@davies - Does this PR have a realistic chance of making it in for Spark 2.0? If not, are we held up by implementation details or by a more fundamental problem with the idea of packaging PySpark?

FYI, someone earlier here mentioned using findspark as a workaround for not being able to pip install PySpark. I wonder if anyone's used it and how helpful it's been for their use cases.

@davies
Contributor

davies commented Apr 12, 2016

I don't have the bandwidth to work on this (and I don't think it's high priority); someone else could take it over and move it forward.

@nchammas
Contributor

No worries. I just wanted to make sure that the idea was sound since there were concerns early on about whether we should even try to package PySpark independently.

@hougs

hougs commented Apr 14, 2016

PySpark being pip installable would be useful to many users. I've packaged jars in pip-installable Python modules before, and will take a stab at it here. I'm going to take a different approach than the original author: it seems to me that relying on an already-installed Spark jar and validating version compatibility will be error-prone and will likely cause trouble for users. I'm intending to package the Spark jar as part of the module. This will mean that there is an order of operations for the deployment to PyPI to be successful: I am going to assume that mvn clean package/install has already happened in the build and that the requisite jar is in target/.

Making this change is going to create a new step in the build and publishing process. Who is responsible for that for Spark?

Is Spark supported/expected to work on Windows? I'm confident that the packaging I'm planning to do will work on Unix-like systems, but I have never tested it on Windows, and cross-platform compatibility for this packaging step can be tricky/not guaranteed.

@nchammas
Contributor

@jhlch - I think this will be a tough feature to get in, honestly, but if you want to take a fresh stab at it then I'm interested in helping with review and testing.

First, I think for the record we need someone to lay out the proposed benefits clearly, since the JIRA for this PR, SPARK-1267, doesn't do that; committers will want to weigh those benefits against any ongoing maintenance cost they're going to be asked to bear.

I'm intending to package the Spark jar as part of the module. This will mean that there is an order of operations for the deployment to PyPI to be successful.

This sounds good to me.

Making this change is going to create a new step in the build and publishing process. Who is responsible for that for Spark?

This, I believe, will be the toughest part of getting a feature like this in: Committer buy-in.

Packaging Spark for PyPI means committers have extra work to do for every release, and more things that could go wrong that they have to worry about. Any Python packaging proposal will have to add little to no committer overhead (for example, it should not require them to update version strings in more places), and should perhaps include some tests to guarantee that the packaging won't silently break.
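One possible shape for such a guard, sketched under the assumption of a Python 3 CI job (none of this exists in the repo today): build the sdist, install it into a throwaway virtualenv, and check that pyspark imports.

import os
import subprocess
import sys
import tempfile

def test_pyspark_pip_installable(python_dir="python"):
    # Build an sdist from python/, install it into a scratch venv, then import it.
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.check_call([sys.executable, "setup.py", "sdist", "--dist-dir", tmp],
                              cwd=python_dir)
        sdists = [os.path.join(tmp, f) for f in os.listdir(tmp) if f.endswith(".tar.gz")]
        assert sdists, "sdist was not produced"
        env_dir = os.path.join(tmp, "venv")
        subprocess.check_call([sys.executable, "-m", "venv", env_dir])
        env_python = os.path.join(env_dir, "bin", "python")
        subprocess.check_call([env_python, "-m", "pip", "install", sdists[0]])
        subprocess.check_call([env_python, "-c", "import pyspark"])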

Next, we will have to figure out the details of who will own the PyPI account and coordinate with the ASF if they need to be in the picture. We will also likely need to reach out to the PyPI admins for a special limit increase on the size of the package we will be allowed to upload, or instrument some machinery to get the pip installation to automatically download large artifacts from somewhere else.

As for who the relevant committers might be, I think they would be @davies and @JoshRosen for Python, and @rxin and @srowen for packaging: Hey committers, are there any circumstances under which Python-specific packaging could become part of the regular Spark release process? If so, are there any prerequisites we haven't brought up here that you want to see met? I'm just trying to gauge whether this has a realistic chance of ever making it in, or whether we just don't want to do this.

Is Spark supported/expected to work on Windows?

Yes, Spark is supported on Windows. (Though now that you mention it, this isn't spelled out clearly anywhere in the official docs.)

@rxin
Contributor

rxin commented Apr 15, 2016

I think it's a good thing to do if we enable "pip install spark" for local modes. As you said, minimizing overhead would be great.

@holdenk
Contributor

holdenk commented Aug 3, 2016

Is there committer interest in seeing something like this move forward now that we are past 2.0?

@holdenk
Contributor

holdenk commented Oct 7, 2016

cc @mateiz are you interested in seeing something like this move forward?

@mateiz
Contributor

mateiz commented Oct 7, 2016

Something like this would be great IMO. A few questions though:

  • How will it work if users want to run a different version of PySpark from a different version of Spark (maybe something they installed locally)? How can they easily swap that out? We don't want this making it harder to use Spark against a real cluster because the version you got from pip is wrong.
  • What are the mechanics of publishing to PyPI? Can we make an account that's shared by all the committers somehow? Can we sign releases? Note that there is a release policy at the ASF that we need to make sure this follows. In particular, does anyone have examples of other ASF projects that publish to PyPI?
  • What features will and won't work out of the box in the current implementation -- e.g. can you use it to access existing Hadoop clusters or S3, or is it just for local mode?
  • How do we automatically test this?

@mateiz
Contributor

mateiz commented Oct 7, 2016

BTW the other change now is that we don't make an assembly JAR by default anymore, though we could build one for this. We just need a build script for this that's solid, produces a release-policy-compliant artifact, and can be tested automatically (or else it will bit rot).

@rgbkrk
Contributor

rgbkrk commented Oct 8, 2016

How will it work if users want to run a different version of PySpark from a different version of Spark (maybe something they installed locally)? How can they easily swap that out? We don't want this making it harder to use Spark against a real cluster because the version you got from pip is wrong.

They have to deal with normal Python packaging semantics. Right now, not making it pip installable and importable actually makes it harder for us; we then rely on findspark to resolve the package (plus some amount of ritual to start the JVM...). In case you're wondering, yes, I use Spark against a real, live, large cluster, and so do the users I support.

Can we make an account that's shared by all the committers somehow?

You can. However, it's easier to give access rights to each individual on PyPI.

Can we sign releases?

Yes, you can GPG sign them.

In particular, does anyone have examples of other ASF projects that publish to PyPI?

libcloud

@mateiz
Contributor

mateiz commented Oct 9, 2016

Cool, good to know that there's another ASF project that does it. We should go for it then.

@hougs

hougs commented Oct 12, 2016

I've got a branch that has a solid first pass at making pyspark pip installable. A few open questions:

  • How does this integrate with the typical build? Once the jar is built it needs to be put in a location pointed to by setup.py and MANIFEST.in (see the sketch after this list).
  • What version requirements are there for numpy and pandas? I'm not confident that the ones I list are correct or as specific as they could be.
  • Set up automated testing:
    • run-tests and run-tests.py should use environments where pyspark has been pip installed, and the 'find jars' logic they currently use should be removed.
    • testpypi exists and could be useful in CI to make sure packaging and distribution never break. CI Python envs could be initialized using pip install --extra-index-url https://testpypi.python.org/pypi pyspark
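A rough sketch of the packaging side of the first bullet (directory and file names are assumptions; the point is only that setup.py has to know where the built jar ends up):

# setup.py fragment: ship the pre-built Spark assembly jar inside the sdist/wheel.
# Assumes the build has already copied it into pyspark/jars/ (which also needs an
# __init__.py so setuptools treats it as a package). MANIFEST.in would add:
#   recursive-include pyspark/jars *.jar
from setuptools import setup

setup(
    name="pyspark",
    version="2.0.0",                          # illustrative
    packages=["pyspark", "pyspark.jars"],
    package_data={"pyspark.jars": ["*.jar"]},
    include_package_data=True,
    install_requires=["py4j==0.10.3"],        # illustrative pin
)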

I've got too much on my plate to see this to the finish line in the next few months, but I do want to see this happen. Is someone else willing to take it from here? If not, I'll come back to it in Dec/Jan.

@holdenk
Contributor

holdenk commented Oct 12, 2016

I'd be happy to take it from where @jhlch is at - I've got some bandwidth available to work on additional PySpark stuff, and it seems like the interest on the committer side is here now, so I'd love to help make this happen :)

@mateiz
Contributor

mateiz commented Oct 17, 2016

Yes, it would be great to get this done. Just make sure that we have a good way to test it. Can you also document how a user is supposed to switch to a different pyspark (if they do have Spark installed locally somewhere)?

@holdenk
Contributor

holdenk commented Oct 18, 2016

@mateiz: When you say switch to a different PySpark, do you mean they have different versions in different virtualenvs, different traditionally downloaded PySparks, or something else?

@mateiz
Contributor

mateiz commented Oct 18, 2016

Probably switching from the PySpark in PyPI to a version you installed locally by downloading Spark.

@holdenk
Contributor

holdenk commented Oct 19, 2016

Ah, that makes more sense. The default way of going about this would be that pip installing PySpark puts pyspark on the user's path, but if they explicitly call a different ./bin/pyspark or ./bin/spark-submit, we prepend our own Python path ahead of the existing ones so that it takes precedence over the pip-installed one. That seems like the behaviour most people would expect (and it's also probably the easiest to implement) :)
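Roughly the precedence idea, sketched in Python (in practice this would live in the bin/pyspark and bin/spark-submit shell scripts; the helper name is made up):

import os

def prepend_spark_python(spark_home):
    # Put the chosen Spark distribution's python/ directory ahead of anything
    # already on PYTHONPATH, so an explicitly invoked ./bin/pyspark or
    # ./bin/spark-submit wins over a pip-installed pyspark.
    entries = [os.path.join(spark_home, "python")]
    existing = os.environ.get("PYTHONPATH")
    if existing:
        entries.append(existing)
    os.environ["PYTHONPATH"] = os.pathsep.join(entries)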

@holdenk
Contributor

holdenk commented Nov 1, 2016

For those following along, I've made a pip-installable PR over at #15659 and I'd appreciate review from a committer (cc @mateiz / @davies). That version also copies the JARs so it can be self-contained.

@holdenk
Contributor

holdenk commented Nov 29, 2016

Since #15659 got merged, would you be ok with closing this @alope107? Thanks for your work on this :)

@asfgit asfgit closed this in 08d6441 Dec 7, 2016
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025