Handle multiple Spark Sessions #40
base: master
Conversation
Codecov Report
@@ Coverage Diff @@
## master #40 +/- ##
===========================================
- Coverage 96.61% 70.78% -25.83%
===========================================
Files 3 4 +1
Lines 59 89 +30
Branches 5 10 +5
===========================================
+ Hits 57 63 +6
- Misses 2 26 +24
Thanks for the contribution! I hope to have a deeper look early next week.
mdboom left a comment
Thanks for this great submission. This has been a long-standing pain point.
In general, however, I'm concerned that this PR moves this from something that "just works" without any documentation or education required on the user's part, to something that we'll have to explain to folks. But there may be a way around that.
If you read the pyspark docs, it's clear that SparkSession and SparkContext are both singletons underneath, i.e. there will always be exactly one of them per kernel. So rather than having the magic to set the context, I think you can just get `SparkSession._instantiatedSession` and, if it's not None, use that to get the URL. It's not great to use a private API, but maybe we can convince upstream to add a `.get()` function for us (that would be distinct from `getOrCreate()`).
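A minimal sketch of that approach (the helper name is hypothetical, and `_instantiatedSession` is a private pyspark attribute that may change between releases):

```python
from pyspark.sql import SparkSession

def get_spark_ui_url():
    """Return the UI URL of the already-created SparkSession, if any."""
    # TODO: private API -- replace if pyspark ever grows a public .get()
    session = SparkSession._instantiatedSession
    if session is not None:
        return session.sparkContext.uiWebUrl
    return None
```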
examples/Jupyter Spark example.ipynb
Outdated
| "spark = SparkSession \\\n", | ||
| " .builder \\\n", | ||
| " .appName(\"PythonPi\") \\\n", | ||
| " .master(\"yarn\") \\\n", |
This causes my spark process to crash with:
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
Is it required?
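For reference, a variant that runs without a Hadoop/YARN environment would use a local master instead (a sketch, not the notebook's actual content):

```python
from pyspark.sql import SparkSession

# local[*] runs Spark in-process, so neither HADOOP_CONF_DIR
# nor YARN_CONF_DIR needs to be set in the environment
spark = SparkSession \
    .builder \
    .appName("PythonPi") \
    .master("local[*]") \
    .getOrCreate()
```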
README.md
Outdated
To change the URL of the Spark API that the job metadata is fetched from
override the `Spark.url` config value, e.g. on the command line:
The Spark API that the job metadata is fetched from can be different for each SparkContext. By default, the first Spark context uses port 4040, the second 4041, and so on. If however `spark.ui.port` is set to 0 in SparkConf, Spark will choose a random ephemeral port for the API. In order to support this behaviour (and allow more than one tab in Jupyter with a SparkContext), copy the Jupyter notebook magic `spark_progress.py` from the `src/magic` folder into your ipython profile's startup folder, e.g. for the default profile this is `~/.ipython/profile_default/startup/`.
We shouldn't require users to copy something into their personal configuration to make this work. (There are many contexts in which Jupyter is run where the user doesn't even have access to that.) Instead, we should install the magic along with the package, with a `load_ipython_extension` hook in `__init__.py`, and then the user would add `%load_ext jupyter_spark` to their notebook to load the extension.
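A minimal sketch of such a hook (the magics class and module names are assumptions, not the actual implementation):

```python
# jupyter_spark/__init__.py
def load_ipython_extension(ipython):
    """Called by IPython when the user runs %load_ext jupyter_spark."""
    from .magic import SparkMagics  # hypothetical module holding the magic
    ipython.register_magics(SparkMagics)
```

With that in place, `%load_ext jupyter_spark` in any notebook registers the magic, with no files copied into the user's profile.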
src/jupyter_spark/spark.py
Outdated
request_path = request.uri[len(self.proxy_url):]
return url_path_join(self.url, request_path)

def backend_url(self, url, path):
    request_path = path[len(self.proxy_root):]
Should use `proxy_url` here so it includes the `baseUrl` of IPython, if set.
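A sketch of the suggested change (assuming `proxy_url` already incorporates IPython's base URL, which is what the comment implies):

```python
def backend_url(self, url, path):
    # Strip the full proxy prefix -- including IPython's baseUrl, if set --
    # so only the Spark-API-relative path remains.
    request_path = path[len(self.proxy_url):]
    return url_path_join(url, request_path)
```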
tests/test_spark.py
Outdated
])

def test_http_fetch_error(self):
def test_http_fetch_error_url_mssing(self):
typo: mssing -> missing
I have changed the code accordingly. I personally don't really like using internal APIs; however, I understand your rationale. I marked it with a TODO.
A side note: if you work on a Hadoop cluster (as I do, hence the yarn stuff last time), polling `uiWebUrl` means hitting the Resource Manager twice a second. If many users do this at the same time, this might create quite some traffic. Maybe a less chatty approach would be to use …
Thanks. I'm sorry -- I think I wasn't clear earlier. If you grab the Spark context from the singleton, then the magic is completely optional in the common case. You would only need to use the magic if you explicitly want to set the URL. Would you mind updating this so the magic is optional (and users can continue working as they have been, unless this additional complexity is needed for them)?
@bernhard-42: Hope I didn't scare you off by creating confusion. Your contribution is very much appreciated.
No worries -- first I didn't have time and then I forgot about it...
@mdboom Any news regarding this?
Is there any update on these changes getting merged into the main project? This functionality would be very useful, and the lack of it is a major blocker to using this extension.
Proposal for Issue 22:
In the Jupyter notebook a Jupyter Comm target gets opened to listen for messages from the Python kernel. A new Jupyter magic uses this Comm target to forward the Spark API URL to the notebook:

`%spark_progress spark`

where `spark` is the variable holding the Spark Session, so the magic can use `globals()["spark"].sparkContext.uiWebUrl` to get the actual Spark API URL. Each call from the JavaScript notebook then forwards the Spark API URL as a query parameter `spark_url` to the backend handler, which uses it to create the backend URL. This allows for multiple SparkContexts in different tabs and even works with the `spark.ui.port=0` setting.
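A minimal sketch of what such a magic could look like (the Comm target name and message shape are assumptions; only `%spark_progress spark` and `uiWebUrl` come from the proposal above):

```python
from ipykernel.comm import Comm
from IPython import get_ipython
from IPython.core.magic import register_line_magic

@register_line_magic
def spark_progress(line):
    """Forward the Spark API URL of the named SparkSession to the frontend."""
    # Look up the session variable (e.g. "spark") in the user namespace.
    session = get_ipython().user_ns[line.strip()]
    spark_url = session.sparkContext.uiWebUrl  # the actual Spark API URL
    # "spark_progress" is an assumed Comm target registered by the
    # notebook-side JavaScript, which then appends spark_url as the
    # ?spark_url= query parameter on each request to the backend handler.
    Comm(target_name="spark_progress", data={"spark_url": spark_url})
```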