This repository was archived by the owner on Feb 4, 2021. It is now read-only.

Conversation

@bernhard-42

Proposal for Issue 22:

In the Jupyter notebook, a Jupyter Comm target is opened to listen for messages from the Python kernel. A new Jupyter magic uses this comm target to forward the Spark API URL to the notebook:

%spark_progress spark

where spark is the variable holding the Spark session, so the magic can use globals()["spark"].sparkContext.uiWebUrl to get the actual Spark API URL.

Each call from the notebook's JavaScript then forwards the Spark API URL as a query parameter spark_url to the backend handler, which uses it to build the backend_url.

This allows for multiple SparkContexts in different tabs and even for the spark.ui.port=0 setting.
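
For illustration, the magic described above could look roughly like this - a minimal sketch with illustrative names (the actual implementation in this PR lives in src/jupyter_spark/magic.py, and the "spark_progress" comm target is registered on the JavaScript side):

from ipykernel.comm import Comm
from IPython.core.magic import Magics, magics_class, line_magic

@magics_class
class SparkProgressMagics(Magics):
    @line_magic
    def spark_progress(self, line):
        """Usage: %spark_progress spark  (the variable holding the SparkSession)."""
        session = self.shell.user_ns[line.strip()]
        spark_url = session.sparkContext.uiWebUrl   # the actual Spark API URL
        # Open a Comm to the "spark_progress" target registered by the notebook's
        # JavaScript and hand over the URL; the frontend then attaches it as a
        # ?spark_url=... query parameter on every request to the proxy handler.
        Comm(target_name="spark_progress", data={"spark_url": spark_url})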

@codecov-io

codecov-io commented May 1, 2018

Codecov Report

Merging #40 into master will decrease coverage by 25.82%.
The diff coverage is 35.13%.


@@             Coverage Diff             @@
##           master      #40       +/-   ##
===========================================
- Coverage   96.61%   70.78%   -25.83%     
===========================================
  Files           3        4        +1     
  Lines          59       89       +30     
  Branches        5       10        +5     
===========================================
+ Hits           57       63        +6     
- Misses          2       26       +24
Impacted Files Coverage Δ
src/jupyter_spark/magic.py 0% <0%> (ø)
src/jupyter_spark/spark.py 100% <100%> (ø) ⬆️
src/jupyter_spark/handlers.py 100% <100%> (ø) ⬆️
src/jupyter_spark/__init__.py 44.44% <25%> (-15.56%) ⬇️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 34ab4bf...38ed34a.

@mdboom mdboom requested review from jezdez and mdboom May 11, 2018 22:30
@mdboom
Contributor

mdboom commented May 11, 2018

Thanks for the contribution! I hope to have a deeper look early next week.

Contributor

@mdboom mdboom left a comment


Thanks for this great submission. This has been a long-standing pain point.

In general, however, I'm concerned that this PR moves this from something that "just works", without any documentation or education required on the user's part, to something that we'll have to explain to folks - but there may be a way around that.

If you read the pyspark docs, it's clear that SparkSession and SparkContext are both singletons underneath, i.e. there will always be exactly one of them per kernel. So rather than having the magic to set the context, I think you can just get SparkSession._instantiatedSession and, if it is not None, use that to get the URL. It's not great to use a private API, but maybe we can convince upstream to add a .get() function for us (that would be distinct from getOrCreate()).
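
A minimal sketch of that suggestion, assuming the extension falls back to the configured URL (or the magic) when no session has been created yet:

from pyspark.sql import SparkSession

def default_spark_ui_url():
    # TODO: private API; upstream may eventually expose a public .get()
    session = SparkSession._instantiatedSession
    if session is not None:
        return session.sparkContext.uiWebUrl
    return None  # caller falls back to the configured Spark.url / the magic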

"spark = SparkSession \\\n",
" .builder \\\n",
" .appName(\"PythonPi\") \\\n",
" .master(\"yarn\") \\\n",

This causes my spark process to crash with:

Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

Is it required?
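
For reference, if the example notebook doesn't strictly need a cluster, a local master sidesteps the HADOOP_CONF_DIR/YARN_CONF_DIR requirement - an illustrative sketch, not necessarily what the PR should ship:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("PythonPi")
         .master("local[*]")   # instead of "yarn", which needs a Hadoop config
         .getOrCreate())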

README.md Outdated

To change the URL of the Spark API that the job metadata is fetched from
override the `Spark.url` config value, e.g. on the command line:
The Spark API that the job metadata is fetched from can be different for each SparkContext. By default, the first Spark context uses port 4040, the second 4041, and so on. If, however, `spark.ui.port` is set to 0 in SparkConf, Spark will choose a random ephemeral port for the API. In order to support this behaviour (and allow more than one tab in Jupyter with a SparkContext), copy the Jupyter notebook magic `spark_progress.py` from the `src/magic` folder into your IPython profile's startup folder, e.g. for the default profile this is `~/.ipython/profile_default/startup/`.

We shouldn't require users to copy something into their personal configuration to make this work. (There are many contexts in which Jupyter is run where the user doesn't even have access to that.) Instead, we should install the magic along with the package, with a load_ipython_extension hook in __init__.py, and then the user would add %load_ext jupyter_spark to their notebook to load the extension.
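
A rough sketch of that hook, assuming the magic class from the earlier sketch ships inside the package:

# src/jupyter_spark/__init__.py (hypothetical layout)
def load_ipython_extension(ipython):
    from .magic import SparkProgressMagics  # assumed module/class names
    ipython.register_magics(SparkProgressMagics)

# In the notebook, users would then simply run:
#   %load_ext jupyter_spark
#   %spark_progress spark   # only needed to override the URL explicitly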

request_path = request.uri[len(self.proxy_url):]
return url_path_join(self.url, request_path)
def backend_url(self, url, path):
request_path = path[len(self.proxy_root):]

Should use proxy_url here so it includes the baseUrl of IPython, if set.
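
A minimal sketch of the suggested change, assuming proxy_url is the proxy root joined onto the notebook's base_url (unlike the bare proxy_root):

from notebook.utils import url_path_join  # as used by the existing handler

# method on the proxy handler
def backend_url(self, url, path):
    # Strip the notebook-side prefix (base_url + proxy root) from the request
    # path, then append the remainder to the Spark API URL for this context.
    request_path = path[len(self.proxy_url):]
    return url_path_join(url, request_path)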

])

def test_http_fetch_error(self):
def test_http_fetch_error_url_mssing(self):

typo: mssing -> missing

@bernhard-42
Author

I guess I have changed the code accordingly. I personally don't really like using internal APIs; however, I understand your rationale. I marked it with a TODO.

@bernhard-42
Author

A side note: If you work on a Hadoop cluster (as I do, hence the yarn stuff last time), polling uiWebUrl means hitting the Resource Manager twice a second. If many users do this at the same time, this might create quite a lot of traffic. Maybe a less chatty approach would be to use sc.statusTracker in a background thread in the notebook, triggered by Jupyter cell hooks, and communicate the status to the notebook JavaScript via the Jupyter comm layer - just an idea ...
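
A rough sketch of that idea, assuming a "spark_progress" comm target on the JavaScript side; names, payload shape and polling interval are illustrative, and a real implementation would start/stop the thread from pre/post cell hooks:

import threading
import time

from ipykernel.comm import Comm

def start_progress_thread(sc, interval=0.5):
    comm = Comm(target_name="spark_progress")
    tracker = sc.statusTracker()

    def poll():
        while True:
            stages = []
            for stage_id in tracker.getActiveStageIds():
                info = tracker.getStageInfo(stage_id)
                if info is not None:
                    stages.append({"stageId": stage_id,
                                   "numTasks": info.numTasks,
                                   "numCompletedTasks": info.numCompletedTasks})
            # Push the status to the frontend instead of having it poll the
            # Spark API (and, behind YARN, the Resource Manager) over HTTP.
            comm.send({"activeStages": stages})
            time.sleep(interval)

    thread = threading.Thread(target=poll, daemon=True)
    thread.start()
    return thread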

@mdboom
Contributor

mdboom commented May 22, 2018

Thanks. I'm sorry -- I think I wasn't clear earlier. If you grab the Spark context from the singleton, then the magic is completely optional in the common case. You would only need to use the magic if you explicitly want to set the URL. Would you mind updating this so the magic is optional (and users can continue working as they have been unless this additional complexity is needed for them)?

@mdboom
Contributor

mdboom commented Jun 5, 2018

@bernhard-42 : Hope I didn't scare you off by creating confusion. Your contribution is very much appreciated.

@bernhard-42
Author

No worries; first I didn't have time and then I forgot about it ...
Hope it now meets your expectations. If not, please feel free to accept and adapt as you need - this might actually be the faster process. I am happy either way.

@ran-z

ran-z commented Dec 2, 2018

@mdboom Any news regarding this?
Or any other alternative solution for working with this extension on multiple tabs (each with a different Spark context and kernel)?

@stevenstetzler

Is there any update on these changes getting pulled into the main project, or any other news? This functionality would be very, very useful, and the lack of it is a major blocker to using this extension.

