This repository was archived by the owner on Feb 4, 2021. It is now read-only.

Conversation

@bernhard-42

Proposal for Issue 22:

In the Jupyter notebook, a Jupyter Comm target is opened to listen for messages from the Python kernel. A new Jupyter magic uses this comm target to forward the Spark API URL to the notebook:

%spark_progress spark

where spark is the variable holding the Spark session, so the magic can use globals()["spark"].sparkContext.uiWebUrl to get the actual Spark API URL.

Each call from the notebook's JavaScript then forwards the Spark API URL as a query parameter spark_url to the backend handler, which uses it to build the backend_url.

This allows for multiple SparkContexts in different tabs and even for the spark.ui.port=0 setting.
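
For illustration, the magic described above could look roughly like this - a minimal sketch with illustrative names (the actual implementation in this PR lives in src/jupyter_spark/magic.py, and the "spark_progress" comm target is registered on the JavaScript side):

from ipykernel.comm import Comm
from IPython.core.magic import Magics, magics_class, line_magic

@magics_class
class SparkProgressMagics(Magics):
    @line_magic
    def spark_progress(self, line):
        """Usage: %spark_progress spark  (the variable holding the SparkSession)."""
        session = self.shell.user_ns[line.strip()]
        spark_url = session.sparkContext.uiWebUrl   # the actual Spark API URL
        # Open a Comm to the "spark_progress" target registered by the notebook's
        # JavaScript and hand over the URL; the frontend then attaches it as a
        # ?spark_url=... query parameter on every request to the proxy handler.
        Comm(target_name="spark_progress", data={"spark_url": spark_url})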

@codecov-io

codecov-io commented May 1, 2018

Codecov Report

Merging #40 into master will decrease coverage by 25.82%.
The diff coverage is 35.13%.


@@             Coverage Diff             @@
##           master      #40       +/-   ##
===========================================
- Coverage   96.61%   70.78%   -25.83%     
===========================================
  Files           3        4        +1     
  Lines          59       89       +30     
  Branches        5       10        +5     
===========================================
+ Hits           57       63        +6     
- Misses          2       26       +24
Impacted Files Coverage Δ
src/jupyter_spark/magic.py 0% <0%> (ø)
src/jupyter_spark/spark.py 100% <100%> (ø) ⬆️
src/jupyter_spark/handlers.py 100% <100%> (ø) ⬆️
src/jupyter_spark/__init__.py 44.44% <25%> (-15.56%) ⬇️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 34ab4bf...38ed34a.

@mdboom mdboom requested review from jezdez and mdboom May 11, 2018 22:30
@mdboom
Contributor

mdboom commented May 11, 2018

Thanks for the contribution! I hope to have a deeper look early next week.

Contributor

@mdboom mdboom left a comment


Thanks for this great submission. This has been a long-standing pain point.

In general, however, I'm concerned that this PR moves this from something that "just works", without any documentation or education required on the user's part, to something that we'll have to explain to folks - but there may be a way around that.

If you read the pyspark docs, it's clear that SparkSession and SparkContext are both singletons underneath, i.e. there will always be exactly one of them per kernel. So rather than having the magic to set the context, I think you can just get SparkSession._instantiatedSession and, if it is not None, use that to get the URL. It's not great to use a private API, but maybe we can convince upstream to add a .get() function for us (that would be distinct from getOrCreate()).
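
A minimal sketch of that suggestion, assuming the extension falls back to the configured URL (or the magic) when no session has been created yet:

from pyspark.sql import SparkSession

def default_spark_ui_url():
    # TODO: private API; upstream may eventually expose a public .get()
    session = SparkSession._instantiatedSession
    if session is not None:
        return session.sparkContext.uiWebUrl
    return None  # caller falls back to the configured Spark.url / the magic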

"spark = SparkSession \\\n",
" .builder \\\n",
" .appName(\"PythonPi\") \\\n",
" .master(\"yarn\") \\\n",

This causes my spark process to crash with:

Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

Is it required?
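
For reference, if the example notebook doesn't strictly need a cluster, a local master sidesteps the HADOOP_CONF_DIR/YARN_CONF_DIR requirement - an illustrative sketch, not necessarily what the PR should ship:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("PythonPi")
         .master("local[*]")   # instead of "yarn", which needs a Hadoop config
         .getOrCreate())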

README.md Outdated

To change the URL of the Spark API that the job metadata is fetched from
override the `Spark.url` config value, e.g. on the command line:
The Spark API that the job metadata is fetched from can be different for each SparkContext. By default, the first Spark context uses port 4040, the second 4041, and so on. If, however, `spark.ui.port` is set to 0 in SparkConf, Spark will choose a random ephemeral port for the API. In order to support this behaviour (and allow more than one tab in Jupyter with a SparkContext), copy the Jupyter notebook magic `spark_progress.py` from the `src/magic` folder into your IPython profile's startup folder, e.g. for the default profile this is `~/.ipython/profile_default/startup/`.

We shouldn't require users to copy something into their personal configuration to make this work. (There are many contexts in which Jupyter is run where the user doesn't even have access to that.) Instead, we should install the magic along with the package, with a load_ipython_extension hook in __init__.py, and then the user would add %load_ext jupyter_spark to their notebook to load the extension.
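
A rough sketch of that hook, assuming the magic class from the earlier sketch ships inside the package:

# src/jupyter_spark/__init__.py (hypothetical layout)
def load_ipython_extension(ipython):
    from .magic import SparkProgressMagics  # assumed module/class names
    ipython.register_magics(SparkProgressMagics)

# In the notebook, users would then simply run:
#   %load_ext jupyter_spark
#   %spark_progress spark   # only needed to override the URL explicitly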

request_path = request.uri[len(self.proxy_url):]
return url_path_join(self.url, request_path)
def backend_url(self, url, path):
request_path = path[len(self.proxy_root):]

Should use proxy_url here so it includes the baseUrl of IPython, if set.
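
A minimal sketch of the suggested change, assuming proxy_url is the proxy root joined onto the notebook's base_url (unlike the bare proxy_root):

from notebook.utils import url_path_join  # as used by the existing handler

# method on the proxy handler
def backend_url(self, url, path):
    # Strip the notebook-side prefix (base_url + proxy root) from the request
    # path, then append the remainder to the Spark API URL for this context.
    request_path = path[len(self.proxy_url):]
    return url_path_join(url, request_path)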

])

def test_http_fetch_error(self):
def test_http_fetch_error_url_mssing(self):

typo: mssing -> missing

@bernhard-42
Author

I guess I have changed the code accordingly. I personally don't really like using internal APIs; however, I understand your rationale. I marked it with a TODO.

@bernhard-42
Author

A side note: If you work on a Hadoop cluster (as I do, hence the yarn stuff last time), polling uiWebUrl means hitting the Resource Manager twice a second. If many users do this at the same time, this might create quite a lot of traffic. Maybe a less chatty approach would be to use sc.statusTracker in a background thread in the notebook, triggered by Jupyter cell hooks, and communicate the status to the notebook JavaScript via the Jupyter comm layer - just an idea ...
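
A rough sketch of that idea, assuming a "spark_progress" comm target on the JavaScript side; names, payload shape and polling interval are illustrative, and a real implementation would start/stop the thread from pre/post cell hooks:

import threading
import time

from ipykernel.comm import Comm

def start_progress_thread(sc, interval=0.5):
    comm = Comm(target_name="spark_progress")
    tracker = sc.statusTracker()

    def poll():
        while True:
            stages = []
            for stage_id in tracker.getActiveStageIds():
                info = tracker.getStageInfo(stage_id)
                if info is not None:
                    stages.append({"stageId": stage_id,
                                   "numTasks": info.numTasks,
                                   "numCompletedTasks": info.numCompletedTasks})
            # Push the status to the frontend instead of having it poll the
            # Spark API (and, behind YARN, the Resource Manager) over HTTP.
            comm.send({"activeStages": stages})
            time.sleep(interval)

    thread = threading.Thread(target=poll, daemon=True)
    thread.start()
    return thread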

@mdboom
Contributor

mdboom commented May 22, 2018

Thanks. I'm sorry -- I think I wasn't clear earlier. If you grab the Spark context from the singleton, then the magic is completely optional in the common case. You would only need to use the magic if you explicitly want to set the URL. Would you mind updating this so the magic is optional (and users can continue working as they have been unless this additional complexity is needed for them)?

@mdboom
Contributor

mdboom commented Jun 5, 2018

@bernhard-42 : Hope I didn't scare you off by creating confusion. Your contribution is very much appreciated.

@bernhard-42
Author

No worries; first I didn't have time and then I forgot about it ...
Hope it now meets your expectations. If not, please feel free to accept and adapt as you need - this might actually be the faster process. I am happy either way.

@ran-z

ran-z commented Dec 2, 2018

@mdboom Any news regarding this?
Or any other alternative solution for working with this extension on multiple tabs (each with a different Spark context and kernel)?

@stevenstetzler

Is there any update on these changes getting pulled into the main project, or any other news? This functionality would be very, very useful, and the lack of it is a major blocker to using this extension.

