
Conversation

@ornew ornew commented Mar 8, 2021

Web UI does not correctly get the appId when the URL contains proxy or history.

In my case, it happens on https://jupyterhub.hosted.us/my-name/proxy/4040/executors/.
The web developer console says: jquery-3.4.1.min.js:2 GET https://jupyterhub.hosted.us/user/my-name/proxy/4040/api/v1/applications/4040/allexecutors 404, and the UI shows me blank pages.

There is a related issue in JupyterHub: jupyterhub/jupyter-server-proxy#57

```javascript
var words = document.baseURI.split('/');
var ind = words.indexOf("proxy");
if (ind > 0) {
  var appId = words[ind + 1];
  cb(appId);
  return;
}
ind = words.indexOf("history");
if (ind > 0) {
  var appId = words[ind + 1];
  cb(appId);
  return;
}
```

The appId should not be taken from document.baseURI.
Fetching it via the API adds one request, but the performance impact is small.
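To make the failure mode concrete, here is a Python sketch (the function name is mine, not Spark's) of the same lookup the JavaScript above performs; any URL with a path element named "proxy" or "history" makes it return the following element, so a JupyterHub-style path yields the proxied port instead of the real appId:

```python
def app_id_from_base_uri(base_uri):
    # Mimics the JavaScript above: split the URI on "/" and return the
    # path element that follows "proxy" or "history", if present.
    words = base_uri.split("/")
    for marker in ("proxy", "history"):
        if marker in words:
            ind = words.index(marker)
            if ind > 0:
                return words[ind + 1]
    return None  # fall back to other appId-discovery logic

# A JupyterHub URL: "proxy" here belongs to jupyter-server-proxy's path
# scheme, so the port number is mistaken for the appId.
url = "https://jupyterhub.hosted.us/user/my-name/proxy/4040/executors/"
print(app_id_from_base_uri(url))  # -> 4040, not a real appId
```

This is exactly why the 404 above hits applications/4040/allexecutors: "4040" is the proxied port, not an application ID.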

What changes were proposed in this pull request?

This change always gets the appId using the REST API.

Why are the changes needed?

The UI does not appear correctly in some environments, for example behind JupyterHub.

Does this PR introduce any user-facing change?

No, this is a bug fix.

How was this patch tested?

I verified that it works correctly in my browser.

@AmplabJenkins

Can one of the admins verify this patch?

@dongjoon-hyun
Member

Thank you for making a PR, @ornew .

cc @gengliangwang

@gengliangwang
Member

@ornew could you show the steps to reproduce this on Spark? It seems that JupyterHub is using a Spark UI URL different from Spark's own.

Also, the code change breaks the existing logic: Spark can get the app ID without accessing the REST API if the URL contains proxy or history.

@dongjoon-hyun
Member

Gentle ping, @ornew .

@ornew
Author

ornew commented Mar 16, 2021

@dongjoon-hyun @gengliangwang Thank you for your reply.

@ornew could you show reproduce steps on Spark? It seems that the jupyterhub is using a different Spark UI URL from Spark.

It's easy: run JupyterHub with jupyter-server-proxy, then start PySpark.

I run JupyterHub on Kubernetes and provide a sandbox for a large number of users. Accessing the Spark UI requires some kind of proxy, and JupyterHub's server proxy allows users to reach ports in their sandbox environment without extra setup.

```python
from pyspark import *
from pyspark.sql import *

spark = SparkSession.builder.getOrCreate()
```

[Screenshot 2021-03-16 14:58:57]

When accessing the Spark UI through the proxied port, the Jupyter Server Proxy path contains proxy, which causes incorrect parsing.

[Screenshot 2021-03-16 15:01:56]

@gengliangwang
Member

@ornew Is it possible to fix this in JupyterHub? The issue is not in Spark itself.

@ornew
Author

ornew commented Mar 16, 2021

@gengliangwang I think this is a Spark issue.

This happens whenever the path contains proxy or history, even without JupyterHub. There are many use cases where the path does not include the appId, such as standalone or Kubernetes, and many setups access the UI via a proxy. Would you please reconsider the logic for getting the appId, which depends on the environment and the URL?

@gengliangwang
Member

@ornew I mean, if we can't reproduce the issue on a Spark cluster, then it is not an issue in Spark itself.
Spark does support running behind a reverse proxy; see #29820 for details.

Member

BTW, have you tested the code changes for Spark UI behind proxy and Spark UI of History server?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 28, 2021
@github-actions github-actions bot closed this Jun 29, 2021
@PerilousApricot

PerilousApricot commented Dec 14, 2021

@gengliangwang -- I actually have a very simple reproducer using nginx as a reverse proxy rather than jupyterhub (to eliminate that failure mode). The following script sets up the proxy; note that it forwards /user/PerilousApricot/proxy/4040/ to the root of the Spark web UI (the URL is what jupyterhub would use, but this is a plain reverse proxy without jupyterhub).

proxy-fail.sh

```shell
#!/bin/bash

cat << \EOT > nginx.conf
user  nginx;
worker_processes  auto;
error_log  /var/log/nginx/error.log notice;
pid        /var/run/nginx.pid;
events {
    worker_connections  1024;
}
http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;
    log_format  main  '[$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_x_forwarded_for"';
    access_log  /dev/stdout  main;
    server {
        listen       5050;
        server_name  localhost;
        location /user/PerilousApricot/proxy/4040/ {
            error_log  /dev/stderr debug;
            proxy_pass http://localhost:4040/;
            proxy_pass_header Content-Type;
        }
    }
}
EOT

docker run -it --rm=true --name spark-31174-proxy --network=host -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro nginx
```

Run that proxy in one terminal, then run pyspark:

```shell
SPARK_PUBLIC_DNS=localhost:5050/user/PerilousApricot/proxy/4040/jobs/ pyspark --conf spark.ui.reverseProxyUrl=http://localhost:5050/user/PerilousApricot/proxy/4040/ --conf spark.driver.extraJavaOptions="-Dlog4j.debug=true" --conf spark.ui.proxyBase=/user/PerilousApricot/proxy/4040/ --conf spark.app.name=proxyApp
```

Open http://localhost:5050/user/PerilousApricot/proxy/4040/executors/ in a browser with "developer mode" enabled to watch the traffic come by. You will see a number of successful requests to various resources like:

 http://localhost:5050/user/PerilousApricot/proxy/4040//static/webui.css
 http://localhost:5050/user/PerilousApricot/proxy/4040//static/webui.js

Notice, however, that there is a failed request (the reason for this PR):

http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/4040/allexecutors

If you run curl manually, you can see that the request fails both at the reverse proxy and at the actual web UI itself:

curl -v -o /dev/null http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/4040/allexecutors
curl -v -o /dev/null http://localhost:4040/api/v1/applications/4040/allexecutors

But if you copy-paste the appId from the spark console (in my case I have: Spark context available as 'sc' (master = local[*], app id = local-1639522961946).), the following two requests succeed:

curl http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/local-1639522961946
curl -v -o /dev/null http://localhost:4040/api/v1/applications/local-1639522961946
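The curl pair above suggests what the UI could do instead (and what this PR proposes): the UI also serves a listing endpoint, api/v1/applications, whose JSON response includes each application's real id. A minimal Python sketch, using a canned response of the general shape that endpoint returns so it runs without a live Spark UI (the helper name is mine):

```python
import json

# A response of the general shape returned by GET <ui-root>/api/v1/applications,
# canned here so the sketch runs without a live Spark UI.
canned = json.dumps([
    {"id": "local-1639522961946", "name": "proxyApp", "attempts": []}
])

def first_app_id(applications_json):
    # Read the id of the first listed application instead of guessing
    # it from the page URL.
    apps = json.loads(applications_json)
    return apps[0]["id"] if apps else None

print(first_app_id(canned))  # -> local-1639522961946

# Against a live UI this would be something like:
#   body = urllib.request.urlopen("http://localhost:4040/api/v1/applications").read()
#   first_app_id(body)
```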

To confirm the issue, let's restart the proxy and pyspark, but instead of proxying /user/PerilousApricot/proxy/4040/, let's proxy /user/PerilousApricot/yxorp/4040/ (note that there is no "proxy" path element in the proxied URL). First execute
proxy-win.sh

```shell
#!/bin/bash

cat << \EOT > nginx.conf
user  nginx;
worker_processes  auto;
error_log  /var/log/nginx/error.log notice;
pid        /var/run/nginx.pid;
events {
    worker_connections  1024;
}
http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;
    log_format  main  '[$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_x_forwarded_for"';
    access_log  /dev/stdout  main;
    server {
        listen       5050;
        server_name  localhost;
        location /user/PerilousApricot/yxorp/4040/ {
            #error_log  /dev/stderr debug;
            proxy_pass http://localhost:4040/;
            #proxy_redirect     off;
            proxy_pass_header Content-Type;
            #rewrite /user/PerilousApricot/yxorp/4040(/.*|$) $1  break;
        }
    }
}
EOT

docker run -it --rm=true --name spark-31174-proxy --network=host -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro nginx
```

and then run in a different terminal

```shell
SPARK_PUBLIC_DNS=localhost:5050/user/PerilousApricot/yxorp/4040/jobs/ pyspark --conf spark.ui.reverseProxyUrl=http://localhost:5050/user/PerilousApricot/yxorp//4040/ --conf spark.driver.extraJavaOptions="-Dlog4j.debug=true" --conf spark.ui.proxyBase=/user/PerilousApricot/yxorp/4040/ --conf spark.app.name=proxyApp
```

Open http://localhost:5050/user/PerilousApricot/yxorp/4040//executors/ and you can see that the page renders properly. Looking at the developer console, you will see that instead of attempting to open

http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/4040/allexecutors

this version requests the status of the executors from

http://localhost:5050/user/PerilousApricot/yxorp/4040//api/v1/applications/local-1639523380430/allexecutors

I hope this is enough to show that @ornew did the right analysis -- the fault isn't with jupyterhub; it is simply that the logic that tries to look up the appId chokes if there is a path element named "proxy" in the URL.

Can you please re-examine this?

EDIT: I tested with Spark 3.2.0.

@gengliangwang
Member

@PerilousApricot Are you running Spark as a cluster? If yes, Spark supports reverse proxies; see the following PRs for details:
#13950
#29820

@PerilousApricot

PerilousApricot commented Dec 15, 2021

Hi @gengliangwang, this is running in client mode. The use case is running Spark within a Jupyter notebook.

Thanks for the pointers, but the point of the PR is that there is a bug in how the reverse proxying is handled. As the reproducer shows, I am using the config options mentioned in #13950 and #29820.

@PerilousApricot

In the current master, when you reverse proxy to

http://localhost:5050/user/PerilousApricot/proxy/4040/

then Spark UI tries to do an API call to

http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/4040/allexecutors

to retrieve the executor status, but this is incorrect; it should be

http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/local-1639523380430/allexecutors

(where local-1639523380430 is the appId of the SparkContext).

The problem is that Spark itself bungles the handling of the appId. You said in #31774 (comment) that the problem was unreproducible on a Spark cluster; I hope that the reproducer I put in the comment above is enough to show the issue. Please let me know if I can help clarify it further.

@PerilousApricot

Hello @gengliangwang and Happy New Year! I'm back from vacation and was wondering if you had further thoughts on this issue. Were you able to reproduce the bug?

@PerilousApricot

Hello @gengliangwang, checking up on this issue. I gave a reproducer above that clearly shows the problem. Have you had a chance to take a look?

@gengliangwang
Member

@PerilousApricot I will take a close look before the 3.3 release.

@PerilousApricot

@gengliangwang Thank you very much! This would be a huge relief for our use case.

@MaxGekk
Member

MaxGekk commented Apr 4, 2022

@gengliangwang Any chance this is finished soon?

@PerilousApricot

PerilousApricot commented Apr 4, 2022 via email

@dongjoon-hyun
Member

dongjoon-hyun commented Apr 4, 2022

According to the recent comments, I removed the Stale tag and reopened this. However, the PR itself already has conflicts due to the long inactivity.

@PerilousApricot

Hello @dongjoon-hyun I'm happy to update the PR if there is someone available to review and merge the resulting code.

@MaxGekk
Member

MaxGekk commented Apr 11, 2022

@gengliangwang @dongjoon-hyun Could you please help review this if we plan to have it in 3.3 (just in case, it is in the allow list)?

@MaxGekk
Member

MaxGekk commented Apr 11, 2022

@ornew Please, resolve conflicts.

@gengliangwang
Member

@ornew @PerilousApricot I think I got your point now. You would like to use the reverse proxy feature on a single Spark node, instead of standalone mode with master/worker.
The issue happens when the proxy prefix URL contains proxy. For example, if the URL is "/proxy/4040", the app ID will be parsed as 4040.
It seems that we could always get the app ID via the REST API, but I am not sure whether that would cause any problems. A better fix is to check whether the prefix URL contains the word "proxy" or "history".
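The keyword check suggested here could be sketched as a validation of the configured prefix; the function name and error text below are illustrative, not Spark's actual implementation:

```python
# Path elements the UI's appId lookup treats as markers; a prefix like
# ".../proxy/4040/..." would make the UI parse "4040" as the app ID.
FORBIDDEN_KEYWORDS = ("proxy", "history")

def validate_reverse_proxy_url(url):
    # Reject reverse proxy prefixes containing the marker keywords.
    parts = [p for p in url.split("/") if p]
    for keyword in FORBIDDEN_KEYWORDS:
        if keyword in parts:
            raise ValueError(
                f"reverse proxy URL should not contain '{keyword}'")
    return url

print(validate_reverse_proxy_url("/user/PerilousApricot/yxorp/4040/"))  # accepted
# validate_reverse_proxy_url("/user/PerilousApricot/proxy/4040/")       # raises ValueError
```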

@gengliangwang
Member

SPARK_PUBLIC_DNS=localhost:5050/user/PerilousApricot/proxy/4040/jobs/ pyspark --conf spark.ui.reverseProxyUrl=http://localhost:5000/user/PerilousApricot/proxy/4040/ --conf spark.driver.extraJavaOptions="-Dlog4j.debug=true" --conf spark.ui.proxyBase=/user/PerilousApricot/proxy/4040/ --conf spark.app.name=proxyApp

@PerilousApricot should all the ports be 5050? You are setting the reverse proxy URL as http://localhost:5000, which is confusing.

@PerilousApricot

@gengliangwang Yes, thanks for the catch. I'll update my comment (must've transposed something when copy-pasting)

dongjoon-hyun pushed a commit that referenced this pull request Apr 13, 2022
…e proxy URL

### What changes were proposed in this pull request?

When the reverse proxy URL contains "proxy" or "history", the application ID in the UI is wrongly parsed.
For example, if we set spark.ui.reverseProxyUrl to "/test/proxy/prefix" or "/test/history/prefix", the application ID is parsed as "prefix" and the related API calls fail on the stages/executors pages:
```
.../api/v1/applications/prefix/allexecutors
```
instead of
```
.../api/v1/applications/app-20220413142241-0000/allexecutors
```

There is more context in #31774.
We could fix this entirely as in #36174, but that is risky and complicated.

### Why are the changes needed?

This prevents users from setting these keywords in the reverse proxy URL and getting wrong UI results.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

A new unit test.
Also doc preview:
<img width="1743" alt="image" src="https://user-images.githubusercontent.com/1097932/163126641-da315012-aae5-45a5-a048-340a5dd6e91e.png">

Closes #36176 from gengliangwang/forbidURLPrefix.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@PerilousApricot

@MaxGekk there are some things @gengliangwang and @dongjoon-hyun would like fixed with this PR, but I'm hopeful that we can converge soon on a solution they'll accept.

Web UI does not correctly get the appId when the URL contains `proxy` or `history`.

In my case, it happened on `https://jupyterhub.hosted.our/my-name/proxy/4040/executors/`.
There is a related issue in JupyterHub: jupyterhub/jupyter-server-proxy#57

It should not get the appId from document.baseURI.
A request will occur, but the performance impact is small.
@ornew ornew force-pushed the fix-web-ui-get-correct-app-id branch from 14bd9b2 to 9548338 Compare April 16, 2022 13:05
@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@lordk911

@PerilousApricot
jupyterhub/jupyter-server-proxy#57 (comment)

Use this as a workaround.

