
Conversation

@ornew ornew commented Mar 8, 2021

Web UI does not correctly get the appId when the URL contains proxy or history.

In my case, it happens on https://jupyterhub.hosted.us/my-name/proxy/4040/executors/.
The web developer console says: jquery-3.4.1.min.js:2 GET https://jupyterhub.hosted.us/user/my-name/proxy/4040/api/v1/applications/4040/allexecutors 404, and the UI shows me blank pages.

There is a related issue in JupyterHub: jupyterhub/jupyter-server-proxy#57

```javascript
var words = document.baseURI.split('/');
var ind = words.indexOf("proxy");
if (ind > 0) {
  var appId = words[ind + 1];
  cb(appId);
  return;
}
ind = words.indexOf("history");
if (ind > 0) {
  var appId = words[ind + 1];
  cb(appId);
  return;
}
```

The appId should not be taken from document.baseURI.
Fetching it via the API adds one request, but the performance impact is small.
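To make the failure mode concrete, here is a Python sketch (the function name is mine, not Spark's) of the same lookup the JavaScript above performs; any URL with a path element named "proxy" or "history" makes it return the following element, so a JupyterHub-style path yields the proxied port instead of the real appId:

```python
def app_id_from_base_uri(base_uri):
    # Mimics the JavaScript above: split the URI on "/" and return the
    # path element that follows "proxy" or "history", if present.
    words = base_uri.split("/")
    for marker in ("proxy", "history"):
        if marker in words:
            ind = words.index(marker)
            if ind > 0:
                return words[ind + 1]
    return None  # fall back to other appId-discovery logic

# A JupyterHub URL: "proxy" here belongs to jupyter-server-proxy's path
# scheme, so the port number is mistaken for the appId.
url = "https://jupyterhub.hosted.us/user/my-name/proxy/4040/executors/"
print(app_id_from_base_uri(url))  # -> 4040, not a real appId
```

This is exactly why the 404 above hits applications/4040/allexecutors: "4040" is the proxied port, not an application ID.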

What changes were proposed in this pull request?

This change always gets the appId using the REST API.

Why are the changes needed?

The UI does not appear correctly in some environments, for example behind JupyterHub.

Does this PR introduce any user-facing change?

No, this is a bug fix.

How was this patch tested?

I verified that it works correctly in my browser.

@AmplabJenkins

Can one of the admins verify this patch?

@dongjoon-hyun
Member

Thank you for making a PR, @ornew .

cc @gengliangwang

@gengliangwang
Member

@ornew could you show the steps to reproduce this on Spark? It seems that JupyterHub is using a Spark UI URL different from Spark's own.

Also, the code change breaks the existing logic: Spark can get the app ID without accessing the REST API if the URL contains proxy or history.

@dongjoon-hyun
Member

Gentle ping, @ornew .

@ornew
Author

ornew commented Mar 16, 2021

@dongjoon-hyun @gengliangwang Thank you for your reply.

@ornew could you show reproduce steps on Spark? It seems that the jupyterhub is using a different Spark UI URL from Spark.

It's easy: run JupyterHub with jupyter-server-proxy, then start PySpark.

I run JupyterHub on Kubernetes and provide a sandbox for a large number of users. Accessing the Spark UI requires some kind of proxy, and JupyterHub's server proxy allows users to reach ports in their sandbox environment without extra setup.

```python
from pyspark import *
from pyspark.sql import *

spark = SparkSession.builder.getOrCreate()
```

[Screenshot 2021-03-16 14:58:57]

When accessing the Spark UI through the proxied port, the Jupyter Server Proxy path contains proxy, which causes incorrect parsing.

[Screenshot 2021-03-16 15:01:56]

@gengliangwang
Member

@ornew Is it possible to fix this in JupyterHub? The issue is not in Spark itself.

@ornew
Author

ornew commented Mar 16, 2021

@gengliangwang I think this is a Spark issue.

This happens whenever the path contains proxy or history, even without JupyterHub. There are many use cases where the path does not include the appId, such as standalone or Kubernetes, and many setups access the UI via a proxy. Would you please reconsider the logic for getting the appId, which depends on the environment and the URL?

@gengliangwang
Member

@ornew I mean, if we can't reproduce the issue on a Spark cluster, then it is not an issue in Spark itself.
Spark does support running behind a reverse proxy; see #29820 for details.

Member

BTW, have you tested the code changes for Spark UI behind proxy and Spark UI of History server?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 28, 2021
@github-actions github-actions bot closed this Jun 29, 2021
@PerilousApricot

PerilousApricot commented Dec 14, 2021

@gengliangwang -- I actually have a very simple reproducer using nginx as a reverse proxy rather than jupyterhub (to eliminate that failure mode). The following script sets up the proxy; note that it forwards /user/PerilousApricot/proxy/4040/ to the root of the Spark web UI (the URL is what jupyterhub would use, but this is a plain reverse proxy without jupyterhub).

proxy-fail.sh

```shell
#!/bin/bash

cat << \EOT > nginx.conf
user  nginx;
worker_processes  auto;
error_log  /var/log/nginx/error.log notice;
pid        /var/run/nginx.pid;
events {
    worker_connections  1024;
}
http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;
    log_format  main  '[$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_x_forwarded_for"';
    access_log  /dev/stdout  main;
    server {
        listen       5050;
        server_name  localhost;
        location /user/PerilousApricot/proxy/4040/ {
            error_log  /dev/stderr debug;
            proxy_pass http://localhost:4040/;
            proxy_pass_header Content-Type;
        }
    }
}
EOT

docker run -it --rm=true --name spark-31174-proxy --network=host -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro nginx
```

Run that proxy in one terminal, then run pyspark:

```shell
SPARK_PUBLIC_DNS=localhost:5050/user/PerilousApricot/proxy/4040/jobs/ pyspark --conf spark.ui.reverseProxyUrl=http://localhost:5050/user/PerilousApricot/proxy/4040/ --conf spark.driver.extraJavaOptions="-Dlog4j.debug=true" --conf spark.ui.proxyBase=/user/PerilousApricot/proxy/4040/ --conf spark.app.name=proxyApp
```

Open http://localhost:5050/user/PerilousApricot/proxy/4040/executors/ in a browser with "developer mode" enabled to watch the traffic come by. You will see a number of successful requests to various resources like:

 http://localhost:5050/user/PerilousApricot/proxy/4040//static/webui.css
 http://localhost:5050/user/PerilousApricot/proxy/4040//static/webui.js

Notice, however, that there is a failed request (the reason for this PR):

http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/4040/allexecutors

If you run curl manually, you can see that the request fails both at the reverse proxy and at the actual web UI itself:

curl -v -o /dev/null http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/4040/allexecutors
curl -v -o /dev/null http://localhost:4040/api/v1/applications/4040/allexecutors

But if you copy-paste the appId from the spark console (in my case I have: Spark context available as 'sc' (master = local[*], app id = local-1639522961946).), the following two requests succeed:

curl http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/local-1639522961946
curl -v -o /dev/null http://localhost:4040/api/v1/applications/local-1639522961946
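The curl pair above suggests what the UI could do instead (and what this PR proposes): the UI also serves a listing endpoint, api/v1/applications, whose JSON response includes each application's real id. A minimal Python sketch, using a canned response of the general shape that endpoint returns so it runs without a live Spark UI (the helper name is mine):

```python
import json

# A response of the general shape returned by GET <ui-root>/api/v1/applications,
# canned here so the sketch runs without a live Spark UI.
canned = json.dumps([
    {"id": "local-1639522961946", "name": "proxyApp", "attempts": []}
])

def first_app_id(applications_json):
    # Read the id of the first listed application instead of guessing
    # it from the page URL.
    apps = json.loads(applications_json)
    return apps[0]["id"] if apps else None

print(first_app_id(canned))  # -> local-1639522961946

# Against a live UI this would be something like:
#   body = urllib.request.urlopen("http://localhost:4040/api/v1/applications").read()
#   first_app_id(body)
```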

To confirm the issue, let's restart the proxy and pyspark, but instead of proxying /user/PerilousApricot/proxy/4040/, let's proxy /user/PerilousApricot/yxorp/4040/ (note that there is no "proxy" path element in the proxied URL). First execute
proxy-win.sh

```shell
#!/bin/bash

cat << \EOT > nginx.conf
user  nginx;
worker_processes  auto;
error_log  /var/log/nginx/error.log notice;
pid        /var/run/nginx.pid;
events {
    worker_connections  1024;
}
http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;
    log_format  main  '[$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_x_forwarded_for"';
    access_log  /dev/stdout  main;
    server {
        listen       5050;
        server_name  localhost;
        location /user/PerilousApricot/yxorp/4040/ {
            #error_log  /dev/stderr debug;
            proxy_pass http://localhost:4040/;
            #proxy_redirect     off;
            proxy_pass_header Content-Type;
            #rewrite /user/PerilousApricot/yxorp/4040(/.*|$) $1  break;
        }
    }
}
EOT

docker run -it --rm=true --name spark-31174-proxy --network=host -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro nginx
```

and then run in a different terminal

```shell
SPARK_PUBLIC_DNS=localhost:5050/user/PerilousApricot/yxorp/4040/jobs/ pyspark --conf spark.ui.reverseProxyUrl=http://localhost:5050/user/PerilousApricot/yxorp//4040/ --conf spark.driver.extraJavaOptions="-Dlog4j.debug=true" --conf spark.ui.proxyBase=/user/PerilousApricot/yxorp/4040/ --conf spark.app.name=proxyApp
```

Open http://localhost:5050/user/PerilousApricot/yxorp/4040//executors/ and you can see that the page renders properly. Looking at the developer console, you will see that instead of attempting to open

http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/4040/allexecutors

this version requests the status of the executors from

http://localhost:5050/user/PerilousApricot/yxorp/4040//api/v1/applications/local-1639523380430/allexecutors

I hope this is enough to show that @ornew did the right analysis -- the fault isn't with jupyterhub; it is simply that the logic that tries to look up the appId chokes if there is a path element named "proxy" in the URL.

Can you please re-examine this?

EDIT: I tested with Spark 3.2.0.

@gengliangwang
Member

@PerilousApricot Are you running Spark as a cluster? If yes, Spark supports reverse proxies; see the following PRs for details:
#13950
#29820

@PerilousApricot

PerilousApricot commented Dec 15, 2021

Hi @gengliangwang, this is running in client mode. The use case is running Spark within a Jupyter notebook.

Thanks for the pointers, but the point of the PR is that there is a bug in how the reverse proxying is handled. As the reproducer shows, I am using the config options mentioned in #13950 and #29820.

@PerilousApricot

In the current master, when you reverse proxy to

http://localhost:5050/user/PerilousApricot/proxy/4040/

then Spark UI tries to do an API call to

http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/4040/allexecutors

to retrieve the executor status, but this is incorrect; it should be

http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/local-1639523380430/allexecutors

(where local-1639523380430 is the appId of the SparkContext).

The problem is that Spark itself bungles the handling of the appId. You said in #31774 (comment) that the problem was unreproducible on a Spark cluster; I hope that the reproducer I put in the comment above is enough to show the issue. Please let me know if I can help clarify it further.

@PerilousApricot

Hello @gengliangwang and Happy New Year! I'm back from vacation and was wondering if you had further thoughts on this issue. Were you able to reproduce the bug?

@PerilousApricot

Hello @gengliangwang, checking up on this issue. I gave a reproducer above that clearly shows the problem. Have you had a chance to take a look?

@gengliangwang
Member

@PerilousApricot I will take a close look before the 3.3 release.

@PerilousApricot

@gengliangwang Thank you very much! This would be a huge relief for our use case.

@MaxGekk
Member

MaxGekk commented Apr 4, 2022

@gengliangwang Any chance this is finished soon?

@PerilousApricot

PerilousApricot commented Apr 4, 2022 via email

@dongjoon-hyun
Member

dongjoon-hyun commented Apr 4, 2022

According to the recent comments, I removed the Stale tag and reopened this. However, the PR itself already has conflicts due to the long inactivity.

@PerilousApricot

Hello @dongjoon-hyun I'm happy to update the PR if there is someone available to review and merge the resulting code.

@MaxGekk
Member

MaxGekk commented Apr 11, 2022

@gengliangwang @dongjoon-hyun Could you please help review this if we plan to have it in 3.3 (just in case, it is in the allow list)?

@MaxGekk
Member

MaxGekk commented Apr 11, 2022

@ornew Please, resolve conflicts.

@gengliangwang
Member

@ornew @PerilousApricot I think I got your point now. You would like to use the reverse proxy feature on a single Spark node, instead of standalone mode with master/worker.
The issue happens when the proxy prefix URL contains proxy. For example, if the URL is "/proxy/4040", the app ID will be parsed as 4040.
It seems that we could always get the app ID via the REST API, but I am not sure whether that would cause any problems. A better fix is to check whether the prefix URL contains the word "proxy" or "history".
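The keyword check suggested here could be sketched as a validation of the configured prefix; the function name and error text below are illustrative, not Spark's actual implementation:

```python
# Path elements the UI's appId lookup treats as markers; a prefix like
# ".../proxy/4040/..." would make the UI parse "4040" as the app ID.
FORBIDDEN_KEYWORDS = ("proxy", "history")

def validate_reverse_proxy_url(url):
    # Reject reverse proxy prefixes containing the marker keywords.
    parts = [p for p in url.split("/") if p]
    for keyword in FORBIDDEN_KEYWORDS:
        if keyword in parts:
            raise ValueError(
                f"reverse proxy URL should not contain '{keyword}'")
    return url

print(validate_reverse_proxy_url("/user/PerilousApricot/yxorp/4040/"))  # accepted
# validate_reverse_proxy_url("/user/PerilousApricot/proxy/4040/")       # raises ValueError
```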

@gengliangwang
Member

SPARK_PUBLIC_DNS=localhost:5050/user/PerilousApricot/proxy/4040/jobs/ pyspark --conf spark.ui.reverseProxyUrl=http://localhost:5000/user/PerilousApricot/proxy/4040/ --conf spark.driver.extraJavaOptions="-Dlog4j.debug=true" --conf spark.ui.proxyBase=/user/PerilousApricot/proxy/4040/ --conf spark.app.name=proxyApp

@PerilousApricot should all the ports be 5050? You are setting the reverse proxy URL as http://localhost:5000, which is confusing.

@PerilousApricot

@gengliangwang Yes, thanks for the catch. I'll update my comment (must've transposed something when copy-pasting)

dongjoon-hyun pushed a commit that referenced this pull request Apr 13, 2022
…e proxy URL

### What changes were proposed in this pull request?

When the reverse proxy URL contains "proxy" or "history", the application ID in the UI is wrongly parsed.
For example, if we set spark.ui.reverseProxyUrl to "/test/proxy/prefix" or "/test/history/prefix", the application ID is parsed as "prefix" and the related API calls fail on the stages/executors pages:
```
.../api/v1/applications/prefix/allexecutors
```
instead of
```
.../api/v1/applications/app-20220413142241-0000/allexecutors
```

There is more context in #31774.
We could fix this entirely as in #36174, but that is risky and complicated.

### Why are the changes needed?

This prevents users from setting these keywords in the reverse proxy URL and getting wrong UI results.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

A new unit test.
Also doc preview:
<img width="1743" alt="image" src="https://user-images.githubusercontent.com/1097932/163126641-da315012-aae5-45a5-a048-340a5dd6e91e.png">

Closes #36176 from gengliangwang/forbidURLPrefix.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@PerilousApricot

@MaxGekk there are some things @gengliangwang and @dongjoon-hyun would like fixed with this PR, but I'm hopeful that we can converge soon on a solution they'll accept.

Web UI does not correctly get the appId when the URL contains `proxy` or `history`.

In my case, it happened on `https://jupyterhub.hosted.our/my-name/proxy/4040/executors/`.
There is a related issue in JupyterHub: jupyterhub/jupyter-server-proxy#57

It should not get the appId from document.baseURI.
A request will occur, but the performance impact is small.
@ornew ornew force-pushed the fix-web-ui-get-correct-app-id branch from 14bd9b2 to 9548338 Compare April 16, 2022 13:05
@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@lordk911

@PerilousApricot
jupyterhub/jupyter-server-proxy#57 (comment)

Use this as a workaround.

