[SPARK-47954][K8S] Support creating ingress entry for external UI access #46184
Conversation
+CC @zhouyejoe
...etes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala
cc @dongjoon-hyun @yaooqinn @LuciferYang @EnricoMi, appreciate it if you could have a look at this idea
Please don't do that. This is a huge regression for the existing system because it reduces the space available for the service name.
Makes sense; we can generate the ingress name independently to avoid such a regression.
...rce-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
...rce-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala
Since this is not a bug, please move this logInfo out of this PR, with a different message.
Instead of warning like this, please check it at line 37 together and proceed with a no-op.
This return statement implies that we might have too complex a code path here.
Can we provide a way for the Spark job to run successfully even if DriverIngressFeatureStep and its K8s resources fail? IIUC, if something goes wrong, the Spark job is not started, is it?
That's a good point.
Yes, any failure during resource creation will fail the job submission at the current stage, and I did see ingress creation failure cases internally. We mitigate it by providing a switch that allows users to disable creating ingress for certain jobs via spark.kubernetes.driver.ingress.enabled.
Currently, we create resources in batches:
Lines 161 to 181 in 48e207f
// Refresh all pre-resources' owner references
try {
  addOwnerReference(createdDriverPod, preKubernetesResources)
  kubernetesClient.resourceList(preKubernetesResources: _*).forceConflicts().serverSideApply()
} catch {
  case NonFatal(e) =>
    kubernetesClient.pods().resource(createdDriverPod).delete()
    kubernetesClient.resourceList(preKubernetesResources: _*).delete()
    throw e
}
// setup resources after pod creation, and refresh all resources' owner references
try {
  val otherKubernetesResources = resolvedDriverSpec.driverKubernetesResources ++ Seq(configMap)
  addOwnerReference(createdDriverPod, otherKubernetesResources)
  kubernetesClient.resourceList(otherKubernetesResources: _*).forceConflicts().serverSideApply()
} catch {
  case NonFatal(e) =>
    kubernetesClient.pods().resource(createdDriverPod).delete()
    throw e
}
If we want to make some resources optional, we need to add a new batch to handle them independently. Meanwhile, a new method may need to be added to KubernetesFeatureConfigStep, something like:
trait KubernetesFeatureConfigStep {
...
def getAdditionalOptionalKubernetesResources(): Seq[HasMetadata] = Seq.empty
}
@dongjoon-hyun WDYT?
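Purely as an illustration of the proposal above (not code from this PR), here is a sketch of how a feature step could surface the Ingress through such an optional-resources hook, assuming that hook had been added to the trait; the class name, constructor, and ingress naming are assumptions:

import io.fabric8.kubernetes.api.model.HasMetadata
import io.fabric8.kubernetes.api.model.networking.v1.IngressBuilder

import org.apache.spark.deploy.k8s.{KubernetesDriverConf, SparkPod}
import org.apache.spark.deploy.k8s.features.KubernetesFeatureConfigStep

// Sketch only: return the ingress from the proposed optional hook instead of the
// existing getAdditionalKubernetesResources(), so that a creation failure could be
// tolerated without failing the submission.
class DriverIngressFeatureStep(conf: KubernetesDriverConf) extends KubernetesFeatureConfigStep {

  // The ingress itself does not modify the driver pod.
  override def configurePod(pod: SparkPod): SparkPod = pod

  // Proposed hook from the snippet above; it does not exist upstream yet.
  override def getAdditionalOptionalKubernetesResources(): Seq[HasMetadata] = {
    val ingress = new IngressBuilder()
      .withNewMetadata()
        .withName(s"${conf.resourceNamePrefix}-ingress")
      .endMetadata()
      .build()
    Seq(ingress)
  }
}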
On the other hand, silently ignoring a resource that the user requested may confuse users. For example, if something goes wrong during Spark UI startup, we fail the job immediately.
We need a decision here.
Option 1: provide a switch spark.kubernetes.driver.ingress.enabled that lets the user explicitly enable/disable ingress, and fail the job if ingress creation fails.
Pros:
- behavior is consistent with spark.ui.enabled;
- if ingress creation fails, the user knows what happened rather than wondering why the Spark UI is unavailable;
- the code change is simple.
Cons:
- ingress is actually an optional resource; technically, Spark could run well without it.
Option 2: treat ingress as an optional resource and ignore ingress creation failures.
Pros:
- keeps the same stability for Spark job submission even with ingress enabled.
Cons:
- the ingress resource is not reliable even when the user enables ingress explicitly;
- we need to touch the developer interface to achieve this (there is one proposal above).
Internally, we adopted option 1, and some users explicitly disabled the ingress because they didn't want the job submission to fail due to an ingress creation failure.
@dongjoon-hyun @yaooqinn @mridulm @LuciferYang please let me know which one you prefer.
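For concreteness, a minimal sketch of how the Option 1 switch could be declared in the K8s config module; the entry name comes from this thread, while the default value, doc text, and version are assumptions rather than what the PR necessarily contains:

import org.apache.spark.internal.config.ConfigBuilder

// Sketch of the Option 1 switch; default, doc, and version are assumptions.
val KUBERNETES_DRIVER_INGRESS_ENABLED =
  ConfigBuilder("spark.kubernetes.driver.ingress.enabled")
    .doc("Whether to create a K8s Ingress exposing the driver UI in cluster mode. " +
      "Under Option 1 semantics, an ingress creation failure fails the job submission.")
    .version("4.0.0")
    .booleanConf
    .createWithDefault(false)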
To @pan3793, I'm very careful about this PR because we consider having a K8s Ingress more vulnerable than having only the Spark UI, since it increases the exposure boundary. The Spark UI itself is not open to other pods if you carefully design the isolation of Spark pod-to-pod communication.
In addition, some K8s implementations (and their security backends) increase K8s Ingress creation time. I expected a few minutes of delay just for the K8s ingress entity creation. That could also mean a delay in the Spark job time if we design it to wait for the K8s ingress.
Since we verified that the deletion is handled correctly by K8s ownership, let's double-check the following:
- What is the job start time for Option 1? Is there no delay?
- If there is a delay, is there a way to mitigate it?
- If there is no delay because it is handled separately, what happens during the time gap after the job has already started but the K8s Ingress is still under construction? Are we in the same situation as Option 2's Cons 1?
I expected a few minutes of delay just for the K8s ingress entity creation.
That surprises me; in our cluster, the overhead of ingress creation is negligible.
I tested with ingress enabled/disabled 5 times in an internal cluster to measure the job submission time as spark.app.startTime - spark.app.submitTime, based on the implementation of option 1 (values in milliseconds):
ingress enabled: 20715, 24954, 17129, 16631, 16076
ingress disabled: 20246, 20262, 17176, 18633, 19962
The difference is mostly caused by pod creation; the ingress creation overhead is negligible.
If ingress creation is too heavy, option 2 is the only choice.
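For reference, a minimal sketch of how the submission-to-start latency quoted above can be computed inside the driver, assuming both properties are present in the driver's SparkConf (as they are for spark-submit launched applications):

import org.apache.spark.sql.SparkSession

// Submission-to-start latency (milliseconds), computed from the same two
// properties used for the numbers above.
val spark = SparkSession.builder().getOrCreate()
val conf = spark.sparkContext.getConf
val submitToStartMs =
  conf.get("spark.app.startTime").toLong - conf.get("spark.app.submitTime").toLong
println(s"submission-to-start latency: $submitToStartMs ms")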
I believe we don't need this because KUBERNETES_INGRESS_HOST_PATTERN is always mandatory. Shall we remove this redundant configuration?
May I ask when you would want to override this? And is it safe in terms of security?
Internally, we only use path / with pathType ImplementationSpecific; let me hardcode it and make it configurable if someone has a real request in the future.
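For illustration, a sketch of what that hardcoded rule might look like with the fabric8 networking/v1 builders; the service name and port are placeholders, and the exact builder chain is my assumption about the fabric8 API rather than code from the PR:

import io.fabric8.kubernetes.api.model.networking.v1.HTTPIngressPathBuilder

// Single hardcoded rule: path "/" with pathType ImplementationSpecific,
// routing to the driver UI service (name and port are placeholders).
val driverServiceName = "spark-driver-svc"   // placeholder
val uiPort = 4040                            // default Spark UI port

val uiPath = new HTTPIngressPathBuilder()
  .withPath("/")
  .withPathType("ImplementationSpecific")
  .withNewBackend()
    .withNewService()
      .withName(driverServiceName)
      .withNewPort().withNumber(uiPort).endPort()
    .endService()
  .endBackend()
  .build()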
It seems that there is no owner setting here. Could you double-check whether this increases the number of ingresses monotonically? I'm wondering who removes this ingress, and when.
Yes, the OwnerReference is handled at
Line 175 in 48e207f
addOwnerReference(createdDriverPod, otherKubernetesResources)
I checked that deleting the driver pod removes the ingress too.
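For context, a rough sketch of what that owner-reference wiring amounts to, modeled on the upstream KubernetesUtils.addOwnerReference helper (the exact upstream code may differ): every extra resource, including the ingress, points at the driver pod, so K8s garbage collection deletes it together with the pod.

import java.util.Collections

import io.fabric8.kubernetes.api.model.{HasMetadata, OwnerReferenceBuilder, Pod}

// Make the driver pod the owner of each resource so that deleting the pod
// cascades to the ingress (and the other driver-owned resources).
def addDriverOwnerReference(driverPod: Pod, resources: Seq[HasMetadata]): Unit = {
  val ownerRef = new OwnerReferenceBuilder()
    .withName(driverPod.getMetadata.getName)
    .withApiVersion(driverPod.getApiVersion)
    .withUid(driverPod.getMetadata.getUid)
    .withKind(driverPod.getKind)
    .withController(true)
    .build()
  resources.foreach(_.getMetadata.setOwnerReferences(Collections.singletonList(ownerRef)))
}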
Thank you for the confirmation.
dongjoon-hyun
left a comment
Hi, @pan3793. I left a few comments.
Gentle ping, @pan3793. It would be great if we can land this before Apache Spark 4.0.0-preview. To @cloud-fan, just a question: when do you think you are going to cut
@dongjoon-hyun sorry for the late reply, I'm a little busy these days; I will address the comments soon.
This reverts commit b45bb1688e07b1f7a3639c1015a337d3160644b8.
dongjoon-hyun
left a comment
Also, please address this.
We don't need spark.kubernetes.driver.ingress.enabled because spark.kubernetes.driver.ingress.host carries the same semantics.
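In other words, the presence of the host setting can itself act as the on/off switch. A minimal sketch of that idea, assuming the config key discussed in this thread (the real PR may wire this up differently):

import org.apache.spark.SparkConf

// Sketch only: derive "ingress enabled" from whether a host was configured,
// instead of keeping a separate boolean flag.
def ingressEnabled(sparkConf: SparkConf): Boolean =
  sparkConf.getOption("spark.kubernetes.driver.ingress.host").isDefined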
For the record, Apache Spark 4.0.0-preview is scheduled for next Monday.
Just FYI, please take your time. We can target this for Apache Spark 4.0.0.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
@pan3793 Is there any new progress in this PR? To access the URL http://hostname:port/proxy/[applicationId], add the conf spark.driver.extraJavaOptions="-Dspark.ui.proxyBase=/proxy/applicationId". Example:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: spark-app-ingress
  namespace: superior
spec:
  ingressClassName: nginx
  rules:
  - host: example.com
    http:
      paths:
      - path: /proxy/spark-1745724044946
        pathType: Prefix
        backend:
          service:
            name: spark-1745724044946-driver-svc
            port:
              number: 4040
With this setup, accessing http://hostname:port/proxy/spark-1745724044946 gets redirected to http://hostname:port/jobs and the UI fails to load, because SparkContext initializes the UI with an empty base path.
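A minimal sketch of the same workaround expressed directly as Spark configuration, equivalent to the -Dspark.ui.proxyBase JVM option above (the application id is just the one from the example):

import org.apache.spark.SparkConf

// Make the UI aware of the ingress path prefix so redirects stay under /proxy/<appId>.
// The path must match the Ingress rule; the app id here is illustrative.
val conf = new SparkConf()
  .set("spark.ui.proxyBase", "/proxy/spark-1745724044946")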
What changes were proposed in this pull request?
Ingress is a kind of K8s resource used to expose HTTP and HTTPS routes from outside the K8s cluster to services within the cluster. It is a common use case to use an ingress to access the Spark UI of Spark jobs running on K8s in cluster mode.
This PR aims to add such a built-in feature, allowing users to enable an Ingress for Spark jobs running in K8s cluster mode.
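A rough sketch of how the proposed feature would be used from an application's point of view; the config names come from this conversation, while the host value and its exact format are illustrative assumptions:

import org.apache.spark.SparkConf

// Hypothetical usage of the proposed configs for cluster-mode jobs on K8s.
// The "spark.kubernetes.driver.ingress.host" value format is assumed for illustration.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.ingress.enabled", "true")
  .set("spark.kubernetes.driver.ingress.host", "spark-ui.example.com")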
State: functionality completed; manually tested; ready for review to collect feedback
TODO: UT and docs
Why are the changes needed?
Simplify access to the live Spark UI from outside of the K8s cluster.
Does this PR introduce any user-facing change?
Yes, this PR adds a new feature, but it is disabled by default.
How was this patch tested?
I will supply the unit tests later.
Manually tested on an internal K8s cluster.
Was this patch authored or co-authored using generative AI tooling?
No.