[SPARK-53335][K8S] Support spark.kubernetes.driver.annotateExitException
#52068
Conversation
+CC @dongjoon-hyun Can you please take a look?
Force-pushed from e1661a6 to 15f1b23.
cc @viirya @dongjoon-hyun @cloud-fan too
Ack, @sunchao.
dongjoon-hyun left a comment
To @ForVic, @mridulm and @sunchao, I have two questions.
- Are the failed driver pods supposed to be kept for a while with the diagnosis annotation?
- Who is the final audience that will see the diagnosis: a user, or some other scraper or automated system? Could you give us some examples of what you are currently using?
Force-pushed from 15f1b23 to b2b0c8b.
Three review threads on ...netes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkKubernetesDiagnosticsSetter.scala (outdated, resolved).
dongjoon-hyun left a comment
- For feature naming, `diagnostics` sounds like an over-claim for this feature. Can we use a more technical name instead? Technically, this feature adds a naive annotation whose content is a limited prefix of the message of `Throwable`; there is no sophisticated diagnostic activity here (a sketch of this truncation follows below):
  `SparkStringUtils.abbreviate(StringUtils.stringifyException(throwable), KUBERNETES_DIAGNOSTICS_MESSAGE_LIMIT_BYTES)`
- For annotation naming, I agree that the `spark.` prefix might be needed to distinguish it from other annotations. However, do we need the `kubernetes-` prefix? Technically, the main content is Spark's internal exception rather than anything from the K8s control plane, isn't it? Some errors might come from K8s, but the name looks very misleading to me because it implies only K8s-related diagnostics.
In short, I'd like to recommend revising as follows, @ForVic:
- The `spark.kubernetes.driver.annotateExitException` config name might be more accurate and proper for what this PR proposes.
- A `spark.exception` or `spark.exit-exception` annotation name would be more accurate in the same way.
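For reference, a minimal sketch of the truncation the snippet above describes, using Commons Lang's `StringUtils.abbreviate` and Hadoop's `StringUtils.stringifyException` as stand-ins for the quoted helpers; the object name and limit value here are hypothetical, not the PR's actual constants:

```scala
import org.apache.commons.lang3.{StringUtils => LangStringUtils}
import org.apache.hadoop.util.{StringUtils => HadoopStringUtils}

object ExitExceptionMessage {
  // Hypothetical limit; the PR defines its own size-limit constant.
  private val MessageLimitChars = 4096

  // Stringify the full stack trace, then truncate it to a bounded length
  // so the resulting pod annotation stays reasonably small.
  def apply(throwable: Throwable): String =
    LangStringUtils.abbreviate(
      HadoopStringUtils.stringifyException(throwable), MessageLimitChars)
}
```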
Sure, my intention with the name
A further review thread on ...netes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkKubernetesDiagnosticsSetter.scala (outdated, resolved).
Thank you for revising this PR, @ForVic. I believe we have almost reached the final stage. Hopefully, we can merge your PR this week.
dongjoon-hyun left a comment
+1, LGTM (Pending CIs). Thank you for updating the PR, @ForVic .
What changes were proposed in this pull request?
This PR aims to provide a way for users to annotate the driver pod with the exception message when an application exits with an exception, controlled by the new `spark.kubernetes.driver.annotateExitException` configuration.
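As an illustration only, not the PR's actual implementation, the sketch below shows how a driver could apply such an annotation to its own pod with the fabric8 client that the K8s module already uses; the annotation key, object name, and pod lookup are assumptions:

```scala
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}
import io.fabric8.kubernetes.client.KubernetesClientBuilder

object DriverPodAnnotator {
  // Hypothetical annotation key; the final key was discussed during review.
  private val AnnotationKey = "spark.exit-exception"

  def annotate(namespace: String, driverPodName: String, message: String): Unit = {
    val client = new KubernetesClientBuilder().build()
    try {
      // Patch the driver pod's metadata with the exception message.
      client.pods()
        .inNamespace(namespace)
        .withName(driverPodName)
        .edit(new java.util.function.UnaryOperator[Pod] {
          override def apply(pod: Pod): Pod =
            new PodBuilder(pod)
              .editMetadata()
                .addToAnnotations(AnnotationKey, message)
              .endMetadata()
              .build()
        })
    } finally {
      client.close()
    }
  }
}
```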
Why are the changes needed?
For jobs that run on Kubernetes there is no native concept of diagnostics (as there is in YARN), which means that for debugging and triaging errors users must go to the logs. For jobs that run on YARN this is often not necessary, since the diagnostics contain the root-cause reason for the failure. Additionally, platforms that provide automated failure insights, or make decisions based on failures, must build a custom solution for deciding why the application failed (e.g. log and stack-trace parsing).
We use a mechanism similar to that of #23599 to load custom implementations, in order to avoid a dependency on the k8s module from SparkSubmit.
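A rough sketch of that loading pattern, under assumed names: SparkSubmit would reference only a generic trait and look up the K8s implementation reflectively. The trait below is hypothetical; the implementation class name is taken from the file path reviewed above and may differ in the merged code:

```scala
// Hypothetical trait that would live outside the k8s module.
trait SparkDiagnosticsSetter {
  def setDiagnostics(throwable: Throwable): Unit
}

object SparkDiagnosticsSetter {
  // Loaded reflectively so that SparkSubmit has no compile-time k8s dependency.
  private val K8sImpl =
    "org.apache.spark.deploy.k8s.SparkKubernetesDiagnosticsSetter"

  def load(): Option[SparkDiagnosticsSetter] =
    try {
      Some(Class.forName(K8sImpl)
        .getDeclaredConstructor()
        .newInstance()
        .asInstanceOf[SparkDiagnosticsSetter])
    } catch {
      case _: ClassNotFoundException | _: NoClassDefFoundError => None
    }
}
```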
Does this PR introduce any user-facing change?
Yes, a new config, which defaults to false.
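For example, a user could opt in from application code (or pass the same setting via `--conf` on spark-submit); the config name follows the PR title:

```scala
import org.apache.spark.SparkConf

// Opt in to annotating the driver pod with the exit exception; off by default.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.annotateExitException", "true")
```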
How was this patch tested?
Unit tested and verified in a production K8s cluster.
Was this patch authored or co-authored using generative AI tooling?
No