Conversation

ForVic (Contributor) commented Aug 18, 2025

What changes were proposed in this pull request?

This PR aims to provide a way for users to annotate the driver pod with the exception message when an application exits with an exception.

Why are the changes needed?

For jobs which run on Kubernetes there is no native concept of diagnostics (as there is in YARN), which means that for debugging and triaging errors users must go to the logs. For jobs which run on YARN this is often not necessary, since the diagnostics contain the root-cause reason for the failure. Additionally, platforms which provide automated failure insights, or make decisions based on failures, must build a custom solution for deciding why an application failed (e.g. log and stack trace parsing).

We use a mechanism similar to #23599 to load custom implementations, in order to avoid a dependency on the k8s module from SparkSubmit.
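
As a rough illustration of that mechanism, here is a minimal sketch of reflection-based loading; the trait and object names below are hypothetical, not the PR's actual identifiers:

```scala
// Hypothetical trait name for illustration; the PR's actual interface differs.
trait DriverExceptionAnnotator {
  def annotate(throwable: Throwable): Unit
}

object AnnotatorLoader {
  // Load the K8s-specific implementation by class name so that SparkSubmit
  // compiles without a compile-time dependency on the kubernetes module.
  def load(className: String): Option[DriverExceptionAnnotator] = {
    try {
      val clazz = Class.forName(className)
      Some(clazz.getDeclaredConstructor().newInstance()
        .asInstanceOf[DriverExceptionAnnotator])
    } catch {
      case _: ClassNotFoundException => None // k8s module not on the classpath
    }
  }
}
```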

Does this PR introduce any user-facing change?

Yes, a new config, which defaults to false.
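
A minimal sketch of enabling the flag, using the config name from the final PR title (the feature is off by default):

```scala
import org.apache.spark.SparkConf

// Opt in to annotating the driver pod on exceptional exit.
// The config name here reflects the final PR title.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.annotateExitException", "true")
```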

How was this patch tested?

Unit tested and verified in a production k8s cluster.

Was this patch authored or co-authored using generative AI tooling?

No

@ForVic ForVic changed the title [WIP][K8S] Optionally capture diagnostics for jobs on Kubernetes [SPARK-53335][K8S] Optionally capture diagnostics for jobs on Kubernetes Aug 20, 2025
@ForVic ForVic marked this pull request as ready for review August 20, 2025 03:50
mridulm (Contributor) commented Sep 10, 2025

+CC @dongjoon-hyun, can you please take a look?
This is something we have recently started to leverage internally.

@ForVic ForVic force-pushed the vsunderl/better_kubernetes_diagnostics branch 2 times, most recently from e1661a6 to 15f1b23 on October 16, 2025 17:16
sunchao (Member) commented Oct 16, 2025

cc @viirya @dongjoon-hyun @cloud-fan too

dongjoon-hyun (Member) commented:

Ack, @sunchao.

dongjoon-hyun (Member) left a comment:

To @ForVic, @mridulm, and @sunchao, I have two questions.

  1. Are the failed driver pods supposed to be kept for a while with the diagnosis annotation?
  2. Who is the final audience for the diagnosis? A user, or some other scraper or automated system? Could you give us some examples of how you are currently using it?

@ForVic ForVic force-pushed the vsunderl/better_kubernetes_diagnostics branch from 15f1b23 to b2b0c8b on October 16, 2025 21:37
ForVic (Contributor, Author) commented Oct 16, 2025

> To @ForVic, @mridulm, and @sunchao, I have two questions.
>
>   1. Are the failed driver pods supposed to be kept for a while with the diagnosis annotation?
>   2. Who is the final audience for the diagnosis? A user, or some other scraper or automated system? Could you give us some examples of how you are currently using it?

@dongjoon-hyun

  1. The driver pod's lifecycle is outside of Spark, so the Spark operator or external systems are free to manage it however they'd like. If they delete the pod instantly, then this implementation isn't useful to them. I've seen setups that keep a watch on the driver pod, capture the diagnostics on completion, and then delete the pod, and I've also seen setups that periodically poll completed driver pods, capture this field, and then delete them (see the sketch after this list).
  2. Both users and automation. It has improved automated tooling for things like auto-memory-scaling on OOM, where we use this to capture and classify the error. Also, when the user class throws an exception that fails the job, this makes debugging easier if we can show users the most likely error reason instead of always making them dig through the logs.
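
To make the polling pattern concrete, here is a minimal sketch of an external system reading the annotation, assuming the fabric8 Kubernetes client and the spark.exit-exception annotation key suggested later in this review; the namespace and pod name are placeholders:

```scala
import io.fabric8.kubernetes.client.KubernetesClientBuilder

object ReadExitAnnotation {
  def main(args: Array[String]): Unit = {
    val client = new KubernetesClientBuilder().build()
    try {
      // Placeholder namespace and driver pod name.
      val pod = client.pods()
        .inNamespace("spark-jobs")
        .withName("my-app-driver")
        .get()
      // The annotation key is the one suggested in this review; it may be
      // absent if the application succeeded or the feature is disabled.
      val reason = Option(pod.getMetadata.getAnnotations)
        .flatMap(m => Option(m.get("spark.exit-exception")))
      reason.foreach(r => println(s"Driver failed with: $r"))
      // An external controller could classify `reason` (e.g. detect an OOM)
      // before deleting the completed pod.
    } finally {
      client.close()
    }
  }
}
```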

dongjoon-hyun (Member) left a comment:

  1. For feature naming, "diagnostics" sounds like an over-claim for this feature. Can we use a more technical name instead? Technically, this feature adds a simple annotation whose content is a length-limited prefix of the stringified Throwable; there is no sophisticated diagnostic activity here.

```scala
SparkStringUtils.abbreviate(StringUtils.stringifyException(throwable), KUBERNETES_DIAGNOSTICS_MESSAGE_LIMIT_BYTES)
```

  2. For annotation naming, I agree that a spark. prefix might be needed to distinguish it from other annotations. However, do we need the kubernetes- prefix? Technically, the main content is Spark's internal exception rather than anything from the K8s control plane, isn't it? Some errors might come from K8s, but the name looks misleading to me because it implies K8s-only diagnostics.

In short, I'd like to recommend revising as follows, @ForVic:

  • A config name like spark.kubernetes.driver.annotateExitException might be more accurate for what this PR proposes.
  • An annotation name like spark.exception or spark.exit-exception would work the same way.
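
For reference, a minimal sketch of what the quoted truncation amounts to, assuming Apache Commons Lang's StringUtils.abbreviate and Hadoop's StringUtils.stringifyException as stand-ins for Spark's internal helpers; the 2048 limit is a placeholder, not the PR's actual KUBERNETES_DIAGNOSTICS_MESSAGE_LIMIT_BYTES value:

```scala
import org.apache.commons.lang3.{StringUtils => LangStringUtils}
import org.apache.hadoop.util.{StringUtils => HadoopStringUtils}

// Stringify the full stack trace, then truncate it to a bounded length so
// the annotation value stays small. The 2048 limit is illustrative only.
def annotationValue(throwable: Throwable, limit: Int = 2048): String =
  LangStringUtils.abbreviate(HadoopStringUtils.stringifyException(throwable), limit)
```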

ForVic (Contributor, Author) commented Oct 24, 2025

> In short, I'd like to recommend revising as follows, @ForVic:
>
>   • A config name like spark.kubernetes.driver.annotateExitException might be more accurate for what this PR proposes.
>   • An annotation name like spark.exception or spark.exit-exception would work the same way.

Sure, my intention with the name "diagnostics" was to match the Spark-on-YARN behavior, which we refer to as diagnostics in Spark, but I have no problem changing the name. On the other point, I agree; that makes sense.

dongjoon-hyun (Member) commented:
Thank you for revising this PR, @ForVic. I believe we have almost reached the final stage. Hopefully, we can merge your PR this week.

dongjoon-hyun (Member) left a comment:

+1, LGTM (pending CIs). Thank you for updating the PR, @ForVic.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-53335][K8S] Optionally capture diagnostics for jobs on Kubernetes [SPARK-53335][K8S] Support spark.kubernetes.driver.annotateExitException Oct 24, 2025
dongjoon-hyun (Member) commented:

Merged to master for Apache Spark 4.1.0-preview3.

Thank you, @ForVic, @mridulm, @sunchao, @viirya.
