
Description
After #138 the exit code of the rest server (and thus the driver pod) is the exit code of the Spark driver. This means the driver pod now exits with a non-zero code more frequently.
As @kimoonkim notes in #135 (comment), this means k8s now restarts the driver pod on failure, since the pod's restart policy is set to OnFailure. But the restarted driver pod never receives a submission from the launcher, because we haven't built that logic in yet.
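For concreteness, here is a minimal sketch of the exit-code propagation, assuming the rest server waits on the driver as a child process; the object name and driver command below are placeholders, not the actual code from #138:

```scala
// Hypothetical sketch: how the rest server's exit code tracks the driver's.
object RestServerExitSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder driver command; the real rest server launches the driver differently.
    val driver = new ProcessBuilder("bin/run-spark-driver").inheritIO().start()
    val exitCode = driver.waitFor()
    // Propagating the driver's code means a failed job yields a non-zero pod
    // exit, which the OnFailure restart policy treats as grounds to restart.
    System.exit(exitCode)
  }
}
```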
So we need to either:
- build re-launch logic into the launcher, so it resubmits after a driver pod failure and restart
- turn off driver pod restart so the launcher shuts down cleanly (though we'll want an error message saying the Spark job failed; see the sketch below)
I'd prefer option 2 now, and to pursue option 1 in the future. A richer version of option 1 could also create a k8s Job resource for the driver, so we interact with k8s at a higher level than directly on pods.
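To sketch what option 2 might look like, assuming the fabric8 kubernetes-client this project already uses; the pod name, image, and namespace below are placeholders:

```scala
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}
import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException, Watcher}

object LauncherSketch {
  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()

    // Create the driver pod with restartPolicy Never, so a failed driver stays
    // failed instead of being silently restarted without a submission.
    val driverPod = new PodBuilder()
      .withNewMetadata().withName("spark-driver").endMetadata()
      .withNewSpec()
        .withRestartPolicy("Never")
        .addNewContainer()
          .withName("spark-driver")
          .withImage("spark-driver:latest") // placeholder image
        .endContainer()
      .endSpec()
      .build()
    client.pods().inNamespace("default").create(driverPod)

    // Watch for the terminal Failed phase and report it, so the launcher can
    // shut down cleanly with an error message instead of hanging.
    client.pods().inNamespace("default").withName("spark-driver")
      .watch(new Watcher[Pod] {
        override def eventReceived(action: Watcher.Action, pod: Pod): Unit = {
          if (pod.getStatus.getPhase == "Failed") {
            System.err.println("Spark job failed: driver pod exited non-zero")
          }
        }
        override def onClose(cause: KubernetesClientException): Unit = ()
      })
  }
}
```

With restartPolicy set to Never, the pod's Failed phase is terminal, which is what lets the launcher treat it as a definitive job failure rather than a transient state.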