
Description
After #138 the exit code of the rest server (and thus the driver pod) is the exit code of the Spark driver. This means the driver pod now exits with a non-zero code more frequently.
As @kimoonkim notes in #135 (comment), this means k8s now restarts the driver pod on failure, since the pod's restart policy is set to OnFailure. But the restarted driver pod never receives a submission from the launcher, because we haven't built that logic in yet.
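For concreteness, here is a minimal sketch of the exit-code propagation, assuming the rest server waits on the driver as a child process; the object name and driver command below are placeholders, not the actual code from #138:

```scala
// Hypothetical sketch: how the rest server's exit code tracks the driver's.
object RestServerExitSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder driver command; the real rest server launches the driver differently.
    val driver = new ProcessBuilder("bin/run-spark-driver").inheritIO().start()
    val exitCode = driver.waitFor()
    // Propagating the driver's code means a failed job yields a non-zero pod
    // exit, which the OnFailure restart policy treats as grounds to restart.
    System.exit(exitCode)
  }
}
```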
So we need to either:
- build re-launch logic into the launcher, so it resubmits after a driver pod failure and restart
- turn off driver pod restart so the launcher shuts down cleanly (though we'll want an error message saying the Spark job failed; see the sketch below)
I'd prefer option 2 now, and to pursue option 1 in the future. A richer version of option 1 could also create a k8s Job resource for the driver, so we interact with k8s at a higher level than directly on pods.
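To sketch what option 2 might look like, assuming the fabric8 kubernetes-client this project already uses; the pod name, image, and namespace below are placeholders:

```scala
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}
import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException, Watcher}

object LauncherSketch {
  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()

    // Create the driver pod with restartPolicy Never, so a failed driver stays
    // failed instead of being silently restarted without a submission.
    val driverPod = new PodBuilder()
      .withNewMetadata().withName("spark-driver").endMetadata()
      .withNewSpec()
        .withRestartPolicy("Never")
        .addNewContainer()
          .withName("spark-driver")
          .withImage("spark-driver:latest") // placeholder image
        .endContainer()
      .endSpec()
      .build()
    client.pods().inNamespace("default").create(driverPod)

    // Watch for the terminal Failed phase and report it, so the launcher can
    // shut down cleanly with an error message instead of hanging.
    client.pods().inNamespace("default").withName("spark-driver")
      .watch(new Watcher[Pod] {
        override def eventReceived(action: Watcher.Action, pod: Pod): Unit = {
          if (pod.getStatus.getPhase == "Failed") {
            System.err.println("Spark job failed: driver pod exited non-zero")
          }
        }
        override def onClose(cause: KubernetesClientException): Unit = ()
      })
  }
}
```

With restartPolicy set to Never, the pod's Failed phase is terminal, which is what lets the launcher treat it as a definitive job failure rather than a transient state.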