
Conversation

@devaraj-kavali

What changes were proposed in this pull request?

When mesosDriver.run() returns with the status DRIVER_ABORTED, throw an exception that can be handled by SparkUncaughtExceptionHandler to shut down the dispatcher.

How was this patch tested?

I verified it manually; the driver thread throws an exception when mesosDriver.run() returns with the DRIVER_ABORTED status.
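The proposed behavior can be sketched as follows. This is a minimal, self-contained model, not the actual Spark code: Status, SparkException, and runDriver here are stand-ins for the Mesos and Spark types.

```scala
object DriverAbortSketch {
  sealed trait Status
  case object DRIVER_STOPPED extends Status
  case object DRIVER_ABORTED extends Status

  final class SparkException(msg: String) extends RuntimeException(msg)

  // Stand-in for mesosDriver.run(); the real call blocks until the driver exits.
  private def runDriver(abort: Boolean): Status =
    if (abort) DRIVER_ABORTED else DRIVER_STOPPED

  // Instead of letting the thread die silently on DRIVER_ABORTED, throw so an
  // uncaught-exception handler (SparkUncaughtExceptionHandler in Spark) can
  // shut the dispatcher down.
  def startDriver(abort: Boolean): Unit = {
    val status = runDriver(abort)
    if (status == DRIVER_ABORTED) {
      throw new SparkException(s"Error starting driver, $status")
    }
  }
}
```

A clean exit leaves startDriver silent; only the aborted status escalates to an exception.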

@tnachen
Contributor

tnachen commented Jun 2, 2016

ok to test

@SparkQA

SparkQA commented Jun 2, 2016

Test build #59856 has finished for PR 13143 at commit c16fb5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  error = Some(new SparkException("Error starting driver, DRIVER_ABORTED"))
  markErr()
  val ex = new SparkException("Error starting driver, DRIVER_ABORTED")
  // if the driver gets aborted after the successful registration

s/after the successful registration/after registration/g


Also, to simplify the code, can we just throw SparkException here? Then the catch will handle all cases.

@tnachen
Contributor

tnachen commented Jun 22, 2016

Is it because MesosDriver actually threw an exception?

@devaraj-kavali
Author

devaraj-kavali commented Jun 22, 2016

MesosDriver doesn't throw any exception; it just returns with the value Status.DRIVER_ABORTED.

  registerLatch.await()

  // propagate any error to the calling thread. This ensures that SparkContext creation fails
  // without leaving a broken context that won't be able to schedule any tasks
  error.foreach(throw _)

This code handles exceptions and throws if it gets Status.DRIVER_ABORTED during registration. Once registration completes, there is no code to handle the status, so it is skipped and the thread dies.
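The registration flow being discussed can be modeled with a small sketch. This is a simplification, not the Spark source; only the latch-and-error pattern from the quoted snippet is preserved, and the names markErr and waitForRegister are illustrative.

```scala
import java.util.concurrent.CountDownLatch

object RegistrationSketch {
  final class SparkException(msg: String) extends RuntimeException(msg)

  private val registerLatch = new CountDownLatch(1)
  @volatile private var error: Option[Exception] = None

  // Driver thread: record the failure and release anyone waiting on
  // registration (mirrors the markErr() call in the diff above).
  def markErr(msg: String): Unit = {
    error = Some(new SparkException(msg))
    registerLatch.countDown()
  }

  // Calling thread: block until registration finishes (or fails), then
  // rethrow any recorded error so SparkContext creation fails cleanly
  // instead of leaving a broken context that cannot schedule tasks.
  def waitForRegister(): Unit = {
    registerLatch.await()
    error.foreach(e => throw e)
  }
}
```

Note this only covers failures during registration; as described above, a DRIVER_ABORTED status after registration has no equivalent handling path.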

@SparkQA

SparkQA commented Dec 28, 2016

Test build #70648 has finished for PR 13143 at commit 946ad7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@devaraj-kavali
Author

@tnachen Can you check this?

@devaraj-kavali
Author

@mgummelt/@tnachen, can you have a look at this?

@mgummelt

This whole function is designed poorly. We need to change it entirely instead of tacking this on. We shouldn't be calling driver.run() in a separate thread; we should be calling driver.start().
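The distinction being drawn can be illustrated with a toy driver; this is an assumption-laden stand-in, not the MesosSchedulerDriver API. The point is that run() blocks the calling thread until the driver exits, which is why the current code wraps it in a thread, whereas start() returns immediately.

```scala
import java.util.concurrent.CountDownLatch

object StartVsRunSketch {
  // Toy driver; the method names mirror the SchedulerDriver contract,
  // and the blocking behavior of run() is simulated with a latch.
  final class ToyDriver {
    private val exited = new CountDownLatch(1)
    def start(): String = "DRIVER_RUNNING"                   // returns at once
    def run(): String = { exited.await(); "DRIVER_STOPPED" } // blocks until stop()
    def stop(): Unit = exited.countDown()
  }
}
```

With start(), the caller keeps its own thread and no wrapper thread is needed; any abort can then be surfaced through a callback rather than a blocked run() return value.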

@devaraj-kavali
Author

Thanks @mgummelt for the feedback, will update the PR with the function rewrite.

@SparkQA

SparkQA commented Apr 18, 2017

Test build #75907 has finished for PR 13143 at commit 131edc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 18, 2017

Test build #75908 has finished for PR 13143 at commit b9893a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@skonto
Contributor

skonto commented Jul 27, 2017

@ArtRand @susanxhuynh could you please review this before we call for a merge?

@ArtRand

ArtRand commented Sep 21, 2017

@skonto @susanxhuynh @devaraj-kavali are people still interested in this? I was just playing around with this code to clean up ZK state. Would be happy to try this when I have a few cycles.

@devaraj-kavali
Author

@ArtRand I think this is still an issue and this PR needs to be merged. Do you have any observations on this PR?

@ArtRand

ArtRand commented Sep 23, 2017

Hello @devaraj-kavali. Yes. I've been playing around with this because it's inconvenient to clean up ZK whenever you uninstall/reinstall the Dispatcher. The problem is that the only signal of a re-install vs. a failover is when Mesos gives you a DRIVER_ABORTED error (which signals a re-install). I think a solution will involve having different methods for starting SchedulerDrivers for Spark Drivers (applications) and for the Dispatcher, because they inherently have different requirements with respect to state.

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

ping @devaraj-kavali for @ArtRand's comment above.

