@tomwhite
Member

This allows clients to retrieve the original exception from the
cause field of the SparkException that is thrown by the driver.
If the original exception is not in fact Serializable then it will
not be returned, but the message and stacktrace will be. (All Java
Throwables implement the Serializable interface, but this is no
guarantee that a particular implementation can actually be
serialized.)
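
The caveat in parentheses can be made concrete with a minimal sketch. It uses only the JDK, not Spark: `Handle`, `BadException`, `serializableOrNone` and `driverException` are hypothetical names invented for illustration, and `RuntimeException` stands in for the driver-side `SparkException`. It is not the code in this patch.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object CausePropagationSketch {
  // A field type that does not implement java.io.Serializable.
  class Handle(val id: Int)

  // Serializable by declaration (via Throwable), but the Handle field
  // makes Java serialization fail at runtime.
  class BadException(msg: String, val handle: Handle) extends Exception(msg)

  // Returns the throwable only if Java serialization actually succeeds.
  def serializableOrNone(t: Throwable): Option[Throwable] = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(t)
      Some(t)
    } catch {
      case _: NotSerializableException => None
    } finally {
      out.close()
    }
  }

  // Hypothetical stand-in for the driver-side exception: the message is
  // always kept, the cause only when it passed the check above.
  def driverException(t: Throwable): RuntimeException =
    new RuntimeException(s"Job aborted: ${t.getMessage}", serializableOrNone(t).orNull)
}
```

A client would then inspect `getCause` on the thrown exception and handle the `null` case, since the cause is absent whenever the original exception could not be serialized.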

@srowen
Member

srowen commented Jun 25, 2015

OK to test

@SparkQA

SparkQA commented Jun 25, 2015

Test build #961 has finished for PR 7014 at commit fc484b9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

nit: multiline method defs should have each arg on a separate line, all indented 4 extra spaces. Also include the return type (I know it wasn't there before but as long as you're touching this ...)

private[scheduler] def handleTaskSetFailed(
    taskSet: TaskSet,
    reason: String,
    exception: Option[Throwable] = None): Unit = {

@squito
Contributor

squito commented Jun 25, 2015

@tomwhite this looks great! thanks so much for working on this, I think it will be a really good addition. I left some minor style comments in addition to the ones the automatic checker found, but overall this seems close.

@tomwhite
Member Author

@squito thanks for the review! I've addressed your comments in a new commit.

Contributor

nit: order alphabetically {File, NotSerializableException}

@squito
Contributor

squito commented Jun 26, 2015

a couple of minor comments, aside from that just waiting to see if the tests pass

@SparkQA

SparkQA commented Jun 26, 2015

Test build #965 has finished for PR 7014 at commit 44cb266.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskSetFailed(taskSet: TaskSet, reason: String, exception: Option[Throwable])

Contributor

Will this break pattern matching on this class? Is it possible to add an unapply method to preserve matching for the old signature? I'm rusty on the details of backwards compatibility for case classes.

Member Author

@pwendell You're right - it will break pattern matching on the class. My understanding is that an unapply method won't help since pattern matching won't use it (they are for user code).

Case classes don't play well with binary compatibility, it seems. To do this compatibly, we'd have to have another case class, called ExceptionFailureWithCause say, and a trait that both it and ExceptionFailure extend with the common fields. Then everywhere that handles ExceptionFailure would also have to handle ExceptionFailureWithCause.

Having said all that, this class is marked @DeveloperApi so it's within the contract to change it. The fullStackTrace field was added last November, for example. I can understand the general reluctance to change code even if it is marked as being for developers only, but it's not clear if the workaround here to preserve binary compatibility is worth the complexity it adds.
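
For reference, the workaround described above might look roughly like the following. This is only a sketch with simplified, assumed fields: the real `ExceptionFailure` has more of them, and `FailureInfo` is a hypothetical name for the shared trait.

```scala
object CompatSketch {
  // Common fields shared by both shapes (simplified for illustration).
  trait FailureInfo {
    def className: String
    def description: String
  }

  // Old shape left untouched, so existing patterns keep compiling.
  case class ExceptionFailure(className: String, description: String) extends FailureInfo

  // New shape carrying the extra field.
  case class ExceptionFailureWithCause(
      className: String,
      description: String,
      cause: Option[Throwable]) extends FailureInfo

  // The cost: every handler now needs a branch for each class.
  def describe(reason: FailureInfo): String = reason match {
    case ExceptionFailure(c, d) => s"$c: $d"
    case ExceptionFailureWithCause(c, d, cause) =>
      s"$c: $d (cause available: ${cause.isDefined})"
  }
}
```

The duplicated branch in `describe` is the complexity referred to above: it would be repeated at every place that currently handles `ExceptionFailure`.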

Contributor

I think we're definitely out of luck for binary compatibility, but I think @pwendell just wanted to preserve source compatibility (i.e., maybe users will need to recompile, but they won't need to change their code at all).

However, I don't think that is possible either. You would need to have another method like def unapply(ef: ExceptionFailure): Option[(String, String, Array[StackTraceElement], String, Option[TaskMetrics])] -- i.e., exactly the same as the built-in unapply, but without the final Option[Throwable] in the return type. But that isn't legal overloading -- it has the same set of arguments as the built-in unapply, just a different return type.

Is there another way around this I'm not seeing? I agree we shouldn't change things willy-nilly just b/c it's @DeveloperApi, but IMO this change is worth it.

Contributor

Okay - if it's not possible to preserve matching, let's just change it. Adding fields will only break matching and shouldn't hurt users who just access the fields. We might later on add some docs to suggest avoiding use of matching on developer APIs for this reason (no need to block this though).
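
The difference between field access and matching can be illustrated with a small sketch. `Failure` here is a simplified, hypothetical stand-in for a case class that has just gained an `exception` field; it is not the real `ExceptionFailure` signature.

```scala
object MatchingSketch {
  // Simplified stand-in after the change: `exception` is the added field.
  case class Failure(className: String, description: String, exception: Option[Throwable])

  // Field access compiles unchanged after a field is added.
  def summary(f: Failure): String = s"${f.className}: ${f.description}"

  // A positional pattern must list every field, so a two-argument pattern
  // written against the old shape no longer compiles; it needs the extra `_`.
  def className(f: Failure): String = f match {
    case Failure(name, _, _) => name
  }
}
```

Code in the style of `summary` only needs recompiling, while code in the style of `className` needs a source change each time a field is added.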

@squito
Contributor

squito commented Jul 1, 2015

@pwendell are you OK with the change to ExceptionFailure? If you're ok w/ it, this lgtm

@squito
Contributor

squito commented Jul 7, 2015

ping @pwendell

@pwendell
Contributor

I think the compatibility is okay, but two other quick questions:

  1. Is it well defined which exception caused the task to fail? What if a task fails N times with N different exceptions, which is present in the cause? I think this should be documented clearly in the docs for the developer API.
  2. Maybe @kayousterhout could look quickly at the DAG scheduler changes.

@markhamstra
Contributor

@pwendell I'm not seeing anything concerning in the DAGScheduler changes.

Contributor

I think this is the wrong place for this doc on preserveCause. That should go on the constructor that has it. Here I'd just doc the exception (and add an explanation of which exception it corresponds to, as Patrick asked).

@squito
Contributor

squito commented Jul 15, 2015

I played with this locally a bit, and I think that actually defaulting Option[Throwable] = None is covering up some cases where there really is an exception you should include. I think you should just remove that defaulting everywhere. It'll lead to a handful of compile errors, but mostly they will be easy to fix -- either there will be an obvious exception to pass or there won't ...

... except for the default on ExceptionFailure. That will lead to a failure in JsonProtocol.taskEndReasonFromJson. Is it possible to turn the actual exception into json in JsonProtocol.taskEndReasonToJson, so that it can be read back in? I was wondering if getting the full exception back might be useful to anything which wants to analyze the event logs. I guess it's impossible for some general purpose job to know about whatever user exceptions there are, so maybe there isn't any good reason for getting the full exception in json anyway.

So I'm leaning towards just putting in None for the exception in JsonProtocol.taskEndReasonFromJson, but just wanted to share my thoughts.
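
To illustrate the point about defaulting, here is a minimal sketch. `abortWithDefault`, `abort` and `run` are hypothetical names and signatures chosen for the example, not the scheduler's actual methods.

```scala
object DefaultArgSketch {
  // With a default, a caller can forget the exception and still compile.
  def abortWithDefault(reason: String, exception: Option[Throwable] = None): Option[Throwable] =
    exception

  // Without a default, every call site must decide explicitly.
  def abort(reason: String, exception: Option[Throwable]): Option[Throwable] =
    exception

  // Runs `body` and reports what each style of call passes along on failure.
  def run(body: => Unit): (Option[Throwable], Option[Throwable]) =
    try {
      body
      (None, None)
    } catch {
      case e: Exception =>
        val dropped = abortWithDefault(s"Task creation failed: $e") // compiles, cause lost
        val kept = abort(s"Task creation failed: $e", Some(e))      // cause must be stated
        (dropped, kept)
    }
}
```

Removing the default turns each silently dropped exception into a compile error at the call site, which is where the decision between `Some(e)` and `None` belongs.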

Contributor

nit: space after "{" (or use parens instead, with no spaces around them) (Realize that this was there before, but might as well fix it now)

@kayousterhout
Contributor

Scheduler changes LGTM, subject to @squito's suggestion

@tomwhite
Member Author

Thanks for all the feedback @pwendell, @squito, and @kayousterhout. I've addressed all your points and updated this PR.

@squito Regarding JsonProtocol, I can't see a way of serializing the fields of an arbitrary user exception class to JSON and then reconstituting them, so I agree that putting it as None is the right way to go. The stack trace and message will be preserved in the same way that they currently are though.
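
As a rough illustration of that round trip, the sketch below uses a plain `Map[String, String]` as a stand-in for JSON. The `Failure` class, the key names and the two helper functions are assumptions made for the example; they are not the real `JsonProtocol` code or its field names.

```scala
object EventLogSketch {
  // Simplified, hypothetical failure record.
  case class Failure(
      className: String,
      description: String,
      stackTrace: String,
      exception: Option[Throwable])

  // Stand-in for JSON output: only the string-valued fields are written.
  def toFields(f: Failure): Map[String, String] = Map(
    "Class Name" -> f.className,
    "Description" -> f.description,
    "Stack Trace" -> f.stackTrace)

  // Reading back cannot rebuild an arbitrary user exception, so None is used.
  def fromFields(m: Map[String, String]): Failure =
    Failure(m("Class Name"), m("Description"), m("Stack Trace"), None)
}
```

The message and stack trace survive the round trip, while the exception object itself is deliberately dropped.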

@SparkQA

SparkQA commented Jul 16, 2015

Test build #1082 has finished for PR 7014 at commit e5a1d7c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskSetFailed(taskSet: TaskSet, reason: String, exception: Option[Throwable])

Contributor

Is it possible that an exception does not throw a NotSerializableException but cannot be deserialized on the other side? This could be due to, say, class loader issues or other issues not caught during serialization time.

Contributor

(By the way, the particular exception I'm thinking of is Scala's MatchError, which actually includes the object which failed to match.)

Member Author

This is definitely possible. I've written a test for this case, and implemented a fix which uses a wrapper for the exception that gracefully falls back if deserialization of the exception fails.
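
A wrapper of this kind could be sketched as follows using standard Java serialization hooks. `ThrowableWrapper`, `Fragile` and `roundTrip` are hypothetical names for illustration, and `Fragile` only simulates a deserialization failure; this is an assumption-laden sketch, not necessarily the code in the patch.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, IOException,
  ObjectInputStream, ObjectOutputStream}

object WrapperSketch {
  // Carries a throwable across the wire, but tolerates failure to read it back.
  class ThrowableWrapper(var exception: Throwable) extends Serializable {
    private def writeObject(out: ObjectOutputStream): Unit = {
      out.writeObject(exception)
    }

    private def readObject(in: ObjectInputStream): Unit = {
      try {
        exception = in.readObject().asInstanceOf[Throwable]
      } catch {
        // Fall back to "no exception" rather than failing the whole message.
        case _: Exception => exception = null
      }
    }
  }

  // Serializes without error but always fails on the receiving side,
  // simulating e.g. a class loader problem.
  class Fragile(msg: String) extends Exception(msg) {
    private def readObject(in: ObjectInputStream): Unit =
      throw new IOException("simulated failure during deserialization")
  }

  // Serializes the wrapper to bytes and reads it back.
  def roundTrip(w: ThrowableWrapper): ThrowableWrapper = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(w)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[ThrowableWrapper] finally in.close()
  }
}
```

With this shape, a receiver checks whether `exception` is `null` after deserialization and falls back to the message and stack trace that are sent separately.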

@SparkQA

SparkQA commented Jul 30, 2015

Test build #1238 has finished for PR 7014 at commit a14f282.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskSetFailed(taskSet: TaskSet, reason: String, exception: Option[Throwable])

@aarondav
Contributor

Looks good from my end.

@squito
Contributor

squito commented Jul 30, 2015

@tomwhite there is a legit failure here, looks like you need to merge w/ master and fix a compile error

@tomwhite force-pushed the propagate-user-exceptions branch from a14f282 to d531c93 on August 5, 2015 15:13
@tomwhite
Member Author

tomwhite commented Aug 6, 2015

I rebased this on master. Is there a way to get this to be retested?

@kayousterhout
Contributor

Jenkins, retest this please

@rxin
Contributor

rxin commented Aug 7, 2015

I triggered Jenkins.

@SparkQA

SparkQA commented Aug 7, 2015

Test build #1401 has finished for PR 7014 at commit d531c93.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskSetFailed(taskSet: TaskSet, reason: String, exception: Option[Throwable])

@squito
Contributor

squito commented Aug 7, 2015

@tomwhite still looks like a real compile error:

[error] /home/jenkins/workspace/NewSparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:822: not enough arguments for method abortStage: (failedStage: org.apache.spark.scheduler.Stage, reason: String, exception: Option[Throwable])Unit.
[error] Unspecified value parameter exception.
[error]         abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}")
[error]                   ^

tomwhite added 13 commits August 7, 2015 15:38
last failure for a task, and add a test for this case.

Remove the default of None for the failure exception.

Address nits.
@tomwhite force-pushed the propagate-user-exceptions branch from d531c93 to 4c884d0 on August 7, 2015 15:03
@tomwhite
Member Author

tomwhite commented Aug 7, 2015

Thanks. The error came from new code that was committed in SPARK-4352. I've rebased and fixed the offending line.

@SparkQA

SparkQA commented Aug 7, 2015

Test build #1404 has finished for PR 7014 at commit 4c884d0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TaskSetFailed(taskSet: TaskSet, reason: String, exception: Option[Throwable])

@SparkQA

SparkQA commented Aug 7, 2015

Test build #1405 has finished for PR 7014 at commit 4c884d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Aug 12, 2015
This allows clients to retrieve the original exception from the
cause field of the SparkException that is thrown by the driver.
If the original exception is not in fact Serializable then it will
not be returned, but the message and stacktrace will be. (All Java
Throwables implement the Serializable interface, but this is no
guarantee that a particular implementation can actually be
serialized.)

Author: Tom White <[email protected]>

Closes #7014 from tomwhite/propagate-user-exceptions.

(cherry picked from commit 2e68066)
Signed-off-by: Imran Rashid <[email protected]>
@asfgit closed this in 2e68066 on Aug 12, 2015
@squito
Contributor

squito commented Aug 12, 2015

merged to master & 1.5 thanks @tomwhite !

CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015
Author: Tom White <[email protected]>

Closes apache#7014 from tomwhite/propagate-user-exceptions.

9 participants