Conversation

@JoshRosen
Contributor

TaskContext.attemptId is misleadingly named: it currently returns a taskId, which uniquely identifies a particular task attempt within a particular SparkContext, rather than an attempt number, which conveys how many times a task has been attempted.

This patch deprecates TaskContext.attemptId and adds TaskContext.taskId and TaskContext.attemptNumber fields. Prior to this change, it was impossible to determine whether a task was being re-attempted (or was a speculative copy), which made it difficult to write unit tests for tasks that fail on early attempts or for speculative tasks that complete faster than original tasks.
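To make the distinction concrete, here is a minimal plain-Scala sketch of the intended semantics (the names `launch`, `LaunchedTask`, and the bookkeeping are illustrative, not Spark's actual scheduler code): each launched attempt receives a globally unique task attempt ID, while the attempt number counts retries of the same task starting from 0.

```scala
// Illustrative sketch only; not Spark's internal scheduler code.
object AttemptSemantics {
  private var nextTaskAttemptId: Long = 0L                        // unique within one "context"
  private val attemptCounts = scala.collection.mutable.Map[Int, Int]()

  case class LaunchedTask(partitionId: Int, taskAttemptId: Long, attemptNumber: Int)

  def launch(partitionId: Int): LaunchedTask = {
    // attemptNumber restarts at 0 for each task and increments per retry...
    val attemptNumber = attemptCounts.getOrElse(partitionId, 0)
    attemptCounts(partitionId) = attemptNumber + 1
    // ...while taskAttemptId is never reused across any task.
    val id = nextTaskAttemptId
    nextTaskAttemptId += 1
    LaunchedTask(partitionId, id, attemptNumber)
  }
}
```

Under these semantics a retried task can check `attemptNumber > 0`, which is exactly what the unit tests described above need.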

Earlier versions of the TaskContext docs suggest that attemptId behaves like attemptNumber, so there's an argument to be made in favor of changing this method's implementation. Since we've decided against making that change in maintenance branches, I think it's simpler to add better-named methods and retain the old behavior for attemptId; if attemptId behaved differently in different branches, then this would cause confusing build-breaks when backporting regression tests that rely on the new attemptId behavior.

Most of this patch is fairly straightforward, but there is a bit of trickiness related to Mesos tasks: since there's no field in MesosTaskInfo to encode the attemptId, I packed it into the data field alongside the task binary.
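A minimal sketch of the packing idea (hypothetical helper functions, not the exact patch code): prepend the attempt number to the serialized task in a single ByteBuffer, and split the two apart again on the executor side.

```scala
import java.nio.ByteBuffer

// Illustrative only: pack an Int attemptNumber ahead of the task binary
// in one buffer, mirroring the "data field alongside the task binary" idea.
object TaskDataPacking {
  def pack(attemptNumber: Int, taskBinary: Array[Byte]): ByteBuffer = {
    val buf = ByteBuffer.allocate(4 + taskBinary.length)
    buf.putInt(attemptNumber)
    buf.put(taskBinary)
    buf.flip()   // rewind so readers start at position 0
    buf
  }

  def unpack(buf: ByteBuffer): (Int, Array[Byte]) = {
    val attemptNumber = buf.getInt()
    val taskBinary = new Array[Byte](buf.remaining())
    buf.get(taskBinary)
    (attemptNumber, taskBinary)
  }
}
```

Forgetting to rewind the buffer before handing it off is easy to get wrong with this scheme; a rewind bug of exactly this flavor was later fixed as SPARK-5333 (see the commit referenced at the end of this thread).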

Contributor Author

Here's the Mesos trickiness that I alluded to in the PR description.

Contributor

can we type data explicitly?

@JoshRosen
Contributor Author

/cc @koeninger, who raised this issue on the mailing list, and @yhuai, who filed the original JIRA issue.

Also, /cc @pwendell, @andrewor14, and @tdas for review.

Technically, this changes the behavior of attemptId; I'd argue that the old behavior was a bug, but we should consider whether we need to preserve it and introduce a new method which returns the attempt number. I don't think there's anything particularly useful that you can do with the taskId, but maybe I'm mistaken.

Also, "attempt number" would be a better name than "attempt ID", but I think we're kind of stuck with attemptId for binary compatibility reasons.

@SparkQA

SparkQA commented Dec 30, 2014

Test build #24908 has started for PR 3849 at commit 9d8d4d1.

  • This patch merges cleanly.

Contributor

Can we use named arguments for the two zeros and the true?

Contributor Author

Good call; I didn't do this for the test code, but this line is in DAGScheduler so it should use named arguments.
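For context, a sketch of what the reviewer is suggesting (the method signature here is hypothetical, for illustration only): bare literals like `0, 0, true` are opaque at the call site, while named arguments document themselves.

```scala
// Hypothetical signature standing in for the DAGScheduler call in question.
def runTask(partitionId: Int, attemptNumber: Int, speculative: Boolean): String =
  s"partition=$partitionId attempt=$attemptNumber speculative=$speculative"

// Opaque call site: what do the zeros and the true mean?
val a = runTask(0, 0, true)

// Self-documenting call site with named arguments:
val b = runTask(partitionId = 0, attemptNumber = 0, speculative = true)
```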

@koeninger
Contributor

Thanks for this. Most of the uses of attemptId I've seen look like they were assuming it meant the 0-based attempt number.

@pwendell
Contributor

So personally I don't think we should change the semantics of attemptId because this has been exposed to user applications and they could silently break if we modify the meaning of the field (my original JIRA referred to an internal use of this). What it means right now is "a global GUID over all attempts" - that is a bit of an awkward definition, but I don't think it's fair to call this a bug - it was just a weird definition.

So I'd be in favor of deprecating this in favor of taskAttemptId (a new field) and say that it was renamed to avoid confusion. Then we can add another field, attemptCount or attemptNum or something to convey the more intuitive thing.

It will be slightly awkward, but if anyone reads the docs it should be obvious. In fact, we should probably spruce up the docs here for things like partitionID which right now are probably not super clear to users.

@koeninger
Contributor

The flip side is that it's already documented as doing the "right" thing:

http://spark.apache.org/docs/1.1.1/api/scala/index.html#org.apache.spark.TaskContext

val attemptId: Long

the number of attempts to execute this task


@pwendell
Contributor

Ah I see - I didn't see the doc. I'm more on the fence in this case (because there was a doc that created a specification). So I guess I'm fine either way.

@JoshRosen
Contributor Author

Annoyingly, it looks like ScalaDoc doesn't display Javadoc annotations, so in the Scala documentation for 1.2.0 the TaskContext class appears to have lost all documentation, even though it still shows up in the Java docs.

I don't know how the docs for attemptId managed to get lost during the TaskContext stabilization patch, but this suggests that we need a better review checklist for public APIs which mandates full Scaladoc / Javadoc for all public interfaces.

@JoshRosen
Contributor Author

One potential consideration is backporting: if we agree that attemptId's meaning should be changed to reflect the 1.1.1 documentation, then we should make the same fix in branch-1.1, which will require a separate patch because that branch uses a different implementation of TaskContext.

@pwendell
Contributor

Hm - we probably just shouldn't backport it; again, I think users might be depending on the old behavior, and putting it in a patch release is a bit iffy.

@JoshRosen
Contributor Author

Should we at least leave a documentation note to inform users about the difference in behavior? I'm worried that someone will look at the 1.2 docs, write some code which relies on the correct behavior, then be surprised if they run it on an older release.

@SparkQA

SparkQA commented Dec 30, 2014

Test build #24908 has finished for PR 3849 at commit 9d8d4d1.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Sort(

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24908/
Test FAILed.

@JoshRosen
Contributor Author

Hmm, looks like this failed MiMa checks due to the addition of a new method to a public interface:

[info] spark-core: found 1 potential binary incompatibilities (filtered 185)
[error]  * abstract method taskId()Long in class org.apache.spark.TaskContext does not have a correspondent in old version
[error]    filter with: ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.TaskContext.taskId")

I'll add an exclusion.
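The exclusion suggested by the error message goes into Spark's MiMa configuration (a sketch; in Spark this lives in project/MimaExcludes.scala, and the exact surrounding structure may differ):

```scala
// Sketch of the MiMa exclusion from the error message above.
import com.typesafe.tools.mima.core._
import com.typesafe.tools.mima.core.ProblemFilters._

// Tell MiMa that the new abstract method on the public TaskContext
// interface is an intentional, acceptable binary change.
exclude[MissingMethodProblem]("org.apache.spark.TaskContext.taskId")
```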

- Introduce new `attemptNumber` and `taskAttemptId` methods to avoid
  ambiguity.
- Change `attemptNumber` to return Int instead of Long, since it was
  being treated as an Int elsewhere.
- Add more Javadocs.
- Add Mima excludes for new methods.
@SparkQA

SparkQA commented Dec 31, 2014

Test build #24923 has started for PR 3849 at commit cbe4d76.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 31, 2014

Test build #24923 has finished for PR 3849 at commit cbe4d76.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24923/
Test FAILed.

Contributor Author

Just realized that I can change the maxFailures property on local instead of having to use local-cluster. Let me make that change, since it's a better practice and will speed up the tests.
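For reference, a config sketch of the change being described (not runnable on its own; it assumes a Spark build on the classpath): Spark's `local[N, F]` master string sets the thread count and the task-failure tolerance directly, avoiding the heavier `local-cluster` mode.

```scala
import org.apache.spark.SparkConf

// Sketch: a local master that permits task retries.
// "local[1, 2]" = one worker thread; a task may be attempted up to 2 times
// before the job fails, which is enough to exercise retry code paths.
val conf = new SparkConf()
  .setMaster("local[1, 2]")
  .setAppName("attempt-number-retry-test")
```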

@SparkQA

SparkQA commented Dec 31, 2014

Test build #24945 has started for PR 3849 at commit 8c387ce.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 31, 2014

Test build #24945 has finished for PR 3849 at commit 8c387ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 6, 2015

Test build #25075 has finished for PR 3849 at commit 1d43aa6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25075/
Test PASSed.

@pwendell
Contributor

@JoshRosen LGTM regarding the renaming.

Conflicts:
	core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
@SparkQA

SparkQA commented Jan 12, 2015

Test build #25425 has started for PR 3849 at commit 38574d4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 12, 2015

Test build #25425 has finished for PR 3849 at commit 38574d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25425/
Test PASSed.

Contributor

Can we introduce another wrapper here instead? I imagine we will be adding more fields to serialize to Mesos executors, and it's a lot easier to maintain a struct than to track positions, types, and offsets by hand.

Contributor Author

That's a good idea; putting the serialization / deserialization code in the same wrapper class will make it much easier to verify that it's correct / test it separately. I'll push a new commit to do this.
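A sketch of what such a wrapper might look like (the name is modeled on the eventual MesosTaskLaunchData class, but the API shown here is hypothetical and may differ from the committed code):

```scala
import java.nio.ByteBuffer

// Hypothetical wrapper bundling the fields sent to Mesos executors.
// Serialization and deserialization live side by side, so a round-trip
// can be verified in a single, focused unit test.
case class MesosTaskLaunchData(serializedTask: ByteBuffer, attemptNumber: Int) {
  def toByteBuffer: ByteBuffer = {
    val bytes = new Array[Byte](serializedTask.remaining())
    serializedTask.duplicate().get(bytes)   // read via a duplicate; leave the original's position intact
    val buf = ByteBuffer.allocate(4 + bytes.length)
    buf.putInt(attemptNumber)
    buf.put(bytes)
    buf.rewind()   // rewind before handing the buffer off
    buf
  }
}

object MesosTaskLaunchData {
  def fromByteBuffer(buf: ByteBuffer): MesosTaskLaunchData = {
    val attemptNumber = buf.getInt()
    MesosTaskLaunchData(buf.slice(), attemptNumber)   // remaining bytes are the task binary
  }
}
```

Because the encoding details are hidden behind one type, adding a field later means touching only this class and its round-trip test, not every call site.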

@SparkQA

SparkQA commented Jan 13, 2015

Test build #25483 has started for PR 3849 at commit 5cfff05.

  • This patch merges cleanly.

@tnachen
Contributor

tnachen commented Jan 13, 2015

Thanks for adding the wrapper; the Mesos portions look good to me.

@SparkQA

SparkQA commented Jan 13, 2015

Test build #25483 has finished for PR 3849 at commit 5cfff05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25483/
Test PASSed.

@JoshRosen
Contributor Author

Noo! This became merge-conflicted. Let me bring it up to date...

@SparkQA

SparkQA commented Jan 14, 2015

Test build #25505 has started for PR 3849 at commit 89d03e0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 14, 2015

Test build #25505 has finished for PR 3849 at commit 89d03e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25505/
Test PASSed.

@JoshRosen
Contributor Author

I'm going to merge this into master (1.3.0) since it's a blocker for some tests that I want to write. I'll look into backporting this into maintenance branches, too, since that would allow me to backport regression tests that use the new methods.

@asfgit asfgit closed this in 259936b Jan 14, 2015
@JoshRosen JoshRosen deleted the SPARK-4014 branch January 20, 2015 18:19
asfgit pushed a commit that referenced this pull request Jan 20, 2015
- Rewind ByteBuffer before making ByteString

(This fixes a bug introduced in #3849 / SPARK-4014)

Author: Jongyoul Lee <[email protected]>

Closes #4119 from jongyoul/SPARK-5333 and squashes the following commits:

c6693a8 [Jongyoul Lee] [SPARK-5333][Mesos] MesosTaskLaunchData occurs BufferUnderflowException - changed logDebug location
4141f58 [Jongyoul Lee] [SPARK-5333][Mesos] MesosTaskLaunchData occurs BufferUnderflowException - Added license information
2190606 [Jongyoul Lee] [SPARK-5333][Mesos] MesosTaskLaunchData occurs BufferUnderflowException - Adjusted imported libraries
b7f5517 [Jongyoul Lee] [SPARK-5333][Mesos] MesosTaskLaunchData occurs BufferUnderflowException - Rewind ByteBuffer before making ByteString