Conversation

@JoshRosen
Contributor

TaskContext.attemptId is misleadingly named: it currently returns a taskId, which uniquely identifies a particular task attempt within a particular SparkContext, rather than an attempt number, which conveys how many times a task has been attempted.

This patch deprecates TaskContext.attemptId and adds TaskContext.taskId and TaskContext.attemptNumber fields. Prior to this change, it was impossible to determine whether a task was being re-attempted (or was a speculative copy), which made it difficult to write unit tests for tasks that fail on early attempts or for speculative tasks that complete faster than original tasks.
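To make the distinction concrete, here is a minimal plain-Scala sketch of the intended semantics (the names `launch`, `LaunchedTask`, and the bookkeeping are illustrative, not Spark's actual scheduler code): each launched attempt receives a globally unique task attempt ID, while the attempt number counts retries of the same task starting from 0.

```scala
// Illustrative sketch only; not Spark's internal scheduler code.
object AttemptSemantics {
  private var nextTaskAttemptId: Long = 0L                        // unique within one "context"
  private val attemptCounts = scala.collection.mutable.Map[Int, Int]()

  case class LaunchedTask(partitionId: Int, taskAttemptId: Long, attemptNumber: Int)

  def launch(partitionId: Int): LaunchedTask = {
    // attemptNumber restarts at 0 for each task and increments per retry...
    val attemptNumber = attemptCounts.getOrElse(partitionId, 0)
    attemptCounts(partitionId) = attemptNumber + 1
    // ...while taskAttemptId is never reused across any task.
    val id = nextTaskAttemptId
    nextTaskAttemptId += 1
    LaunchedTask(partitionId, id, attemptNumber)
  }
}
```

Under these semantics a retried task can check `attemptNumber > 0`, which is exactly what the unit tests described above need.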

Earlier versions of the TaskContext docs suggest that attemptId behaves like attemptNumber, so there's an argument to be made in favor of changing this method's implementation. Since we've decided against making that change in maintenance branches, I think it's simpler to add better-named methods and retain the old behavior for attemptId; if attemptId behaved differently in different branches, then this would cause confusing build-breaks when backporting regression tests that rely on the new attemptId behavior.

Most of this patch is fairly straightforward, but there is a bit of trickiness related to Mesos tasks: since there's no field in MesosTaskInfo to encode the attemptId, I packed it into the data field alongside the task binary.
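A minimal sketch of the packing idea (hypothetical helper functions, not the exact patch code): prepend the attempt number to the serialized task in a single ByteBuffer, and split the two apart again on the executor side.

```scala
import java.nio.ByteBuffer

// Illustrative only: pack an Int attemptNumber ahead of the task binary
// in one buffer, mirroring the "data field alongside the task binary" idea.
object TaskDataPacking {
  def pack(attemptNumber: Int, taskBinary: Array[Byte]): ByteBuffer = {
    val buf = ByteBuffer.allocate(4 + taskBinary.length)
    buf.putInt(attemptNumber)
    buf.put(taskBinary)
    buf.flip()   // rewind so readers start at position 0
    buf
  }

  def unpack(buf: ByteBuffer): (Int, Array[Byte]) = {
    val attemptNumber = buf.getInt()
    val taskBinary = new Array[Byte](buf.remaining())
    buf.get(taskBinary)
    (attemptNumber, taskBinary)
  }
}
```

Forgetting to rewind the buffer before handing it off is easy to get wrong with this scheme; a rewind bug of exactly this flavor was later fixed as SPARK-5333 (see the commit referenced at the end of this thread).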

Contributor Author

Here's the Mesos trickiness that I alluded to in the PR description.

Contributor

can we type data explicitly?

@JoshRosen
Contributor Author

/cc @koeninger, who raised this issue on the mailing list, and @yhuai, who filed the original JIRA issue.

Also, /cc @pwendell, @andrewor14, and @tdas for review.

Technically, this changes the behavior of attemptId; I'd argue that the old behavior was a bug, but we should consider whether we need to preserve it and introduce a new method which returns the attempt number. I don't think there's anything particularly useful that you can do with the taskId, but maybe I'm mistaken.

Also, "attempt number" would be a better name than "attempt ID", but I think we're kind of stuck with attemptId for binary compatibility reasons.

@SparkQA

SparkQA commented Dec 30, 2014

Test build #24908 has started for PR 3849 at commit 9d8d4d1.

  • This patch merges cleanly.

Contributor

Can we use named arguments for the two zeros and the true?

Contributor Author

Good call; I didn't do this for the test code, but this line is in DAGScheduler so it should use named arguments.
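For context, a sketch of what the reviewer is suggesting (the method signature here is hypothetical, for illustration only): bare literals like `0, 0, true` are opaque at the call site, while named arguments document themselves.

```scala
// Hypothetical signature standing in for the DAGScheduler call in question.
def runTask(partitionId: Int, attemptNumber: Int, speculative: Boolean): String =
  s"partition=$partitionId attempt=$attemptNumber speculative=$speculative"

// Opaque call site: what do the zeros and the true mean?
val a = runTask(0, 0, true)

// Self-documenting call site with named arguments:
val b = runTask(partitionId = 0, attemptNumber = 0, speculative = true)
```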

@koeninger
Contributor

Thanks for this. Most of the uses of attemptId I've seen look like they were assuming it meant the 0-based attempt number.

@pwendell
Contributor

So personally I don't think we should change the semantics of attemptId because this has been exposed to user applications and they could silently break if we modify the meaning of the field (my original JIRA referred to an internal use of this). What it means right now is "a global GUID over all attempts" - that is a bit of an awkward definition, but I don't think it's fair to call this a bug - it was just a weird definition.

So I'd be in favor of deprecating this in favor of taskAttemptId (a new field) and say that it was renamed to avoid confusion. Then we can add another field, attemptCount or attemptNum or something to convey the more intuitive thing.

It will be slightly awkward, but if anyone reads the docs it should be obvious. In fact, we should probably spruce up the docs here for things like partitionID which right now are probably not super clear to users.

@koeninger
Contributor

The flip side is that it's already documented as doing the "right" thing:

http://spark.apache.org/docs/1.1.1/api/scala/index.html#org.apache.spark.TaskContext

val attemptId: Long

the number of attempts to execute this task


@pwendell
Contributor

Ah I see - I didn't see the doc. I'm more on the fence in this case (because there was a doc that created a specification). So I guess I'm fine either way.

@JoshRosen
Contributor Author

Annoyingly, it looks like ScalaDoc doesn't display Javadoc annotations, so in the Scala documentation for 1.2.0 the TaskContext class appears to have lost all documentation, even though it still shows up in the Java docs.

I don't know how the docs for attemptId managed to get lost during the TaskContext stabilization patch, but this suggests that we need a better review checklist for public APIs which mandates full Scaladoc / Javadoc for all public interfaces.

@JoshRosen
Contributor Author

One potential consideration is backporting: if we agree that attemptId's meaning should be changed to reflect the 1.1.1 documentation, then we should make the same fix in branch-1.1, which will require a separate patch because that branch uses a different implementation of TaskContext.

@pwendell
Contributor

Hm - we probably just shouldn't backport it; again, I think users might be depending on the old behavior, and putting it in a patch release is a bit iffy.

@JoshRosen
Contributor Author

Should we at least leave a documentation note to inform users about the difference in behavior? I'm worried that someone will look at the 1.2 docs, write some code which relies on the correct behavior, then be surprised if they run it on an older release.

@SparkQA

SparkQA commented Dec 30, 2014

Test build #24908 has finished for PR 3849 at commit 9d8d4d1.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Sort(

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24908/
Test FAILed.

@JoshRosen
Contributor Author

Hmm, looks like this failed MiMa checks due to the addition of a new method to a public interface:

[info] spark-core: found 1 potential binary incompatibilities (filtered 185)
[error]  * abstract method taskId()Long in class org.apache.spark.TaskContext does not have a correspondent in old version
[error]    filter with: ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.TaskContext.taskId")

I'll add an exclusion.
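The exclusion suggested by the error message goes into Spark's MiMa configuration (a sketch; in Spark this lives in project/MimaExcludes.scala, and the exact surrounding structure may differ):

```scala
// Sketch of the MiMa exclusion from the error message above.
import com.typesafe.tools.mima.core._
import com.typesafe.tools.mima.core.ProblemFilters._

// Tell MiMa that the new abstract method on the public TaskContext
// interface is an intentional, acceptable binary change.
exclude[MissingMethodProblem]("org.apache.spark.TaskContext.taskId")
```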

- Introduce new `attemptNumber` and `taskAttemptId` methods to avoid
  ambiguity.
- Change `attemptNumber` to return Int instead of Long, since it was
  being treated as an Int elsewhere.
- Add more Javadocs.
- Add Mima excludes for new methods.
@SparkQA

SparkQA commented Dec 31, 2014

Test build #24923 has started for PR 3849 at commit cbe4d76.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 31, 2014

Test build #24923 has finished for PR 3849 at commit cbe4d76.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24923/
Test FAILed.

Contributor Author

Just realized that I can change the maxFailures property on local instead of having to use local-cluster. Let me make that change, since it's a better practice and will speed up the tests.
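For reference, a config sketch of the change being described (not runnable on its own; it assumes a Spark build on the classpath): Spark's `local[N, F]` master string sets the thread count and the task-failure tolerance directly, avoiding the heavier `local-cluster` mode.

```scala
import org.apache.spark.SparkConf

// Sketch: a local master that permits task retries.
// "local[1, 2]" = one worker thread; a task may be attempted up to 2 times
// before the job fails, which is enough to exercise retry code paths.
val conf = new SparkConf()
  .setMaster("local[1, 2]")
  .setAppName("attempt-number-retry-test")
```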

@SparkQA

SparkQA commented Dec 31, 2014

Test build #24945 has started for PR 3849 at commit 8c387ce.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 31, 2014

Test build #24945 has finished for PR 3849 at commit 8c387ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 6, 2015

Test build #25075 has finished for PR 3849 at commit 1d43aa6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25075/
Test PASSed.

@pwendell
Contributor

@JoshRosen LGTM regarding the renaming.

Conflicts:
	core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
@SparkQA

SparkQA commented Jan 12, 2015

Test build #25425 has started for PR 3849 at commit 38574d4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 12, 2015

Test build #25425 has finished for PR 3849 at commit 38574d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25425/
Test PASSed.

Contributor

Can we introduce another wrapper here instead? I imagine we will be adding more fields to serialize to Mesos executors, and it's a lot easier to maintain a struct than to track positions, types, and offsets by hand.

Contributor Author

That's a good idea; putting the serialization / deserialization code in the same wrapper class will make it much easier to verify that it's correct / test it separately. I'll push a new commit to do this.
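A sketch of what such a wrapper might look like (the name is modeled on the eventual MesosTaskLaunchData class, but the API shown here is hypothetical and may differ from the committed code):

```scala
import java.nio.ByteBuffer

// Hypothetical wrapper bundling the fields sent to Mesos executors.
// Serialization and deserialization live side by side, so a round-trip
// can be verified in a single, focused unit test.
case class MesosTaskLaunchData(serializedTask: ByteBuffer, attemptNumber: Int) {
  def toByteBuffer: ByteBuffer = {
    val bytes = new Array[Byte](serializedTask.remaining())
    serializedTask.duplicate().get(bytes)   // read via a duplicate; leave the original's position intact
    val buf = ByteBuffer.allocate(4 + bytes.length)
    buf.putInt(attemptNumber)
    buf.put(bytes)
    buf.rewind()   // rewind before handing the buffer off
    buf
  }
}

object MesosTaskLaunchData {
  def fromByteBuffer(buf: ByteBuffer): MesosTaskLaunchData = {
    val attemptNumber = buf.getInt()
    MesosTaskLaunchData(buf.slice(), attemptNumber)   // remaining bytes are the task binary
  }
}
```

Because the encoding details are hidden behind one type, adding a field later means touching only this class and its round-trip test, not every call site.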

@SparkQA

SparkQA commented Jan 13, 2015

Test build #25483 has started for PR 3849 at commit 5cfff05.

  • This patch merges cleanly.

@tnachen
Contributor

tnachen commented Jan 13, 2015

Thanks for adding the wrapper; the Mesos portions look good to me.

@SparkQA

SparkQA commented Jan 13, 2015

Test build #25483 has finished for PR 3849 at commit 5cfff05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25483/
Test PASSed.

@JoshRosen
Contributor Author

Noo! This became merge-conflicted. Let me bring it up to date...

@SparkQA

SparkQA commented Jan 14, 2015

Test build #25505 has started for PR 3849 at commit 89d03e0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 14, 2015

Test build #25505 has finished for PR 3849 at commit 89d03e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25505/
Test PASSed.

@JoshRosen
Contributor Author

I'm going to merge this into master (1.3.0) since it's a blocker for some tests that I want to write. I'll look into backporting this into maintenance branches, too, since that would allow me to backport regression tests that use the new methods.

@asfgit asfgit closed this in 259936b Jan 14, 2015
@JoshRosen JoshRosen deleted the SPARK-4014 branch January 20, 2015 18:19
asfgit pushed a commit that referenced this pull request Jan 20, 2015
- Rewind ByteBuffer before making ByteString

(This fixes a bug introduced in #3849 / SPARK-4014)

Author: Jongyoul Lee <[email protected]>

Closes #4119 from jongyoul/SPARK-5333 and squashes the following commits:

c6693a8 [Jongyoul Lee] [SPARK-5333][Mesos] MesosTaskLaunchData occurs BufferUnderflowException - changed logDebug location
4141f58 [Jongyoul Lee] [SPARK-5333][Mesos] MesosTaskLaunchData occurs BufferUnderflowException - Added license information
2190606 [Jongyoul Lee] [SPARK-5333][Mesos] MesosTaskLaunchData occurs BufferUnderflowException - Adjusted imported libraries
b7f5517 [Jongyoul Lee] [SPARK-5333][Mesos] MesosTaskLaunchData occurs BufferUnderflowException - Rewind ByteBuffer before making ByteString