[SPARK-4014] Add TaskContext.attemptNumber and deprecate TaskContext.attemptId #3849
Here's the Mesos trickiness that I alluded to in the PR description.
can we type data explicitly?
/cc @koeninger, who raised this issue on the mailing list, and @yhuai, who filed the original JIRA issue. Also, /cc @pwendell, @andrewor14, and @tdas for review. Technically, this changes the behavior of Also, "attempt number" would be a better name than "attempt ID", but I think we're kind of stuck with attemptId for binary compatibility reasons.
Test build #24908 has started for PR 3849 at commit
Can we use named arguments for the two zeros and the true?
Good call; I didn't do this for the test code, but this line is in DAGScheduler so it should use named arguments.
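The named-argument suggestion can be illustrated with a small sketch. `DemoContext` is a hypothetical stand-in and not the real constructor touched in DAGScheduler; its parameter names are assumptions chosen only to show why labeled literals read better than bare zeros and a bare `true`.

```scala
// Hypothetical stand-in for a context object built from several
// same-typed positional values; this is NOT Spark's real signature.
final case class DemoContext(
    stageId: Int,
    partitionId: Int,
    taskAttemptId: Long,
    attemptNumber: Int,
    runningLocally: Boolean)

object NamedArgsExample {
  // Positional call: the reader cannot tell which zero is which.
  val positional: DemoContext = DemoContext(0, 0, 0L, 0, true)

  // Named call: every literal is labeled at the call site.
  val named: DemoContext = DemoContext(
    stageId = 0,
    partitionId = 0,
    taskAttemptId = 0L,
    attemptNumber = 0,
    runningLocally = true)
}
```

Both calls build the same value; only the second one survives a later reordering of parameters without silently changing meaning.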
Thanks for this. Most of the uses of attemptId I've seen look like they were assuming it meant the 0-based attempt number.
So personally I don't think we should change the semantics of So I'd be in favor of deprecating this in favor of It will be slightly awkward, but if anyone reads the docs it should be obvious. In fact, we should probably spruce up the docs here for things like
The flip side is that it's already documented as doing the "right" thing: http://spark.apache.org/docs/1.1.1/api/scala/index.html#org.apache.spark.TaskContext
`val attemptId: Long`: the number of attempts to execute this task
On Tue, Dec 30, 2014 at 4:38 PM, Patrick Wendell [email protected]
Ah I see - I didn't see the doc. I'm more on the fence in this case (because there was a doc that created a specification). So I guess I'm fine either way.
Annoyingly, it looks like ScalaDoc doesn't display Javadoc annotations, so in the Scala documentation for 1.2.0 the TaskContext class appears to have lost all documentation, even though it still shows up in the Java docs:
I don't know how the docs for
One potential consideration is backporting: if we agree that
Hm - we probably just shouldn't backport it. Again, I think users might be depending on the old behavior, so putting it in a patch release is a bit iffy.
Should we at least leave a documentation note to inform users about the difference in behavior? I'm worried that someone will look at the 1.2 docs, write some code which relies on the correct behavior, then be surprised if they run it on an older release.
Test build #24908 has finished for PR 3849 at commit
Test FAILed.
Hmm, looks like this failed MiMa checks due to the addition of a new method to a public interface: I'll add an exclusion.
- Introduce new `attemptNumber` and `taskAttemptId` methods to avoid ambiguity.
- Change `attemptNumber` to return Int instead of Long, since it was being treated as an Int elsewhere.
- Add more Javadocs.
- Add Mima excludes for new methods.
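The API shape that this commit message describes might look roughly like the sketch below. `SketchTaskContext` is a hypothetical trait written for illustration, not the real `org.apache.spark.TaskContext`; the deprecation message and the "sketch" version string are assumptions.

```scala
// Hypothetical trait mirroring the shape described in the commit message.
// It is NOT the real org.apache.spark.TaskContext.
trait SketchTaskContext {
  /** ID that uniquely identifies this particular task attempt. */
  def taskAttemptId: Long

  /** How many times this task has been attempted; 0 for the first attempt. */
  def attemptNumber: Int

  /** Old, misleadingly named accessor, kept with its old behavior. */
  @deprecated("use attemptNumber or taskAttemptId", "sketch")
  def attemptId: Long = taskAttemptId
}

object ApiShapeExample {
  // The second attempt (attemptNumber = 1) of a task whose attempt has ID 42.
  val ctx: SketchTaskContext = new SketchTaskContext {
    val taskAttemptId: Long = 42L
    val attemptNumber: Int = 1
  }
}
```

Keeping `attemptId` as a deprecated alias for the unique ID preserves existing callers, while new code can pick the accessor whose name matches what it needs.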
Test build #24923 has started for PR 3849 at commit
Test build #24923 has finished for PR 3849 at commit
Test FAILed.
Just realized that I can change the maxFailures property on local instead of having to use local-cluster. Let me make that change, since it's a better practice and will speed up the tests.
Test build #24945 has started for PR 3849 at commit
Test build #24945 has finished for PR 3849 at commit
Test build #25075 has finished for PR 3849 at commit
Test PASSed.
@JoshRosen LGTM regarding the renaming.
Conflicts: core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala
Test build #25425 has started for PR 3849 at commit
Test build #25425 has finished for PR 3849 at commit
Test PASSed.
Can we introduce another wrapper here instead? I imagine we will be adding more fields to serialize to Mesos executors, and it's a lot easier to maintain a struct than positions with types and offsets.
That's a good idea; putting the serialization / deserialization code in the same wrapper class will make it much easier to verify that it's correct / test it separately. I'll push a new commit to do this.
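A wrapper of the kind discussed here could be sketched as follows. `LaunchDataSketch` and its wire layout (a 4-byte attempt number followed by the task bytes) are assumptions made for illustration; this is not the actual MesosTaskLaunchData implementation, and it uses plain byte arrays rather than protobuf types.

```scala
import java.nio.ByteBuffer

// Hypothetical wrapper; the name and the wire layout are assumptions.
final case class LaunchDataSketch(attemptNumber: Int, serializedTask: Array[Byte]) {
  /** Layout: 4-byte attempt number, then the serialized task bytes. */
  def toBytes: Array[Byte] = {
    val buf = ByteBuffer.allocate(4 + serializedTask.length)
    buf.putInt(attemptNumber)
    buf.put(serializedTask)
    buf.array() // backing array, independent of the buffer's position
  }
}

object LaunchDataSketch {
  /** Inverse of toBytes: reads the attempt number, then the rest as the task. */
  def fromBytes(bytes: Array[Byte]): LaunchDataSketch = {
    val buf = ByteBuffer.wrap(bytes)
    val attemptNumber = buf.getInt()
    val task = new Array[Byte](buf.remaining())
    buf.get(task)
    LaunchDataSketch(attemptNumber, task)
  }
}
```

With both directions in one class, a round-trip check such as `LaunchDataSketch.fromBytes(data.toBytes)` can be tested on its own, and adding a field means touching a single place instead of scattered offsets.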
Test build #25483 has started for PR 3849 at commit
Thanks for adding the wrapper; the Mesos portions look good to me.
Test build #25483 has finished for PR 3849 at commit
Test PASSed.
Noo! This became merge-conflicted. Let me bring it up to date...
Conflicts: project/MimaExcludes.scala
Test build #25505 has started for PR 3849 at commit
Test build #25505 has finished for PR 3849 at commit
Test PASSed.
I'm going to merge this into
- Rewind ByteBuffer before making ByteString
(This fixes a bug introduced in #3849 / SPARK-4014)
Author: Jongyoul Lee <[email protected]>
Closes #4119 from jongyoul/SPARK-5333 and squashes the following commits:
c6693a8 [Jongyoul Lee] [SPARK-5333][Mesos] MesosTaskLaunchData occurs BufferUnderflowException - changed logDebug location
4141f58 [Jongyoul Lee] [SPARK-5333][Mesos] MesosTaskLaunchData occurs BufferUnderflowException - Added license information
2190606 [Jongyoul Lee] [SPARK-5333][Mesos] MesosTaskLaunchData occurs BufferUnderflowException - Adjusted imported libraries
b7f5517 [Jongyoul Lee] [SPARK-5333][Mesos] MesosTaskLaunchData occurs BufferUnderflowException - Rewind ByteBuffer before making ByteString
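The class of bug behind this follow-up fix can be reproduced with `java.nio` alone. The sketch below is an assumption-laden illustration, not the patched code: it does not call protobuf's ByteString, it only shows that a freshly written `ByteBuffer` has its position at the end, so anything that reads from the current position sees no data until the buffer is rewound.

```scala
import java.nio.{BufferUnderflowException, ByteBuffer}
import scala.util.Try

object RewindExample {
  /** Writes two ints and returns the buffer with its position left at the end. */
  def written(): ByteBuffer = {
    val buf = ByteBuffer.allocate(8)
    buf.putInt(7)
    buf.putInt(9)
    buf
  }

  // Without rewind, nothing is left to read from the current position.
  val remainingBefore: Int = written().remaining()

  // A relative read on the unrewound buffer throws BufferUnderflowException.
  val underflows: Boolean =
    Try(written().getInt()).failed.toOption
      .exists(_.isInstanceOf[BufferUnderflowException])

  // After rewind the position is back at 0 and the first int is readable.
  val firstAfterRewind: Int = {
    val buf = written()
    buf.rewind()
    buf.getInt()
  }
}
```

Any consumer that copies "the remaining bytes" of a buffer behaves like the relative read here, which is why the rewind has to happen before the buffer is handed off.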
`TaskContext.attemptId` is misleadingly-named, since it currently returns a taskId, which uniquely identifies a particular task attempt within a particular SparkContext, instead of an attempt number, which conveys how many times a task has been attempted.
This patch deprecates `TaskContext.attemptId` and adds `TaskContext.taskId` and `TaskContext.attemptNumber` fields. Prior to this change, it was impossible to determine whether a task was being re-attempted (or was a speculative copy), which made it difficult to write unit tests for tasks that fail on early attempts or speculative tasks that complete faster than original tasks.
Earlier versions of the TaskContext docs suggest that `attemptId` behaves like `attemptNumber`, so there's an argument to be made in favor of changing this method's implementation. Since we've decided against making that change in maintenance branches, I think it's simpler to add better-named methods and retain the old behavior for `attemptId`; if `attemptId` behaved differently in different branches, then this would cause confusing build-breaks when backporting regression tests that rely on the new `attemptId` behavior.
Most of this patch is fairly straightforward, but there is a bit of trickiness related to Mesos tasks: since there's no field in MesosTaskInfo to encode the attemptId, I packed it into the `data` field alongside the task binary.
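The kind of test this description has in mind, a task that fails on an early attempt and succeeds on a retry, can be sketched without Spark. Everything below is hypothetical: `AttemptContext` and `runWithRetries` are illustrative names, and the retry loop only imitates a scheduler that hands each attempt a 0-based attempt number.

```scala
import scala.util.{Failure, Success, Try}

object RetrySketch {
  // Hypothetical minimal context; only the attempt number is modeled.
  final case class AttemptContext(attemptNumber: Int)

  /** Runs `task` up to `maxFailures` times, passing a 0-based attempt number. */
  def runWithRetries[T](maxFailures: Int)(task: AttemptContext => T): Try[T] = {
    @annotation.tailrec
    def loop(attempt: Int): Try[T] =
      Try(task(AttemptContext(attempt))) match {
        case s @ Success(_) => s
        case Failure(_) if attempt + 1 < maxFailures => loop(attempt + 1)
        case f @ Failure(_) => f
      }
    loop(0)
  }

  // A task that fails on its first attempt and succeeds on the retry.
  val result: Try[Int] = runWithRetries(maxFailures = 2) { ctx =>
    if (ctx.attemptNumber == 0) throw new RuntimeException("first attempt fails")
    ctx.attemptNumber
  }
}
```

Here `RetrySketch.result` is `Success(1)`: the task sees attempt number 0, throws, and then succeeds when it is re-run with attempt number 1. A globally unique ID alone could not express that condition, which is the gap the new accessor is meant to close.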