[SPARK-5259][CORE] Make sure shuffle metadata is already in MapOutputTracker while submitting tasks of the ShuffleMapStage #4055
Conversation
Test build #25592 has finished for PR 4055 at commit
This seems like an excessively complex way of writing 31 * stageId.hashCode + partitionId.hashCode. I don't think FP is the way to do this.
Maybe a better way is (stageId + partitionId) * (stageId + partitionId + 1) / 2 + partitionId.
See http://en.wikipedia.org/wiki/Pairing_function#Cantor_pairing_function
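For illustration, both suggestions as minimal Scala sketches (standalone helpers written for this comment, not the patch's actual code):

```scala
// The simpler combination suggested above:
def simpleHash(stageId: Int, partitionId: Int): Int =
  31 * stageId.hashCode + partitionId.hashCode

// The Cantor pairing alternative: a bijection from pairs of non-negative
// Ints to a single Int (modulo overflow), so distinct (stageId, partitionId)
// pairs get distinct values.
def cantorPair(stageId: Int, partitionId: Int): Int =
  (stageId + partitionId) * (stageId + partitionId + 1) / 2 + partitionId

// e.g. cantorPair(0, 0) == 0, cantorPair(1, 0) == 1, cantorPair(0, 1) == 2
```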
According to your case, I think we can do one more improvement in
Test build #26086 has finished for PR 4055 at commit
@JoshRosen I think that's OK, because the code change is very small and has no influence on the current logic.
@srowen, the original @cloud-fan, I think
@cloud-fan A re-submit occurs when there is a failed stage due to a fetch failure. A fetch failure means the currently running TaskSet is dead (called
@cloud-fan BTW, do you know HarryZhang? ZJU VLIS Lab
@suyanNone Thanks for the explanation of re-submit!
@cloud-fan ZhangLei, SunHongLiang, HanLi, ChenXingYu, blabla... I am ZhangLei's classmate at ZJU.
You can try Seq[Int](1).isInstanceOf[Seq[String]] in the REPL; it will return true.
isInstanceOf can't work on generic types because of JVM type erasure.
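A quick sketch of that erasure behavior (illustrative, not from the patch):

```scala
val xs: Seq[Int] = Seq(1)

// The element type is erased at runtime, so only the Seq part is checked:
xs.isInstanceOf[Seq[String]]  // true, despite holding an Int

// Wildcard matching makes the erasure explicit instead of misleading:
xs match {
  case _: Seq[_] => println("a Seq of something")
  case _         => println("not a Seq")
}
```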
@cloud-fan Yeah, I know that. And in that class there is no need to add a type parameter at the class level; it is only used at the function level, in run or runContext.
Also, this code still has things to refine, like changing var partitionId to val; I will refine it later.
I mean... something like other.isInstanceOf[ResultTask[_, _]] =.=
Yes, that's very slightly better. I agree.
So equals is not overridden in these subclasses because equality does not depend on their additional fields? Just checking that this is definitely desirable.
Eh... stageId and partitionId are like a unique composite primary key in a database. In the current Spark context, a task can certainly be identified by (stageId, partitionId); there is no need even to use canEqual.
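As a sketch of identity keyed on that composite key (a simplified stand-in class, not the actual Spark Task):

```scala
// Simplified stand-in for Spark's Task: two instances are equal iff they
// target the same (stageId, partitionId), so a retried attempt of the same
// partition compares equal to the original attempt.
class DemoTask(val stageId: Int, val partitionId: Int) {
  override def equals(other: Any): Boolean = other match {
    case t: DemoTask => t.stageId == stageId && t.partitionId == partitionId
    case _           => false
  }
  override def hashCode: Int = 31 * stageId + partitionId
}
```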
Test build #26307 has finished for PR 4055 at commit
@cloud-fan --!
retest this please
Test build #26373 has finished for PR 4055 at commit
Test build #26374 has finished for PR 4055 at commit
retest this please
Test build #26386 has finished for PR 4055 at commit
@srowen @JoshRosen can someone verify this patch?
Why not string interpolation here?
@srowen task.partitionId is of type Int.
No, I mean why not use the same s"..." syntax as in the line above? Int is fine.
@srowen Others do that... I can't figure out the advantages and disadvantages. There are a lot of lines like:

```scala
logInfo("Finished task %s in stage %s (TID %d) in %d ms on %s (%d/%d)".format(
logError("Task %s in stage %s (TID %d) had a not serializable result: %s; not retrying"
  .format(i
abort("Task %s in stage %s (TID %d) had a not serializable result: %s".format(
```

and also this:

```scala
logInfo(
  s"Lost task ${info.id} in stage ${taskSet.id} (TID
  s"${ef.className} (${ef.description}) [duplicate $dupCount]")
```

Do I need to refactor them?
The s"..." didn't exist before Scala 2.10, so I think that's why the old style is still used in the code. There's no great need to change all that. I think the interpolated style is clearer, and I tend to think that we should match surrounding code style in issues like this. Since interpolation is used in the line above, it seems right to use it here. I agree it's a tiny issue either way.
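For reference, the two styles side by side (illustrative values, not the actual log line):

```scala
val tid = 42
val host = "worker-1"

// Pre-2.10 style, still common in older parts of the codebase:
println("Finished task (TID %d) on %s".format(tid, host))

// Scala 2.10+ interpolated style, generally easier to read:
println(s"Finished task (TID $tid) on $host")
```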
Test build #28038 has finished for PR 4055 at commit
29684d8 to 9025cf1
Test build #28061 has finished for PR 4055 at commit
@cloud-fan @rxin do you have any final thoughts on this? it's looking reasonable to me though I admit I don't know this scheduler code well enough to be confident.
cc @markhamstra and @kayousterhout also
I'll take a look over the weekend.
I'm still a little against the
…hile stageId and partId are same
Test build #37907 has finished for PR 4055 at commit
Test build #37949 has finished for PR 4055 at commit
getServerStatuses has been removed in master -- I guess both of these should be

```scala
val statuses = mapOutputTracker.getMapSizesByExecutorId(0, reduceIdx)
assert(statuses != null)
assert(statuses.nonEmpty)
```

The new code will now throw an exception if we're missing the map output data, but I feel like it's probably still good to leave those asserts in.
Maybe the code below would be better?

```scala
try {
  mapOutputTracker.getMapSizesByExecutorId(0, reduceIdx)
} catch {
  case e: Exception => fail("")
}
```
We don't use try / case e: Exception => fail("") to fail tests when there is an exception -- we just let the exception fail the test directly. You get more info in the stack trace that way. So I think it's better to just leave it bare.

You could just put in a comment explaining what the point is:

```scala
// this would throw an exception if the map status hadn't been registered
mapOutputTracker.getMapSizesByExecutorId(0, reduceIdx)
```

I still slightly prefer leaving the asserts in there. Yes, they are kinda pointless with the current behavior of getMapSizesByExecutorId -- but I'd just like to be a bit more defensive, in case that behavior changes in the future (e.g., maybe some future refactoring makes them stop throwing exceptions for some reason).

Maybe to be very clear, you could include the asserts and more comments:

```scala
// this would throw an exception if the map status hadn't been registered
val statuses = mapOutputTracker.getMapSizesByExecutorId(0, reduceIdx)
// really we should have already thrown an exception rather than fail either of these
// asserts, but just to be extra defensive let's double check the statuses are OK
assert(statuses != null)
assert(statuses.nonEmpty)
```

This is pretty minor, though; I don't feel strongly about it.
Thanks for updating @suyanNone! There are compile errors b/c of changes in master, and I left some really minor comments, but I think it's basically ready. BTW, feel free to open separate JIRAs / PRs for the other issues you found (and cc me if you like). I do think they are worth discussing, but this is the most important fix.
@squito @suyanNone is this superseded by #7699? If so, would you mind closing this patch?
@suyanNone can you add your git commit email to your GitHub profile, so this commit will show up as yours?
[SPARK-5259] Add Task equals() and hashCode() to keep stage.pendingTasks accurate when a stage is retried
Description:
While running a Spark job, one stage kept retrying and kept throwing FetchMetadataException.

Reason:
Map Stage 1 -> Map Stage 2. MapStage1 is retried, so there are two task sets, TaskSet0.0 and TaskSet0.1, both of which are running. Consider:
- When is Map Stage 2 submitted?
- How does numAvailableOutputs change?
- When is Map Stage 1's output registered?
- How does stage.pendingTasks change?
Because Task does not override hashCode and equals, a task for the same partition in a different TaskSet counts as a different task. pendingTasks is cleared when the map stage is retried, so it only ever tracks the newest retry TaskSet. When the previous TaskSet then completes a task whose partition also appears in the latest TaskSet, stage.pendingTasks -= task removes nothing, but it does affect stage.numAvailableOutputs, because an output is identified by partition ID alone. As a result, a stage may be submitted while its dependency map stage has not yet registered its output in the MapOutputTracker.
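A hedged sketch of that failure mode, with a plain HashSet standing in for stage.pendingTasks and a case class standing in for Task (illustrative names, not the actual Spark code):

```scala
import scala.collection.mutable

// Structural equals/hashCode keyed on (stageId, partitionId), as the patch
// proposes for Task; a case class gives us that for free.
case class DemoTask(stageId: Int, partitionId: Int)

// pendingTasks is repopulated with tasks from the retry TaskSet.
val pendingTasks = mutable.HashSet(DemoTask(1, 0), DemoTask(1, 1))

// A straggler from the *previous* TaskSet finishes partition 0. With default
// reference equality this removal would silently miss the retry's task;
// with (stageId, partitionId) equality it removes it, keeping pendingTasks
// consistent with numAvailableOutputs.
pendingTasks -= DemoTask(1, 0)
assert(pendingTasks == Set(DemoTask(1, 1)))
```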