[SPARK-27198][core] Heartbeat interval mismatch in driver and executor #24140
Changes from all commits
12b53ee
5755b31
bc39695
fb9ea5c
76570b7
596c9b2
c62330a
9941e93
39d9b59
6c17bf8
@@ -28,6 +28,7 @@ import javax.annotation.concurrent.GuardedBy

 import scala.collection.JavaConverters._
 import scala.collection.mutable.{ArrayBuffer, HashMap, Map}
+import scala.concurrent.duration._
 import scala.util.control.NonFatal

 import com.google.common.util.concurrent.ThreadFactoryBuilder

@@ -831,9 +832,11 @@ private[spark] class Executor(
     }

     val message = Heartbeat(executorId, accumUpdates.toArray, env.blockManager.blockManagerId)
+    val heartbeatIntervalInSec =
+      conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s").millis.toSeconds.seconds
     try {
       val response = heartbeatReceiverRef.askSync[HeartbeatResponse](
-          message, RpcTimeout(conf, "spark.executor.heartbeatInterval", "10s"))
+          message, new RpcTimeout(heartbeatIntervalInSec, "spark.executor.heartbeatInterval"))

Member
The unit in the master branch is different from the unit in 2.4 after this fix, right?
Member
The underlying problem was that it was parsed differently on the driver vs. the executor. That was fixed in a different way already in [...]. Hence I don't know if there was a working behavior that changed here. I don't mind adding a release note just to be sure; my only hesitation is loading up the release notes with items that may not actually affect users. If you feel it should, I suggest you add this to "Docs text" in the JIRA: The value of [...]
Member
Right now the default time unit in master is [...]. Before this fix, when a time unit is not provided (for example, `1000`), the heartbeat is sent every 1000 ms, while the timeout for sending the heartbeat message is 1000 s (which I think is a bug introduced in #10365). I'm +1 for this fix since it has the same behavior as the master branch. However, I suggest applying the same changes related to [...]
Member
I'm not against that so much, but master just has a different implementation of all the configs. I don't know if it helps much to back-port part of it to achieve the same behavior. It won't be exactly the same change no matter what.
Member
Actually, the current fix in 2.4 has a bug. See https://github.com/apache/spark/pull/24140/files#r271067277
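The unit mismatch discussed in this thread can be illustrated without Spark. The following is a minimal sketch, not Spark's actual parser: `parse` and `HeartbeatUnits` are hypothetical names, only the `ms` and `s` suffixes are handled, and a bare number takes whatever default unit the caller passes. It assumes, as described in the comments above, that one code path defaulted to milliseconds and the other to seconds.

```scala
import scala.concurrent.duration._

object HeartbeatUnits {
  // Hypothetical stand-in for a time-string parser: a bare number is
  // interpreted in the caller's default unit; "ms" and "s" suffixes win.
  def parse(value: String, defaultUnit: TimeUnit): FiniteDuration = {
    val v = value.trim.toLowerCase
    if (v.endsWith("ms")) v.dropRight(2).trim.toLong.millis
    else if (v.endsWith("s")) v.dropRight(1).trim.toLong.seconds
    else FiniteDuration(v.toLong, defaultUnit)
  }

  def main(args: Array[String]): Unit = {
    // Same setting, two default units: the send interval and the RPC
    // timeout disagree by a factor of 1000 when no suffix is given.
    val interval = parse("1000", MILLISECONDS)
    val timeout = parse("1000", SECONDS)
    println(s"interval = $interval, timeout = $timeout")

    // With an explicit suffix both readings agree.
    println(parse("10s", MILLISECONDS) == parse("10s", SECONDS))
  }
}
```

Under these assumptions, only values written without a unit suffix are affected, which matches the `1000` example given in the thread.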
       if (response.reregisterBlockManager) {
         logInfo("Told to re-register on heartbeat")
         env.blockManager.reregister()

After discussing with @gatorsmile, we found there is a bug here. If `spark.executor.heartbeatInterval` is less than one second, the timeout will always be 0 and the heartbeat will time out (https://github.com/scala/scala/blob/v2.11.12/src/library/scala/concurrent/impl/Promise.scala#L209). This may break some users' tests that set a small timeout.
Unfortunately, 2.4 release voting passed. @dbtsai Could we document it in the release note?
@ajithme @srowen We need to fix this ASAP.
I see your point, but that isn't new behavior. This was always parsed as 'seconds' here before, so anything less than a second would have resulted in 0. It's a separate bug, but it does sound like a problem.
Sorry, I was not clear. I meant, for example, that if `spark.executor.heartbeatInterval` is `900` without a time unit, it will now be converted to 0.
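The truncation can be checked in isolation with `scala.concurrent.duration` alone. This is a minimal sketch: `timeoutFor` and `HeartbeatTruncation` are hypothetical names, and the function only mirrors the `.millis.toSeconds.seconds` chain from the patched line, taking the already-parsed millisecond value as input.

```scala
import scala.concurrent.duration._

object HeartbeatTruncation {
  // Mirrors the conversion chain in the patch: milliseconds are reduced
  // to whole seconds, so any sub-second remainder is discarded.
  def timeoutFor(intervalMs: Long): FiniteDuration =
    intervalMs.millis.toSeconds.seconds

  def main(args: Array[String]): Unit = {
    // 900 ms (a bare "900" read as milliseconds) collapses to a zero timeout.
    println(timeoutFor(900L).toMillis)
    // 1500 ms loses its fractional part as well.
    println(timeoutFor(1500L).toMillis)
    // Whole-second values such as the 10s default pass through unchanged.
    println(timeoutFor(10000L).toMillis)
  }
}
```

A conversion that keeps the millisecond value (for example, passing `intervalMs.millis` directly) would avoid the loss; whether that matches the eventual fix is not shown in this thread.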
Yeah, I agree that this is a closely related bug and fix; the `master` change fixed both, but this change just fixes the unit inconsistency, not the truncation of this value to seconds.

Release notes probably can't hurt, but I am not sure a setting of < "1000" would ever have worked in practice.
I'll put it in the release note. Thanks!
Release note added, http://spark.apache.org/releases/spark-release-2-4-1.html
Opened #24329 to fix the issue