[SPARK-27198][core] Heartbeat interval mismatch in driver and executor #24140
Conversation
Why is this against branch-2.4? Looks like it should be against master.
This has been resolved in master by #22473, but lower versions still have this problem.
MaxGekk left a comment
The changes can break existing users' apps in the case where the suffix is not set. Let's say there is an app for Spark 2.4 that sets spark.executor.heartbeatInterval to 10000 (10 seconds, interpreted as milliseconds). After your changes, it becomes 10000 seconds, 1000 times longer. Isn't it?
Could you write a test for the case when there is no suffix and spark.executor.heartbeatInterval is set to 10000?
I have the same feeling. Do you have a code snippet to expose this bug? That would make it much easier to convince other people that this is a bug fix.
Ok, will update with a test snippet shortly.
Here is a test snippet. The test fails; here we can see the mismatch.
Also, at the executor, the RPC timeout for the heartbeat is considered in seconds: https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L835
```diff
   */
  private def startDriverHeartbeater(): Unit = {
-   val intervalMs = conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s")
+   val intervalMs = conf
```
In branch 2.4 this is accessed both with getTimeAsSeconds and getTimeAsMs, yes. The only question is which interpretation was intended? It's not best practice to omit the units, but hey, it can happen.
In master, this is interpreted as milliseconds by default. I think we should switch how it's read in SparkConf instead?
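To make the inconsistency concrete, here is a minimal Java sketch. It is not the actual SparkConf code; `parse` is a hypothetical stand-in for the unitless fallback, showing how the same unitless string yields values 1000x apart depending on the default unit each call site assumes:

```java
import java.util.concurrent.TimeUnit;

public class UnitlessTimeDemo {
    // Hypothetical stand-in for the two parse paths: a unitless value
    // falls back to the caller-supplied default unit, so the same string
    // yields different durations depending on which getter is used.
    static long parseToMillis(String value, TimeUnit defaultUnit) {
        // The real parsers also accept suffixes like "10s"; only the
        // unitless fallback matters for this bug.
        return TimeUnit.MILLISECONDS.convert(Long.parseLong(value), defaultUnit);
    }

    public static void main(String[] args) {
        String interval = "10000"; // no unit suffix
        long asSeconds = parseToMillis(interval, TimeUnit.SECONDS);      // seconds-default path
        long asMillis = parseToMillis(interval, TimeUnit.MILLISECONDS);  // milliseconds-default path
        System.out.println(asSeconds); // 10000000 ms
        System.out.println(asMillis);  // 10000 ms
    }
}
```

With a seconds default one side waits ~2.8 hours between heartbeats while the other expects one every 10 seconds, which is the mismatch described in the PR.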
Okay, that seems more reasonable, as it also avoids the scenario @MaxGekk mentioned. Will modify the PR accordingly.
@srowen @cloud-fan @MaxGekk @HyukjinKwon I have updated the PR as per the suggestion. Please review.
It looks nice. Let's trigger a build and run tests.
```java
 * Convert a passed time string (e.g. 50s, 100ms, or 250us) to a time count in the given unit.
 * defaultUnit is used for strings which have just a number and no units mentioned.
 */
public static long timeStringAs(String str, TimeUnit unit, TimeUnit defaultUnit) {
```
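For illustration, here is a simplified, hypothetical version of such a parser. The real JavaUtils.timeStringAs supports more suffixes and validation; this sketch only shows how a defaultUnit parameter would kick in for bare numbers:

```java
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeStringDemo {
    // Simplified sketch: a trailing unit suffix wins; a bare number
    // falls back to defaultUnit.
    static long timeStringAs(String str, TimeUnit unit, TimeUnit defaultUnit) {
        Matcher m = Pattern.compile("(\\d+)\\s*([a-z]+)?").matcher(str.trim().toLowerCase());
        if (!m.matches()) throw new NumberFormatException("Invalid time string: " + str);
        long value = Long.parseLong(m.group(1));
        String suffix = m.group(2);
        TimeUnit parsed = suffix == null ? defaultUnit
            : suffix.equals("ms") ? TimeUnit.MILLISECONDS
            : suffix.equals("s") ? TimeUnit.SECONDS
            : suffix.equals("m") ? TimeUnit.MINUTES
            : TimeUnit.MILLISECONDS;
        return unit.convert(value, parsed);
    }

    public static void main(String[] args) {
        // A suffixed value is unambiguous regardless of defaultUnit.
        System.out.println(timeStringAs("10s", TimeUnit.MILLISECONDS, TimeUnit.MILLISECONDS));
        // A bare number is interpreted via defaultUnit: seconds vs milliseconds.
        System.out.println(timeStringAs("10000", TimeUnit.MILLISECONDS, TimeUnit.SECONDS));
        System.out.println(timeStringAs("10000", TimeUnit.MILLISECONDS, TimeUnit.MILLISECONDS));
    }
}
```

The last two lines are exactly the ambiguity under discussion: the same bare "10000" is 10000000 ms under a seconds default but 10000 ms under a milliseconds default.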
Hm, why do we want all this? I think we want to standardize the behavior w.r.t. how a unitless number is interpreted, even if it means changing behavior. The behavior is currently inconsistent and I think it should be fixed.
Without looking at the PR, +1 to Sean's comment. The "no unit" thing was always a backwards compatibility hack. I think it's time we drop support for that since it's confusing.
(I'd also vote for making all time configs return milliseconds but that's a much noisier change.)
So I see two options:
- Handle all other time-related configurations to also fall back to milliseconds by default when units are not mentioned
- Or disallow any time-related configuration to be set without specifying a unit
Please suggest which one I can proceed with?
Let's not disallow unit-less time, not here, though I think that's reasonable to consider separately for Spark 3. Here I think we should just focus on fixing the inconsistency in handling this one parameter, and make it consistent with how it's handled in master. I don't think we need more than a few lines of change for that.
common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java
ok to test
Test build #103778 has finished for PR 24140 at commit
@srowen @attilapiros @MaxGekk @cloud-fan @vanzin Thank you for all your inputs. I have updated the PR as per the latest suggestion. Please have a look.
Test build #103806 has finished for PR 24140 at commit
core/src/main/scala/org/apache/spark/internal/config/package.scala
@srowen I have updated as per the comments. Please have a look.
Test build #103869 has finished for PR 24140 at commit
Test build #103871 has finished for PR 24140 at commit
Test build #103870 has finished for PR 24140 at commit
srowen left a comment
Looks OK pending tests. I guess I don't mind using TimeUnit.
Test build #4664 has finished for PR 24140 at commit
Merged to 2.4
## What changes were proposed in this pull request?
When the heartbeat interval is configured via spark.executor.heartbeatInterval without specifying units, there is a time mismatch between the driver (which interprets it in seconds) and the executor (which interprets it in milliseconds).
## How was this patch tested?
Will add UTs.
Closes #24140 from ajithme/intervalissue.
Authored-by: Ajith <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
```diff
    try {
      val response = heartbeatReceiverRef.askSync[HeartbeatResponse](
-       message, RpcTimeout(conf, "spark.executor.heartbeatInterval", "10s"))
+       message, new RpcTimeout(heartbeatIntervalInSec, "spark.executor.heartbeatInterval"))
```
The unit in the master branch is different from the unit in 2.4 after this fix, right?
The underlying problem was that it was parsed differently on the driver, vs executor. That was fixed in a different way already in master. The behavior of "2m" doesn't change at all. But a unitless value like "1000" was interpreted as "1000 seconds" here vs "1000 milliseconds" on the driver. It's a bug fix, and I'm not sure it would have ever worked with a unitless string like "1000", as the driver would be expecting heartbeats 1000x more frequently than the executor sent them.
Hence I don't know if there was a working behavior that changed here. I don't mind adding a release note just to be sure; my only hesitation is loading up the release notes with items that may not actually affect users. If you feel it should, I suggest you add this to "Docs text" in the JIRA:
The value of spark.executor.heartbeatInterval, when specified without time units like "1000", was interpreted differently on the driver and executor. Without units, values are now consistently interpreted as milliseconds. It's best to specify this value with units, like "10s", as this was and is correctly interpreted.
Right now the default time unit in master is ms, which is the same after this fix.
Before this fix, when a time unit is not provided, for example, using 1000, the behavior is sending the heartbeat every 1000ms while the timeout of sending the heartbeat message is 1000s (which I think is a bug introduced in #10365).
I'm +1 for this fix since it has the same behavior as the master branch.
However, I suggest applying the same changes related to spark.executor.heartbeatInterval from 9362c5c#diff-6bdad48cfc34314e89599655442ff210 rather than this patch, to make all places consistent. @ajithme could you submit a follow-up PR to make the change?
I'm not against that so much, but, master just has a different implementation of all the configs. I don't know if it helps much to back-port part of it to achieve the same behavior. It won't be exactly the same change no matter what.
Actually, the current fix in 2.4 has a bug. See https://github.com/apache/spark/pull/24140/files#r271067277
```scala
val message = Heartbeat(executorId, accumUpdates.toArray, env.blockManager.blockManagerId)
val heartbeatIntervalInSec =
  conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s").millis.toSeconds.seconds
```
After discussing with @gatorsmile, we found there is a bug here. If spark.executor.heartbeatInterval is less than one second, it will always be truncated to 0 and time out. (https://github.com/scala/scala/blob/v2.11.12/src/library/scala/concurrent/impl/Promise.scala#L209)
This may break some users' tests that set a small timeout.
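A minimal Java sketch of that truncation: the 2.4 patch uses Scala's `.millis.toSeconds`, which behaves like `TimeUnit.MILLISECONDS.toSeconds` here, and the 900 value is just an example of a sub-second setting:

```java
import java.util.concurrent.TimeUnit;

public class TruncationDemo {
    public static void main(String[] args) {
        // The 2.4 fix converts the millisecond value to whole seconds
        // before building the RpcTimeout, so any sub-second interval
        // truncates to 0 via integer division.
        long intervalMs = 900; // e.g. spark.executor.heartbeatInterval=900
        long intervalSec = TimeUnit.MILLISECONDS.toSeconds(intervalMs);
        System.out.println(intervalSec); // prints 0
    }
}
```

A zero-duration RpcTimeout then fails every askSync call, which is why a small unitless setting breaks after this patch.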
Unfortunately, the 2.4 release voting has passed. @dbtsai Could we document it in the release note?
I see your point, but that isn't new behavior. This was always parsed as 'seconds' here before, so anything less than a second would have resulted in 0. It's a separate bug, but it does sound like a problem.
Sorry. I was not clear. I meant, for example, if spark.executor.heartbeatInterval is 900 without a time unit, it will be converted to 0 now.
Yeah, I agree that this is a closely related bug and fix; the master change fixed both, but this change just fixes the unit inconsistency, not the truncation of this value to seconds.
Release notes probably can't hurt, but I'm not sure a setting below "1000" would ever have worked in practice.
I'll put it in the release note. Thanks!
Release note added: http://spark.apache.org/releases/spark-release-2-4-1.html
Opened #24329 to fix the issue