[SPARK-25449][CORE] Heartbeat shouldn't include accumulators for zero metrics #22473

mukulmurthy · 2018-09-19T18:51:23Z

What changes were proposed in this pull request?

Heartbeat shouldn't include accumulators for zero metrics.

Heartbeats sent from executors to the driver every 10 seconds contain metrics and are generally on the order of a few KBs. However, for large jobs with lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks to die with heartbeat failures. We can mitigate this by not sending zero metrics to the driver.
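The effect of dropping zero-valued accumulators can be sketched in isolation. This is a minimal illustration, not the patch itself: `LongAccum` is a hypothetical stand-in for Spark's `AccumulatorV2`, and the accumulator names are made up for the example.

```scala
object ZeroAccumulatorFilter {
  // Hypothetical stand-in for Spark's AccumulatorV2; only what the sketch needs.
  final case class LongAccum(name: String, value: Long) {
    def isZero: Boolean = value == 0L
  }

  // Keep only the accumulators that carry a non-zero update.
  def nonZero(accums: Seq[LongAccum]): Seq[LongAccum] =
    accums.filterNot(_.isZero)

  def main(args: Array[String]): Unit = {
    // Illustrative names and values, not real task metrics.
    val accums = Seq(
      LongAccum("jvmGCTime", 12L),
      LongAccum("diskBytesSpilled", 0L),
      LongAccum("memoryBytesSpilled", 0L),
      LongAccum("resultSize", 2048L))
    val reported = nonZero(accums)
    println(s"${reported.size} of ${accums.size} accumulators reported: " +
      reported.map(_.name).mkString(", "))
  }
}
```

With many tasks per executor and many metrics still at zero early in a task, the same filter applied per task is what shrinks the heartbeat payload.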

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

mukulmurthy · 2018-09-19T18:51:41Z

@zsxwing for review

zsxwing · 2018-09-19T19:50:46Z

add to whitelist

zsxwing · 2018-09-19T19:51:53Z

core/src/main/scala/org/apache/spark/executor/Executor.scala

        taskRunner.task.metrics.setJvmGCTime(curGCTime - taskRunner.startGCTime)
-        accumUpdates += ((taskRunner.taskId, taskRunner.task.metrics.accumulators()))
+        accumUpdates +=
+          ((taskRunner.taskId, taskRunner.task.metrics.accumulators().filterNot(_.isZero)))


Could you add a flag for this behavior change?


SparkQA · 2018-09-19T22:22:45Z

Test build #96301 has finished for PR 22473 at commit fc98256.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-09-20T00:21:45Z

Test build #96284 has finished for PR 22473 at commit 3cf88a4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-09-20T01:33:32Z

Test build #96279 has finished for PR 22473 at commit 3cf88a4.

This patch fails from timeout after a configured wait of 400m.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-09-20T02:45:58Z

Test build #96303 has finished for PR 22473 at commit 6bb91b0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing

Left some comments. For the test, I suggest we refactor the code to make it easier to test. E.g., we can add the following method and test it directly to keep the test simple:

  private def collectAccumulatorUpdates(
      taskRunners: Iterable[TaskRunner]): Array[(Long, Seq[AccumulatorV2[_, _]])] = {
    val curGCTime = computeTotalGcTime()
    val accumUpdates = new ArrayBuffer[(Long, Seq[AccumulatorV2[_, _]])]()
    for (taskRunner <- taskRunners) {
      if (taskRunner.task != null) {
        taskRunner.task.metrics.mergeShuffleReadMetrics()
        taskRunner.task.metrics.setJvmGCTime(curGCTime - taskRunner.startGCTime)
        val accumulatorsToReport =
          if (conf.getBoolean(EXECUTOR_HEARTBEAT_DROP_ZERO_METRICS.key, true)) {
            taskRunner.task.metrics.accumulators().filterNot(_.isZero)
          } else {
            taskRunner.task.metrics.accumulators()
          }
        accumUpdates += ((taskRunner.taskId, accumulatorsToReport))
      }
    }
    accumUpdates.toArray
  }

zsxwing · 2018-09-21T17:36:47Z

core/src/main/scala/org/apache/spark/executor/Executor.scala


  // Whether to load classes in user jars before those in Spark jars
-  private val userClassPathFirst = conf.getBoolean("spark.executor.userClassPathFirst", false)
+  private val userClassPathFirst = conf.getBoolean(EXECUTOR_USER_CLASS_PATH_FIRST.key, false)


nit: conf.get(EXECUTOR_USER_CLASS_PATH_FIRST)

zsxwing · 2018-09-21T17:37:25Z

core/src/main/scala/org/apache/spark/executor/Executor.scala

  // Executor for the heartbeat task.
  private val heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat,
-    "executor-heartbeater", conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s"))
+    "executor-heartbeater", conf.getTimeAsMs(EXECUTOR_HEARTBEAT_INTERVAL.key, "10s"))


nit: conf.get(EXECUTOR_HEARTBEAT_INTERVAL). Could you search the whole code base and update them as well?

zsxwing · 2018-09-21T17:37:39Z

core/src/main/scala/org/apache/spark/executor/Executor.scala

   * heartbeats about 10 minutes because the heartbeat interval is 10s.
   */
-  private val HEARTBEAT_MAX_FAILURES = conf.getInt("spark.executor.heartbeat.maxFailures", 60)
+  private val HEARTBEAT_MAX_FAILURES = conf.getInt(EXECUTOR_HEARTBEAT_MAX_FAILURES.key, 60)


zsxwing · 2018-09-21T17:42:07Z

core/src/main/scala/org/apache/spark/executor/Executor.scala

    try {
      val response = heartbeatReceiverRef.askSync[HeartbeatResponse](
-          message, RpcTimeout(conf, "spark.executor.heartbeatInterval", "10s"))
+          message, RpcTimeout(conf, EXECUTOR_HEARTBEAT_INTERVAL.key, "10s"))


Could you add a new apply method to object RpcTimeout to support ConfigEntry?
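One possible shape for such an overload, as a sketch only. `Conf`, `ConfigEntry`, and `RpcTimeout` below are simplified hypothetical models, not Spark's classes, and the field names are assumptions made for the illustration.

```scala
object RpcTimeoutSketch {
  // Hypothetical, simplified models of the real Spark types.
  final case class ConfigEntry(key: String, defaultValueString: String)

  final class Conf(settings: Map[String, String]) {
    def get(key: String, default: String): String = settings.getOrElse(key, default)
  }

  final class RpcTimeout(val duration: String, val timeoutProp: String)

  object RpcTimeout {
    // String-keyed form, as used at the call site in the diff above.
    def apply(conf: Conf, timeoutProp: String, defaultValue: String): RpcTimeout =
      new RpcTimeout(conf.get(timeoutProp, defaultValue), timeoutProp)

    // Overload taking a typed entry, so callers do not repeat the key and default.
    def apply(conf: Conf, entry: ConfigEntry): RpcTimeout =
      apply(conf, entry.key, entry.defaultValueString)
  }

  val EXECUTOR_HEARTBEAT_INTERVAL: ConfigEntry =
    ConfigEntry("spark.executor.heartbeatInterval", "10s")

  def main(args: Array[String]): Unit = {
    val timeout = RpcTimeout(new Conf(Map.empty), EXECUTOR_HEARTBEAT_INTERVAL)
    println(s"${timeout.timeoutProp} = ${timeout.duration}")
  }
}
```

The point of the overload is that the default value lives in one place (the entry) instead of being restated as a string literal at every call site.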

zsxwing · 2018-09-21T17:43:49Z

core/src/main/scala/org/apache/spark/executor/Executor.scala

        taskRunner.task.metrics.setJvmGCTime(curGCTime - taskRunner.startGCTime)
-        accumUpdates += ((taskRunner.taskId, taskRunner.task.metrics.accumulators()))
+        val accumulatorsToReport =
+          if (conf.getBoolean(EXECUTOR_HEARTBEAT_DROP_ZERO_METRICS.key, true)) {


nit: I would prefer to keep this config value in a field next to HEARTBEAT_MAX_FAILURES, so that we don't look it up in the config on every heartbeat.
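A rough sketch of the difference this makes, assuming a hypothetical `CountingConf` in place of `SparkConf` so the number of lookups can be observed. The `Reporter` class is likewise invented for the illustration.

```scala
object CachedConfigLookup {
  // Hypothetical stand-in for SparkConf that counts how often it is queried.
  final class CountingConf(settings: Map[String, String]) {
    var lookups: Int = 0
    def getBoolean(key: String, default: Boolean): Boolean = {
      lookups += 1
      settings.get(key).map(_.toBoolean).getOrElse(default)
    }
  }

  final class Reporter(conf: CountingConf) {
    // Read once at construction time, next to other heartbeat settings.
    private val dropZeroMetrics: Boolean =
      conf.getBoolean("spark.executor.heartbeat.dropZeroMetrics", true)

    // Called on every heartbeat; it no longer touches the config.
    def report(values: Seq[Long]): Seq[Long] =
      if (dropZeroMetrics) values.filterNot(_ == 0L) else values
  }

  def main(args: Array[String]): Unit = {
    val conf = new CountingConf(Map.empty)
    val reporter = new Reporter(conf)
    (1 to 100).foreach(_ => reporter.report(Seq(0L, 1L, 2L)))
    println(s"config lookups after 100 heartbeats: ${conf.lookups}")
  }
}
```

The trade-off is that a cached value does not pick up later changes to the config, which is acceptable for a setting that is fixed for the lifetime of the executor.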

zsxwing · 2018-09-21T17:46:17Z

core/src/main/scala/org/apache/spark/internal/config/package.scala

    ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_CLASSPATH).stringConf.createOptional

+  private[spark] val EXECUTOR_HEARTBEAT_DROP_ZERO_METRICS =
+    ConfigBuilder("spark.executor.heartbeat.dropZeroMetrics").booleanConf.createWithDefault(true)


maybe call it spark.executor.heartbeat.dropZeroAccumulatorUpdates? externalAccums may contain user accumulators and not all of them are metrics.

Also please call internal() to indicate that this is not a public config.

Question -- when would you not want this to be true? This already changes behavior, but in what case would you need a safety valve to go back? Isn't it just not broadcasting updates that can't matter because they're zero?

@srowen Since the user can see these accumulator updates in the public API SparkListenerExecutorMetricsUpdate, I would prefer to add a flag in case someone really needs these zero updates. E.g., a user may use the listener API to get all accumulators used in a task. After this change, they cannot get them until the task finishes.

OK, it's an internal-only safety-valve flag.

zsxwing · 2018-09-21T17:47:30Z

core/src/main/scala/org/apache/spark/internal/config/package.scala

+      .createWithDefaultString("10s")
+
+  private[spark] val EXECUTOR_HEARTBEAT_MAX_FAILURES =
+    ConfigBuilder("spark.executor.heartbeat.maxFailures").intConf.createWithDefault(60)


nit: call internal() to indicate that this is not a public config.

zsxwing · 2018-09-21T17:48:51Z

core/src/test/scala/org/apache/spark/executor/ExecutorSuite.scala

+  }
+
+  private def heartbeatZeroMetricTest(dropZeroMetrics: Boolean): Unit = {
+    val c = "spark.executor.heartbeat.dropZeroMetrics" -> dropZeroMetrics.toString


nit: EXECUTOR_HEARTBEAT_DROP_ZERO_METRICS.key

zsxwing · 2018-09-21T17:59:58Z

core/src/test/scala/org/apache/spark/executor/ExecutorSuite.scala

+    f(executor, heartbeats)
+  }
+
+  private def invokeReportHeartbeat(executor: Executor): Unit = {


You can mixin org.scalatest.PrivateMethodTester to replace this method, such as

val reportHeartBeat = PrivateMethod[Long]('reportHeartBeat) ... executor.invokePrivate(reportHeartBeat())

mukulmurthy · 2018-09-27T16:35:56Z

Pushed updates. @zsxwing, I can refactor Executor.scala so testing is easier, but I had hoped that doing it this way would allow more testing of reportHeartbeat in the future. We have pretty good coverage for receiving heartbeats but not much for sending.

zsxwing

LGTM except some nits

zsxwing · 2018-09-27T18:05:09Z

core/src/main/scala/org/apache/spark/SparkConf.scala


-    val executorTimeoutThreshold = getTimeAsSeconds("spark.network.timeout", "120s")
-    val executorHeartbeatInterval = getTimeAsSeconds("spark.executor.heartbeatInterval", "10s")
+    val executorTimeoutThreshold = getTimeAsMs("spark.network.timeout", "120s")


Could you change getTimeAsMs back to getTimeAsSeconds? There is a slight difference when the user doesn't specify the time unit. getTimeAsMs uses ms as default, while getTimeAsSeconds uses seconds.

This part in the code only does some validation logic on the values and possibly throws an error. Because we changed it to use get(EXECUTOR_HEARTBEAT_INTERVAL), which returns the value in ms, I wanted the other one to return value in ms as well so we're comparing values in the same unit.

Could you use getTimeAsSeconds and manually convert it to ms?
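The unit-default difference discussed in this thread can be illustrated with a small self-contained parser. This is a simplified sketch and does not reproduce Spark's actual parsing code: `timeStringAs`, `asMs`, and `asSeconds` are hypothetical helpers that only mirror the behavior described above, where a value without a suffix is interpreted in the unit of the getter.

```scala
import java.util.concurrent.TimeUnit

object TimeStringDefaults {
  private val TimePattern = "(-?[0-9]+)([a-z]+)?".r

  // Only a few suffixes are modeled here.
  private val suffixes: Map[String, TimeUnit] = Map(
    "ms" -> TimeUnit.MILLISECONDS,
    "s" -> TimeUnit.SECONDS,
    "m" -> TimeUnit.MINUTES,
    "h" -> TimeUnit.HOURS)

  // Parse `str` and return its value in `defaultUnit`.
  // A missing suffix means the number is already in `defaultUnit`.
  def timeStringAs(str: String, defaultUnit: TimeUnit): Long =
    str.trim.toLowerCase match {
      case TimePattern(num, suffix) =>
        val sourceUnit = Option(suffix) match {
          case Some(s) =>
            suffixes.getOrElse(s, throw new NumberFormatException(s"Unknown time unit: $s"))
          case None => defaultUnit
        }
        defaultUnit.convert(num.toLong, sourceUnit)
      case _ =>
        throw new NumberFormatException(s"Invalid time string: $str")
    }

  def asMs(str: String): Long = timeStringAs(str, TimeUnit.MILLISECONDS)
  def asSeconds(str: String): Long = timeStringAs(str, TimeUnit.SECONDS)

  def main(args: Array[String]): Unit = {
    // With an explicit suffix both getters agree on the duration.
    println(s"120s -> ${asMs("120s")} ms, ${asSeconds("120s")} s")
    // Without a suffix the same string means two different durations.
    println(s"120  -> ${asMs("120")} ms, ${asSeconds("120")} s")
    // Parse as seconds, then convert manually to ms for the comparison.
    println(s"120 as seconds, in ms: ${TimeUnit.SECONDS.toMillis(asSeconds("120"))}")
  }
}
```

Parsing with the seconds-based getter and converting afterwards keeps an unsuffixed `spark.network.timeout=120` meaning 120 seconds while still comparing both values in milliseconds.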

zsxwing · 2018-09-27T18:05:36Z

core/src/main/scala/org/apache/spark/SparkConf.scala

    require(executorTimeoutThreshold > executorHeartbeatInterval, "The value of " +
-      s"spark.network.timeout=${executorTimeoutThreshold}s must be no less than the value of " +
-      s"spark.executor.heartbeatInterval=${executorHeartbeatInterval}s.")
+      s"spark.network.timeout=${executorTimeoutThreshold}ms must be no less than the value of " +


nit: "ms" -> "s" once you address the above comment

SparkQA · 2018-09-27T20:21:42Z

Test build #96704 has finished for PR 22473 at commit db14bd1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mukulmurthy · 2018-09-27T20:31:46Z

retest this please

SparkQA · 2018-09-27T20:51:10Z

Test build #96705 has finished for PR 22473 at commit e22ac78.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-09-28T00:14:35Z

Test build #96712 has finished for PR 22473 at commit e22ac78.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2018-09-28T16:40:34Z

Looks like org.apache.spark.deploy.history.HistoryServerSuite.executor list with executor metrics json and org.apache.spark.util.JsonProtocolSuite.SparkListenerEvent are broken by the changes. Could you also fix them?

zsxwing · 2018-09-28T19:44:36Z

LGTM pending tests

SparkQA · 2018-09-28T23:24:57Z

Test build #96771 has finished for PR 22473 at commit f6fa337.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2018-09-28T23:33:32Z

Thanks! Merging to master.

… metrics

## What changes were proposed in this pull request?

Heartbeat shouldn't include accumulators for zero metrics.

Heartbeats sent from executors to the driver every 10 seconds contain metrics and are generally on the order of a few KBs. However, for large jobs with lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks to die with heartbeat failures. We can mitigate this by not sending zero metrics to the driver.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes apache#22473 from mukulmurthy/25449-heartbeat.

Authored-by: Mukul Murthy <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>

## What changes were proposed in this pull request?

Right now, because we cast the heartbeat interval to seconds, any value less than 1 second is cast to 0. This PR just backports the changes to the heartbeat interval in apache/spark#22473 from master.

## How was this patch tested?

Jenkins

Closes #24329 from zsxwing/SPARK-27419.

Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 53658ab)
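The truncation that the backport describes is easy to reproduce with plain `java.util.concurrent.TimeUnit`. This is a minimal sketch of the arithmetic only; `toWholeSeconds` is a hypothetical helper, not a Spark method.

```scala
import java.util.concurrent.TimeUnit

object SubSecondInterval {
  // Converting milliseconds to whole seconds truncates toward zero.
  def toWholeSeconds(intervalMs: Long): Long =
    TimeUnit.MILLISECONDS.toSeconds(intervalMs)

  def main(args: Array[String]): Unit = {
    // A 500 ms heartbeat interval becomes 0 once expressed in whole seconds.
    Seq(500L, 999L, 1000L, 1500L).foreach { ms =>
      println(s"$ms ms -> ${toWholeSeconds(ms)} s")
    }
  }
}
```

Keeping the interval in milliseconds end to end avoids this loss for sub-second settings.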

mukulmurthy added 2 commits September 17, 2018 14:55

Don't send zero accumulators for metrics in heartbeat

3e0d953

add tests

3cf88a4

zsxwing reviewed Sep 19, 2018


Refactor heartbeat configs and gate dropping zero metrics behind a config

fc98256

update tests

6bb91b0

zsxwing requested changes Sep 21, 2018


mukulmurthy added 3 commits September 26, 2018 15:13

refactor code per Ryan's review

7d0c39b

minor refactor

db14bd1

use PrivateMethodTester

e22ac78

zsxwing reviewed Sep 27, 2018


mukulmurthy added 2 commits September 28, 2018 11:44

Use TestMemoryManager for test to not have to mock object field

9cbe14c

Use getTimeAsSeconds

f6fa337

asfgit closed this in 9362c5c Sep 28, 2018

mukulmurthy deleted the 25449-heartbeat branch October 1, 2018 17:56

ajithme mentioned this pull request Mar 19, 2019

[SPARK-27198][core] Heartbeat interval mismatch in driver and executor #24140

Closed

zsxwing mentioned this pull request Apr 9, 2019

[SPARK-27419][Core]Avoid casting heartbeat interval to seconds (2.4) #24329

Closed
