[Spark-20087][CORE] Attach accumulators / metrics to 'TaskKilled' end reason #21165
Conversation
common logic in TaskRunner to reduce duplicate code
Ok to test
squito left a comment:
one small change, otherwise lgtm assuming tests pass
```scala
 * 3. Set the finished flag to true and clear current thread's interrupt status
 */
private def collectAccumulatorsAndResetStatusOnFailure(taskStart: Long) = {
  reportGCAndExecutorTimeIfPossible(taskStart)
```
I don't think the extra `reportGCAndExecutorTimeIfPossible` is necessary; you can just inline it. And the original `if (task != null)` is probably easier to follow than `Option(task).map`.
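A hedged sketch of the inlined version being suggested (field and method names are taken from the surrounding diff context and may not match the final merged code):

```scala
// Sketch only: inline the GC/runtime reporting and use a plain null check
// instead of a separate helper plus Option(task).map.
private def collectAccumulatorsAndResetStatusOnFailure(taskStartTime: Long) = {
  // Report executor runtime and JVM GC time, guarded by a null check.
  if (task != null) {
    task.metrics.setExecutorRunTime(System.currentTimeMillis() - taskStartTime)
    task.metrics.setJvmGCTime(computeTotalGcTime() - startGCTime)
  }
  // ... then collect accumulator updates, set `finished`, and clear the
  // current thread's interrupt status, per the steps in the doc comment.
}
```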
```scala
 * 2. Collect accumulator updates
 * 3. Set the finished flag to true and clear current thread's interrupt status
 */
private def collectAccumulatorsAndResetStatusOnFailure(taskStart: Long) = {
```
@squito after addressing your comment, do you think we should come up with a more specific method name?
```scala
    private[spark] val accums: Seq[AccumulatorV2[_, _]] = Nil)
  extends TaskFailedReason {

  override def toErrorString: String = "TaskKilled ($reason)"
```
`s"TaskKilled ($reason)"`
Will do. Didn't notice the `s` was missing.
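The bug being pointed out: without the `s` prefix, Scala treats `$reason` as literal text rather than interpolating it. A minimal illustration:

```scala
val reason = "stage cancelled"
"TaskKilled ($reason)"   // literal: "TaskKilled ($reason)" -- no substitution
s"TaskKilled ($reason)"  // interpolated: "TaskKilled (stage cancelled)"
```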
```scala
case exceptionFailure: ExceptionFailure =>
  // Nothing left to do, already handled above for accumulator updates.

case _: TaskKilled =>
```
nit: combine this with the ExceptionFailure case.
will do.
I'm not against the change, but since this changes the semantics of accumulators, we should document the changes in a migration document or something. WDYT @cloud-fan @gatorsmile?
Also, please update the PR title:
I think documentation is necessary; I will update the documentation tomorrow (Beijing time).
It should be
Thanks. Updated.
I added a note for the accumulator update. Please comment if more documentation is needed.
I do agree the task-killed event should carry metrics updates, since it's reasonable to count killed tasks for something like how many bytes were read from files. However, I don't agree that user-side accumulators should get updates from killed tasks; that changes the semantics of accumulators. And I don't think end users need to care about killed tasks. Similarly, when we implemented task metrics, we needed to count failed tasks, but user-side accumulators still skip failed tasks. I think we should follow that approach here too. I haven't read the PR yet, but please make sure this patch only touches internal accumulators that are used for metrics reporting.
I don't agree that end users don't care about killed tasks. For example, a user may want to record CPU time for every task and get the total CPU time for the application. However, the default behaviour should keep backward compatibility with the existing behaviour. The accumulator metadata has such a field; however, we didn't expose this field to end users...
The problem is: shall we allow end users to collect metrics via accumulators? Currently only Spark can do that, via internal accumulators which count failed tasks. We need a careful API design for how to expose this ability to end users. In the meantime, since we already count failed tasks, it makes sense to also count killed tasks for internal metrics collection. We should not do these two things together, and to me the second one is way simpler to get in, so we should do it first.
Agreed. For the scope of this PR, let's get killed tasks' accumulators into metrics first. After that we can discuss the possibility of exposing the ability upon users' request.
After a second look, this part is already handled by Task's collectAccumulatorUpdates:
@jiangxb1987 @cloud-fan I think it's ready for review.
```scala
case class TaskKilled(
    reason: String,
    accumUpdates: Seq[AccumulableInfo] = Seq.empty,
    private[spark] val accums: Seq[AccumulatorV2[_, _]] = Nil)
```
Previously we used AccumulableInfo to expose accumulator information to end users. Now that AccumulatorV2 is already a public class, we don't need to do that anymore. I think we can just do:

```scala
case class TaskKilled(reason: String, accums: Seq[AccumulatorV2[_, _]])
```
Yeah, I noticed `accumUpdates: Seq[AccumulableInfo]` is only used in JsonProtocol. Is that for a reason? The current impl is constructed to be in sync with existing TaskEndReasons such as ExceptionFailure:
```scala
@DeveloperApi
case class ExceptionFailure(
    className: String,
    description: String,
    stackTrace: Array[StackTraceElement],
    fullStackTrace: String,
    private val exceptionWrapper: Option[ThrowableSerializationWrapper],
    accumUpdates: Seq[AccumulableInfo] = Seq.empty,
    private[spark] var accums: Seq[AccumulatorV2[_, _]] = Nil)
```
I'd prefer to keep them in sync; that leaves two options for cleanup:
- leave it as it is, then clean it up together with ExceptionFailure
- clean up ExceptionFailure first
@cloud-fan what do you think?
let's clean up ExceptionFailure at the same time, and use only AccumulatorV2 in this PR.
@cloud-fan After a second look, I don't think we can clean up ExceptionFailure unless we can break JsonProtocol.
Now the question is: shall we keep the unnecessary `Seq[AccumulableInfo]` in new APIs to make the API consistent? I'd prefer not to keep the `Seq[AccumulableInfo]`; we may deprecate it in the existing APIs in the near future.
I'm OK with not keeping `Seq[AccumulableInfo]`. But it means inconsistent logic and API, and may make future refactoring a bit difficult. Let's see what I can do.

> I'd like to not keep the Seq[AccumulableInfo], we may deprecate it in the existing APIs in the near future.

BTW, I think we have already deprecated AccumulableInfo. Unless we are planning to remove it in Spark 3.0, and Spark 3.0 is the next release, AccumulableInfo will be there for a long time.
Hi @cloud-fan, I have looked at how to remove `Seq[AccumulableInfo]` tonight. It turns out that we cannot, because JsonProtocol calls taskEndReasonFromJson to reconstruct TaskEndReasons, and since AccumulatorV2 is an abstract class, we cannot simply construct AccumulatorV2s from JSON. Even though we are promoting AccumulatorV2, we still need AccumulableInfo when (de)serializing JSON.
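To illustrate the constraint (a hedged sketch, not actual JsonProtocol code): `AccumulatorV2` is abstract, so a JSON payload alone cannot tell the deserializer which concrete subclass to instantiate, whereas `AccumulableInfo` is a plain data snapshot that round-trips cleanly:

```scala
// Sketch: AccumulatorV2 cannot be rebuilt from JSON generically.
// Given only {"id": 1, "name": "bytesRead", "update": 1024}, we cannot know
// whether the original accumulator was a LongAccumulator, a
// CollectionAccumulator, or a user-defined subclass -- so deserialization
// targets a concrete snapshot type instead.
// Simplified shape (field set assumed for illustration):
case class AccumulableInfoLike(
    id: Long,
    name: Option[String],
    update: Option[Any],  // value delta contributed by this task
    value: Option[Any])   // total value, if known
```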
I see, that makes sense, let's keep AccumulableInfo.
@cloud-fan So, could you trigger the test and take a look? And it looks like I am not in the whitelist again...
ping @cloud-fan
ok to test
docs/rdd-programming-guide.md
Outdated
```
</div>

In new version of Spark(> 2.3), the semantic of Accumulator has been changed a bit: it now includes updates from
```
it's not needed now
Test build #90585 has finished for PR 21165 at commit
```scala
private def collectAccumulatorsAndResetStatusOnFailure(taskStart: Long) = {
  // Report executor runtime and JVM gc time
  Option(task).foreach(t => {
    t.metrics.setExecutorRunTime(System.currentTimeMillis() - taskStart)
```
taskStartTime
Er, `taskStart` is already defined previously. Do you think we need to replace all occurrences of `taskStart` with `taskStartTime`?
We should at least rename it in this method, as it's newly added code. We can also update the existing code if it's not a lot of work.
Will do it.
LGTM
Test build #90595 has finished for PR 21165 at commit
Looks like simply adding fields with default values to a case class will break binary compatibility.
I think we can just update MimaExcludes, since it's a developer API. cc @JoshRosen
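A hedged sketch of what such a MimaExcludes entry might look like (the exact problem types and member signatures depend on what MiMa actually reports for this change; the filter names below are illustrative assumptions, not the entry that was merged):

```scala
// In project/MimaExcludes.scala (sketch; filter types are assumptions):
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.TaskKilled.this"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.TaskKilled.copy"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.TaskKilled.apply")
```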
Rename taskStart -> taskStartTime in executor
Test build #90643 has finished for PR 21165 at commit
Gently ping @cloud-fan again.
```scala
val exceptionFailure = new ExceptionFailure(
  new SparkException("fondue?"),
  accumInfo1).copy(accums = accumUpdates1)
```
Not caused by you, but why do we do a copy instead of passing accumUpdates1 to the constructor directly?
We can avoid the copy call.
Ah, this copy call cannot be avoided, as only the two-argument constructor `private[spark] def this(e: Throwable, accumUpdates: Seq[AccumulableInfo])` is defined.
```scala
val taskKilled = new TaskKilled(
  "test",
  accumInfo2).copy(accums = accumUpdates2)
```
ditto
We can avoid this copy call.
LGTM
Test build #90891 has finished for PR 21165 at commit
retest this please

1 similar comment

retest this please
Test build #90901 has finished for PR 21165 at commit
thanks, merging to master!
What changes were proposed in this pull request?
The ultimate goal is for listeners on onTaskEnd to receive metrics when a task is killed intentionally, since that data is currently just thrown away. This is already done for ExceptionFailure, so this just copies the same approach.
How was this patch tested?
Updated existing tests.
This is a rework of #17422; all credit should go to @noodle-fb.
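With this change merged, a listener can observe metrics from intentionally killed tasks. A hedged sketch of what consuming the new information might look like (exact fields available to external code may differ; `accumUpdates` is the public field added alongside the internal `accums`):

```scala
import org.apache.spark.TaskKilled
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch: surface accumulator updates from killed tasks, which were
// previously dropped for the TaskKilled end reason.
class KilledTaskMetricsListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    taskEnd.reason match {
      case killed: TaskKilled =>
        killed.accumUpdates.foreach { info =>
          println(s"killed task accumulator ${info.name.getOrElse("?")}: " +
            s"update=${info.update.getOrElse("n/a")}")
        }
      case _ => // other end reasons handled as before
    }
  }
}
```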