[SPARK-46383] Reduce Driver Heap Usage by Reducing the Lifespan of TaskInfo.accumulables() #44321
Conversation
```scala
def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
  val info = taskInfos(tid)
  // SPARK-37300: when the task was already finished state, just ignore it,
  // so that there won't cause successful and tasksSuccessful wrong result.
```
Reading this comment, the partition is already completed, probably by another TaskSetManager, and we just need to reset the task info here?
I think this branch is handling a rare corner-case where the same TaskSetManager can mark the same task as both succeeded and failed. There is some detailed prior discussion of this in https://issues.apache.org/jira/browse/SPARK-37300
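For illustration, here is a self-contained sketch of that kind of duplicate-completion guard, with simplified stand-in types rather than the actual TaskSetManager code:

```scala
// Sketch of the SPARK-37300-style guard discussed above; names are simplified stand-ins.
import scala.collection.mutable

final class AttemptState { var finished: Boolean = false }

final class SuccessBookkeeping {
  private val attempts = mutable.Map.empty[Long, AttemptState]
  var tasksSuccessful: Int = 0

  def register(tid: Long): Unit = attempts(tid) = new AttemptState

  def handleSuccessfulTask(tid: Long): Unit = {
    val state = attempts(tid)
    // If this attempt already reached a terminal state (e.g. both a success and a
    // failure were reported for it), ignore the duplicate so that tasksSuccessful
    // is not double-counted.
    if (state.finished) return
    state.finished = true
    tasksSuccessful += 1
  }
}
```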
cloud-fan
left a comment
LGTM except for some minor comments
JoshRosen
left a comment
LGTM pending minor comments and test re-triggering (it looks like the first CI run failed in checkout).
I have not looked into this in a lot of detail (and given my vacation plans, unfortunately I might not be able to).
@mridulm
JoshRosen
left a comment
On re-review, I think I may have spotted a potential unintended behavior change, but it's a bit of a subtle corner case and may actually be something that we're okay with changing:
By design, TaskSetManager is only supposed to be called while holding a task scheduler lock, so the code written here can assume serial operations. Given this, as an informal proof technique we can try to establish that a given task attempt's taskInfo will be cleared exactly once (freeing us from concerns around whether the cleared cloned task info can subsequently escape the scope of the TaskSetManager and be exposed to outside code):
- handleSuccessfulTask: this method exits early if a task is already finished. Otherwise, it updates the stored copy of the info and forwards the original to the DAGScheduler.
- handleFailedTask: similarly, this method exits early on already-finished tasks, and otherwise notifies the DAGScheduler.
- executorLost: there are a few branches in this method:
  - Running tasks are marked as failed, triggering the handleFailedTask branch.
  - Some completed tasks whose map output was lost may be resubmitted.
⚠️ I think there might be a subtle unintended behavior change here: the logic at core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala, lines 1125 to 1137 in 664e06d, will take a completed task's task info and use it in a second Resubmitted event:

```scala
// We may have a running task whose partition has been marked as successful,
// this partition has another task completed in another stage attempt.
// We treat it as a running task and will call handleFailedTask later.
if (successful(index) && !info.running && !killedByOtherAttempt.contains(tid) &&
    !isShuffleMapOutputAvailable) {
  successful(index) = false
  copiesRunning(index) -= 1
  tasksSuccessful -= 1
  addPendingTask(index)
  // Tell the DAGScheduler that this task was resubmitted so that it doesn't think our
  // stage finishes when a total of tasks.size tasks finish.
  emptyTaskInfoAccumulablesAndNotifyDagScheduler(tid,
    tasks(index), Resubmitted, null, Seq.empty, Array.empty, info)
```

Over in the DAGScheduler, it looks like the processing of Resubmitted failures is done after we've done the listener event posting. Thus, I think this PR might result in subtle changes to the listener behavior of resubmitted tasks: previously, the task info from the original successful attempt would be posted for the resubmission DAGScheduler event (and thus listener event), but now we will pass in an event with empty accumulables, and that could cause problems if downstream listener code tries to access those accumulables.
We can't realize the significant memory savings if we also want to preserve the listener-visible implicit behavior in the succeeded-then-resubmitted path.
On the other hand, there are already some significant differences in the resubmitted event path: the call at core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala, lines 1134 to 1137 in 664e06d, already passes empty accumUpdates and metricsPeaks:

```scala
// Tell the DAGScheduler that this task was resubmitted so that it doesn't think our
// stage finishes when a total of tasks.size tasks finish.
emptyTaskInfoAccumulablesAndNotifyDagScheduler(tid,
  tasks(index), Resubmitted, null, Seq.empty, Array.empty, info)
```

Given this, it might be possible that it's okay to make an implicit breaking change here, but we should discuss.
Even if we choose to go that route and accept the behavior change, it might mean that we cannot straightforwardly use the same throwOnAccumulablesCall logic as it is currently written, since we don't actually have an invariant that cleared task infos cannot flow to other components. If we lift that invariant, though, then we need to be extra careful not to introduce bugs where a cleared task info unexpectedly flows downstream.
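For illustration, a hypothetical sketch of what a throwOnAccumulablesCall-style guard could look like (names and structure are illustrative, not the PR's actual code):

```scala
// Hypothetical "fail fast on cleared accumulables" guard; names are illustrative only.
final class GuardedTaskInfo(val taskId: Long, initialAccumulables: Seq[String]) {
  private var _accumulables: Seq[String] = initialAccumulables // stand-in for Seq[AccumulableInfo]
  private var cleared: Boolean = false

  // After the metrics are dropped, any later read is treated as a programming error
  // under the "cleared exactly once, never escapes the TaskSetManager" invariant.
  def accumulables: Seq[String] = {
    if (cleared) {
      throw new IllegalStateException(
        s"accumulables of task $taskId were dropped after task completion")
    }
    _accumulables
  }

  def clearAccumulables(): Unit = {
    _accumulables = Nil
    cleared = true
  }
}
```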
@cloud-fan thanks for checking! At a minimum, this should be opt-in and not on by default.
Proposal To Gain Consensus
What do the reviewers think of the proposal? Note that the current design in the PR does not implement it: currently, accessing the emptied accumulables results in a crash. I will refactor the change if we agree on this proposal.
Sounds good to me. Thoughts, @JoshRosen, @cloud-fan?
SGTM
The proposed "make the behavior change optional and off-by-default with option for users to opt-in" approach sounds reasonable to me: users or platforms that don't rely on the hopefully-rare corner-case listener behavior can choose to opt-in in order to address a major contributor to driver memory problems with large task sets 👍 . |
Disabled the changes by default, @JoshRosen @mridulm. Can you PTAL?
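For context, an off-by-default flag of this kind is declared in `core/src/main/scala/org/apache/spark/internal/config/package.scala`; a rough sketch is below. The config key and constant name come from this PR, while the doc text and version string are illustrative:

```scala
// Rough sketch of the flag declaration; in the real source it lives inside the
// org.apache.spark.internal.config package object. Doc text and version are illustrative.
private[spark] val DROP_TASK_INFO_ACCUMULABLES_ON_TASK_COMPLETION =
  ConfigBuilder("spark.scheduler.dropTaskInfoAccumulablesOnTaskCompletion.enabled")
    .doc("If true, the TaskSetManager clears a task's accumulables once the task " +
      "completes, reducing driver heap usage for stages with many tasks. Trade-off: " +
      "Resubmitted task events may carry empty accumulables.")
    .version("4.0.0")
    .booleanConf
    .createWithDefault(false)
```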
mridulm
left a comment
Took a quick pass
```scala
val rdd1 = sc.parallelize(1 to 100, 4)
sc.runJob(rdd1, (items: Iterator[Int]) => items.size, Seq(0, 1))
sc.listenerBus.waitUntilEmpty()
listener.taskInfos.size should be { 0 }
```
I am not sure I follow this test; what is it trying to do?
This test will be successful even with DROP_TASK_INFO_ACCUMULABLES_ON_TASK_COMPLETION = true, right? (Since it is simply checking for instance equality in the fired event?)
This test asserts that the same TaskInfo object is sent in the onTaskStart and onTaskEnd events. It asserts the design in this PR: upon task completion we send the original TaskInfo object to the DAGScheduler, not a clone.
Isn't that simply an implementation detail? (For example, the resubmission case would break it.)
I am not sure what behavior we are testing for here, and how this test would help with some future change (and validation).
I don't see harm in keeping it, but I want to make sure I am not missing something here.
I don't mind dropping it. I was just trying to assert one of the ways SparkListeners could be used; it is more of a general test to ensure that we preserve the behavior of SparkListeners.
Functionally that (the right task info is in the event) should be covered already (in use of SaveStageAndTaskInfo for example). Do let me know if that is not the case.
SaveActiveTaskInfos caches TaskInfos, but there are no tests on TaskInfo objects and none asserting that the TaskInfo objects are expected to remain the same across listener events.
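For illustration, a hedged sketch of the kind of identity-preserving assertion being discussed, using the listener name mentioned above but otherwise simplified (not the PR's actual test code):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd, SparkListenerTaskStart, TaskInfo}

// Stores the TaskInfo instance seen in onTaskStart and removes *that same instance*
// in onTaskEnd, so the map is empty afterwards only if both events carried identical objects.
class SaveActiveTaskInfos extends SparkListener {
  val taskInfos = new java.util.IdentityHashMap[TaskInfo, java.lang.Boolean]()

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
    taskInfos.put(taskStart.taskInfo, java.lang.Boolean.TRUE)
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    taskInfos.remove(taskEnd.taskInfo)
  }
}

// In a test: register the listener, run a small job, call sc.listenerBus.waitUntilEmpty(),
// then assert that taskInfos is empty, i.e. every task-end event reused the task-start TaskInfo.
```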
@mridulm Can you PTAL?
mridulm
left a comment
Just a couple of minor comments.
+CC @JoshRosen, @cloud-fan
Thanks, merging to master!
What changes were proposed in this pull request?
`AccumulableInfo` is one of the top heap consumers in the driver's heap dumps for stages with many tasks. For a stage with a large number of tasks (O(100k)), we saw 30% of the heap usage stemming from `TaskInfo.accumulables()`.
The `TaskSetManager` today keeps the `TaskInfo` objects (ref1, ref2), and in turn the task metrics (`AccumulableInfo`), for every task attempt until the stage is completed. This means that for stages with a large number of tasks, we keep the metrics for all tasks around even after a task has completed and its metrics have been aggregated. Since each task has a large number of metrics, stages with many tasks end up with large heap usage in the form of task metrics.
This PR is an opt-in change (disabled by default) that reduces the driver's heap usage for stages with many tasks by no longer referencing the task metrics of completed tasks. Once a task completes in the `TaskSetManager`, we no longer keep its metrics around: we clone the `TaskInfo` object and empty out the metrics of the clone. The cloned `TaskInfo` is retained by the `TaskSetManager`, while the original `TaskInfo` object with the metrics is sent to the `DAGScheduler`, where the task metrics are aggregated. Thus, for a completed task, the `TaskSetManager` holds a `TaskInfo` object with empty metrics. This reduces the memory footprint by ensuring that the number of task metric objects is proportional to the number of active tasks, not to the total number of tasks in the stage.
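The clone-and-empty step can be pictured with a small self-contained sketch (simplified stand-in types and names, not the PR's actual code):

```scala
import scala.collection.mutable

// Simplified stand-ins for TaskInfo / TaskSetManager.
final class TaskInfoLite(val taskId: Long, val accumulables: Seq[String]) {
  def cloneWithEmptyAccumulables(): TaskInfoLite = new TaskInfoLite(taskId, Nil)
}

final class TaskSetManagerLite(notifyDagScheduler: TaskInfoLite => Unit) {
  private val taskInfos = mutable.Map.empty[Long, TaskInfoLite]

  def taskStarted(info: TaskInfoLite): Unit = taskInfos(info.taskId) = info

  def taskCompleted(tid: Long): Unit = {
    val original = taskInfos(tid)
    // Keep only a metrics-free clone, so the per-task accumulables become unreachable
    // once the DAGScheduler has aggregated them from the original TaskInfo.
    taskInfos(tid) = original.cloneWithEmptyAccumulables()
    // The original TaskInfo (with metrics) goes to the DAGScheduler for aggregation
    // and for listener events.
    notifyDagScheduler(original)
  }
}
```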
Config to gate changes
The changes in the PR are guarded by the Spark conf `spark.scheduler.dropTaskInfoAccumulablesOnTaskCompletion.enabled`, which can be used for rollback or staged rollouts.
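To opt in, set the flag before creating the SparkContext; for example (the key is from this PR, the surrounding setup is illustrative):

```scala
import org.apache.spark.SparkConf

// Opt in to dropping completed tasks' accumulables; the flag is off by default.
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.scheduler.dropTaskInfoAccumulablesOnTaskCompletion.enabled", "true")
```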
Why are the changes disabled by default?
The PR introduces a breaking change wherein `TaskInfo.accumulables()` is empty for `Resubmitted` tasks upon the loss of an executor. Read #44321 (review) for details.
Why are the changes needed?
Reduce driver's heap usage, especially for stages with many tasks
Benchmarking
On a cluster running a scan stage with 100k tasks, the TaskSetManager's heap usage dropped from 1.1 GB to 37 MB. This reduced the total driver's heap usage by 38%, down to 2 GB from 3.5 GB.
BEFORE
WITH FIX
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added new tests and did benchmarking on a cluster.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Github Copilot