Commit 2f6cca5
[SPARK-45439][SQL][UI] Reduce memory usage of LiveStageMetrics.accumIdsToMetricType
### What changes were proposed in this pull request?
This PR aims to reduce the memory consumption of `LiveStageMetrics.accumIdsToMetricType`, which should help to reduce driver memory usage when running complex SQL queries that contain many operators and run many jobs.
In SQLAppStatusListener, the LiveStageMetrics.accumIdsToMetricType field holds a map which is used to look up the type of accumulators in order to perform conditional processing of a stage’s metrics.
Currently, that field is derived from `LiveExecutionData.metrics`, which contains metrics for _all_ operators used anywhere in the query. Whenever a job is submitted, we construct a fresh map containing all metrics that have ever been registered for that SQL query. If a query runs a single job, this isn't an issue: in that case, all `LiveStageMetrics` instances will hold the same immutable `accumIdsToMetricType`.
The problem arises if we have a query that runs many jobs (e.g. a complex query with many joins which gets divided into many jobs due to AQE): in that case, each job submission results in a new `accumIdsToMetricType` map being created.
This PR fixes this by changing `accumIdsToMetricType` to be a mutable `mutable.HashMap` which is shared across all `LivestageMetrics` instances belonging to the same `LiveExecutionData`.
The modified classes are `private` and are used only in SQLAppStatusListener, so I don't think this change poses any realistic risk of binary incompatibility risks to third party code.
### Why are the changes needed?
Addresses one contributing factor behind high driver memory / OOMs when executing complex queries.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
To demonstrate memory reduction, I performed manual benchmarking and heap dump inspection using benchmark that ran copies of a complex query: each test query launches ~200 jobs (so at least 200 stages) and contains ~3800 total operators, resulting in a huge number metric accumulators. Prior to this PR's fix, ~3700 LiveStageMetrics instances (from multiple concurrent runs of the query) consumed a combined ~3.3 GB of heap. After this PR's fix, I observed negligible memory usage from LiveStageMetrics.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #43250 from JoshRosen/reduce-accum-ids-to-metric-type-mem-overhead.
Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>1 parent 6f46ea2 commit 2f6cca5
File tree
1 file changed
+18
-7
lines changed- sql/core/src/main/scala/org/apache/spark/sql/execution/ui
1 file changed
+18
-7
lines changedLines changed: 18 additions & 7 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
95 | 95 | | |
96 | 96 | | |
97 | 97 | | |
98 | | - | |
| 98 | + | |
99 | 99 | | |
100 | 100 | | |
101 | 101 | | |
| |||
111 | 111 | | |
112 | 112 | | |
113 | 113 | | |
114 | | - | |
| 114 | + | |
115 | 115 | | |
116 | 116 | | |
117 | 117 | | |
| |||
361 | 361 | | |
362 | 362 | | |
363 | 363 | | |
364 | | - | |
| 364 | + | |
365 | 365 | | |
366 | 366 | | |
367 | 367 | | |
| |||
383 | 383 | | |
384 | 384 | | |
385 | 385 | | |
386 | | - | |
| 386 | + | |
387 | 387 | | |
388 | 388 | | |
389 | 389 | | |
390 | 390 | | |
391 | 391 | | |
392 | 392 | | |
393 | 393 | | |
394 | | - | |
| 394 | + | |
395 | 395 | | |
396 | 396 | | |
397 | 397 | | |
| |||
490 | 490 | | |
491 | 491 | | |
492 | 492 | | |
493 | | - | |
| 493 | + | |
| 494 | + | |
| 495 | + | |
| 496 | + | |
| 497 | + | |
| 498 | + | |
494 | 499 | | |
495 | 500 | | |
496 | 501 | | |
| |||
522 | 527 | | |
523 | 528 | | |
524 | 529 | | |
| 530 | + | |
| 531 | + | |
| 532 | + | |
| 533 | + | |
| 534 | + | |
| 535 | + | |
525 | 536 | | |
526 | 537 | | |
527 | 538 | | |
528 | 539 | | |
529 | 540 | | |
530 | 541 | | |
531 | | - | |
| 542 | + | |
532 | 543 | | |
533 | 544 | | |
534 | 545 | | |
| |||
0 commit comments