[SPARK-1560]: Updated Pyrolite Dependency to be Java 6 compatible #479
Conversation
Merged build triggered.

Merged build started.

lgtm

This might fail the build initially, because I uploaded Pyrolite to Sonatype just a little while ago. It might be a while before it propagates.

Do we actually publish our own Pyrolite?

Yeah, we publish our own. The Pyrolite project itself doesn't maintain anything in Maven.

Merged build finished. All automated tests passed.

All automated tests passed.

@pwendell I'm not sure if this invoked the SparkSQL tests.

That's fine, I'm just gonna merge this. The heuristic didn't detect changes in SQL, so it didn't run the tests, but we'll catch any errors in the nightly tests.
Changed the Pyrolite dependency to a build which targets Java 6.

Author: Ahir Reddy <[email protected]>

Closes #479 from ahirreddy/java6-pyrolite and squashes the following commits:

8ea25d3 [Ahir Reddy] Updated maven build to use java 6 compatible pyrolite
dabc703 [Ahir Reddy] Updated Pyrolite dependency to be Java 6 compatible

(cherry picked from commit 0f87e6a)
Signed-off-by: Patrick Wendell <[email protected]>
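For context, swapping in a repackaged Pyrolite in a Maven build looks roughly like the fragment below. The coordinates and version shown are illustrative assumptions, not the exact values from Spark's `pom.xml` at the time of this PR:

```xml
<!-- Hypothetical coordinates: Spark published its own Pyrolite build
     because upstream Pyrolite did not publish artifacts to Maven Central.
     The version here stands in for whichever release was compiled
     with -source/-target 1.6. -->
<dependency>
  <groupId>org.spark-project</groupId>
  <artifactId>pyrolite</artifactId>
  <version>2.0.1</version>
</dependency>
```

Because the artifact had only just been uploaded to Sonatype, builds could fail until the artifact propagated to the mirrors, which is the concern raised in the conversation above.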
…river/executors (#479)

* Added configuration properties to inject arbitrary secrets into the driver/executors
* Addressed comments
Move outside periodic jobs back to OpenLab
…#479)

* Revert "KE-37052 translate boolean column to V2Predicate (apache#477)". This reverts commit 7796f19.
* KE-37052 translate boolean column to V2Predicate (apache#476)
* KE-37052 translate boolean column to V2Predicate
* update spark version
…fespan of `TaskInfo.accumulables()` (apache#479)

### What changes were proposed in this pull request?

`AccumulableInfo` is one of the top heap consumers in the driver's heap dumps for stages with many tasks. For a stage with a large number of tasks (**_O(100k)_**), we saw **30%** of the heap usage stemming from `TaskInfo.accumulables()`.

[image]

The `TaskSetManager` today keeps around the `TaskInfo` objects ([ref1](https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L134), [ref2](https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L192)) and in turn the task metrics (`AccumulableInfo`) for every task attempt until the stage is completed. This means that for stages with a large number of tasks, we keep metrics for all the tasks (`AccumulableInfo`) around even when a task has completed and its metrics have been aggregated. Since each task carries a large number of metrics, stages with many tasks end up with large heap usage in the form of task metrics.

This PR is an opt-in change (disabled by default) that reduces the driver's heap usage for stages with many tasks by no longer referencing the task metrics of completed tasks. Once a task is completed in `TaskSetManager`, we no longer keep its metrics around. Upon task completion, we clone the `TaskInfo` object and empty out the metrics for the clone. The cloned `TaskInfo` is retained by the `TaskSetManager`, while the original `TaskInfo` object with the metrics is sent over to the `DAGScheduler`, where the task metrics are aggregated. Thus, for a completed task, `TaskSetManager` holds a `TaskInfo` object with empty metrics. This reduces the memory footprint by ensuring that the number of task metric objects is proportional to the number of active tasks, not to the total number of tasks in the stage.

### Config to gate changes

The changes in the PR are guarded by the Spark conf `spark.scheduler.dropTaskInfoAccumulablesOnTaskCompletion.enabled`, which can be used for rollback or staged rollouts.

### Why are the changes disabled by default?

The PR introduces a breaking change wherein `TaskInfo.accumulables()` is empty for `Resubmitted` tasks upon the loss of an executor. Read apache#44321 (review) for details.

### Why are the changes needed?

Reduce the driver's heap usage, especially for stages with many tasks.

## Benchmarking

On a cluster running a scan stage with 100k tasks, the `TaskSetManager`'s heap usage dropped from 1.1 GB to 37 MB. This **reduced the total driver's heap usage by 38%**, down to 2 GB from 3.5 GB.

**BEFORE**

[image]

**WITH FIX**

<img width="1386" alt="image" src="https://github.com/databricks/runtime/assets/10495099/b85129c8-dc10-4ee2-898d-61c8e7449616">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new tests and did benchmarking on a cluster.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Github Copilot

Closes apache#44321 from utkarsh39/SPARK-46383.

Authored-by: Utkarsh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 28da1d8)
Co-authored-by: Utkarsh <[email protected]>
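The retention strategy described above can be illustrated with a minimal, hypothetical sketch. The class and method names below are invented for illustration and are not Spark's actual Scala APIs; the point is only the clone-and-drop pattern that bounds retained metrics to active tasks:

```python
from dataclasses import dataclass


@dataclass
class TaskInfo:
    """Per-task bookkeeping; accumulables can dominate heap usage in aggregate."""
    task_id: int
    accumulables: list

    def clone_with_empty_accumulables(self) -> "TaskInfo":
        # Keep the identity fields but drop the metric references, so a
        # completed task no longer pins its metrics in memory.
        return TaskInfo(self.task_id, [])


class TaskSetManager:
    """Toy manager: retains a lightweight clone once a task completes."""

    def __init__(self, drop_on_completion: bool = True):
        # Mirrors an opt-in conf gating the behavior (disabled would keep
        # full metrics for every task until the stage finishes).
        self.drop_on_completion = drop_on_completion
        self.task_infos: dict[int, TaskInfo] = {}

    def register(self, info: TaskInfo) -> None:
        self.task_infos[info.task_id] = info

    def on_task_completed(self, task_id: int) -> TaskInfo:
        info = self.task_infos[task_id]
        if self.drop_on_completion:
            # Retain only the metric-free clone; the original (with
            # metrics) is handed off for aggregation elsewhere.
            self.task_infos[task_id] = info.clone_with_empty_accumulables()
        return info
```

With the flag on, memory held by the manager scales with the number of in-flight tasks rather than the total task count of the stage, which matches the benchmark result quoted above.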