Conversation

@ulysses-you
Contributor

Why are the changes needed?

This PR adds a new rule, FinalStageResourceManager, to eagerly kill redundant executors.

We first get the number of final stage partitions, which is the number of cores actually required, then kill the redundant executors. The priority for killing executors is:

  1. kill younger executors first (the longer an executor has been alive, the better the JIT has warmed up)
  2. kill executors that have produced less shuffle data first

The reason for adding this feature is that if the previous stage uses many executors but the final stage needs fewer, the final stage tasks are scheduled randomly across all existing executors, which can waste resources, e.g., each executor runs only 1 or 2 tasks but holds 4 or 5 cores. A sketch of the kill-priority ordering is shown below.
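
For illustration only, a minimal sketch of how the kill priority described above could be expressed. The ExecutorInfo case class and its fields are hypothetical, not the actual types used by FinalStageResourceManager.

```scala
// Hypothetical view of an executor: when it registered and how much shuffle
// data it currently holds.
case class ExecutorInfo(id: String, registrationTime: Long, shuffleBytes: Long)

// Pick executors to kill: youngest first (older JVMs are better JIT-warmed),
// and among equally old ones, prefer those holding less shuffle data.
def pickExecutorsToKill(executors: Seq[ExecutorInfo], numToKill: Int): Seq[String] = {
  executors
    .sortBy(e => (-e.registrationTime, e.shuffleBytes))
    .take(numToKill)
    .map(_.id)
}
```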

How was this patch tested?

Tested manually:

  • verified that the redundant executors are killed

[screenshot of killed executors]

@ulysses-you
Contributor Author

.createWithDefault(true)

val FINAL_WRITE_STAGE_EAGERLY_KILL_EXECUTORS_ENABLED =
buildConf("spark.sql.finalWriteStageEagerlyKillExecutors.enabled")
Member

It's valuable to introduce a new namespace, spark.sql.finalWriteStage.

Suggested change
buildConf("spark.sql.finalWriteStageEagerlyKillExecutors.enabled")
buildConf("spark.sql.finalWriteStage.eagerlyKillExecutors.enabled")

@codecov-commenter

codecov-commenter commented Mar 24, 2023

Codecov Report

Merging #4592 (28d4230) into master (351bab3) will increase coverage by 0.03%.
The diff coverage is 90.90%.

❗ Current head 28d4230 differs from pull request most recent head f35208b. Consider uploading reports for the commit f35208b to get more accurate results

@@             Coverage Diff              @@
##             master    #4592      +/-   ##
============================================
+ Coverage     53.26%   53.30%   +0.03%     
  Complexity       13       13              
============================================
  Files           577      577              
  Lines         31557    31568      +11     
  Branches       4244     4245       +1     
============================================
+ Hits          16810    16827      +17     
+ Misses        13161    13153       -8     
- Partials       1586     1588       +2     
Impacted Files                                           Coverage Δ
...in/scala/org/apache/kyuubi/sql/KyuubiSQLConf.scala    98.37% <90.90%> (-0.74%) ⬇️

... and 11 files with indirect coverage changes


// - target executors > min executors
val numActiveExecutors = sc.getExecutorIds().length
val expectedCores = partitionSpecs.length
val targetExecutors = (((expectedCores / executorCores) + 1) * factor).toInt
Member

(expectedCores / executorCores) + 1 => math.ceil(expectedCores.toFloat / executorCores)
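
For context, a quick illustration (my own example, not from the review) of why the ceiling form is preferred: integer division plus one overestimates whenever expectedCores is an exact multiple of executorCores.

```scala
val executorCores = 4
val expectedCores = 8
(expectedCores / executorCores) + 1                     // 3: one executor too many
math.ceil(expectedCores.toFloat / executorCores).toInt  // 2: exactly enough cores
```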

// - target executors > min executors
val numActiveExecutors = sc.getExecutorIds().length
val expectedCores = partitionSpecs.length
val targetExecutors = (((expectedCores / executorCores) + 1) * factor).toInt
Member

Keep at least 1 exec?
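
One possible way to address this (my own sketch, continuing the snippet quoted above; spark.dynamicAllocation.minExecutors is the standard Spark config, default 0):

```scala
// Never target fewer executors than the DRA minimum, and keep at least one.
val minExecutors = sc.getConf.getInt("spark.dynamicAllocation.minExecutors", 0)
val boundedTargetExecutors = math.max(targetExecutors, math.max(minExecutors, 1))
```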

executorIdsToKill.toSeq
}

private def killExecutors(
Member

org.apache.spark.SparkContext#killExecutors?

Contributor Author

There is a story about DRA. Since apache/spark#20604, org.apache.spark.SparkContext#killExecutors is not allowed with DRA on, so this PR hacks the internal interface to kill executors. I think that PR is not very reasonable; it should be OK to kill executors with DRA on as long as the min executors is less than the target executors.

@yaooqinn
Member

kill younger executors first (the longer an executor has been alive, the better the JIT has warmed up)

This is not always true:

  • the generated code may not be JITed in the final stage.
  • the old ones are very likely to be overloaded in some cases
  • potential problems with shuffle tracking?

*/
case class FinalStageResourceManager(session: SparkSession) extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = {
    if (!conf.getConf(KyuubiSQLConf.FINAL_WRITE_STAGE_EAGERLY_KILL_EXECUTORS_ENABLED)) {
Member

dynamicAllocation enabled?
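
A minimal sketch of what such a guard could look like (my own illustration, not the PR's code; shouldApply is a hypothetical helper, and spark.dynamicAllocation.enabled is the standard Spark switch):

```scala
// Only apply the rule when the Kyuubi flag is on and dynamic allocation is
// enabled, since killing executors only makes sense when Spark manages their count.
private def shouldApply(session: SparkSession): Boolean = {
  session.sessionState.conf.getConf(
    KyuubiSQLConf.FINAL_WRITE_STAGE_EAGERLY_KILL_EXECUTORS_ENABLED) &&
    session.sparkContext.getConf.getBoolean("spark.dynamicAllocation.enabled", false)
}
```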

@ulysses-you
Contributor Author

@yaooqinn

the generated code may not be JITed in the final stage.

yea, but at least the JIT should work better for shuffle-read and scheduler-related code on the older ones

the old ones are very likely to be overloaded in some cases

it really depends on the machine... I'm not sure how to find the bad machines in an easy way

potential problems with shuffle tracking?

We first pick executors to kill according to their alive time and exclude the ones holding shuffle data, so it should be safe for this case.

* limitations under the License.
*/

package org.apache.spark
Member

this is for unstable calls?

Contributor Author

yes, e.g., CoarseGrainedSchedulerBackend is under private[spark]
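
For illustration, a minimal sketch of the trick being discussed: a file placed under package org.apache.spark can reach private[spark] members such as CoarseGrainedSchedulerBackend. The helper object below is hypothetical, and the killExecutors signature shown is the Spark 3.x one, which may differ across Spark versions.

```scala
package org.apache.spark

import org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend

// Hypothetical helper: living inside package org.apache.spark grants access to
// private[spark] APIs like SparkContext.schedulerBackend and the backend itself.
object ExecutorKillHelper {

  def killExecutors(sc: SparkContext, executorIds: Seq[String]): Unit = {
    sc.schedulerBackend match {
      case backend: CoarseGrainedSchedulerBackend =>
        // Spark 3.x signature: (executorIds, adjustTargetNumExecutors, countFailures, force)
        backend.killExecutors(
          executorIds,
          adjustTargetNumExecutors = false,
          countFailures = false,
          force = true)
      case _ => // e.g. local mode: no cluster backend, nothing to kill
    }
  }
}
```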

@ulysses-you
Contributor Author

thanks for the review, merging to master!

@ulysses-you ulysses-you deleted the eagerly-kill-executors branch March 24, 2023 10:25
@ulysses-you ulysses-you added this to the v1.8.0 milestone Mar 24, 2023
@ulysses-you ulysses-you self-assigned this Mar 24, 2023