
Conversation

@wangyum (Member) commented Feb 6, 2017

What changes were proposed in this pull request?

Dynamically set spark.dynamicAllocation.maxExecutors based on cluster resources.
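
For reference, a minimal sketch of the idea (not the exact patch; sparkConf and yarnClient are assumed to be already initialized, and the derived value is only a cap, not a guarantee of resources):

import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.api.records.NodeState

// Derive an upper bound for spark.dynamicAllocation.maxExecutors from the
// vcores of the currently RUNNING NodeManagers.
val executorCores = sparkConf.getInt("spark.executor.cores", 1)
val clusterCores = yarnClient.getNodeReports(NodeState.RUNNING).asScala
  .map(_.getCapability.getVirtualCores)
  .sum
val maxNumExecutors = math.max(clusterCores / executorCores, 1)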

How was this patch tested?

Manual test and unit test.

@srowen (Member) commented Feb 6, 2017

I don't think this is a necessary change. Already, you can't ask for more resources than the cluster has; the cluster won't grant them. Capping it here means the app can't use more resources if the cluster suddenly gets more.

I see the problem you're trying to solve, but the resource manager already ramps up requests slowly, so I don't think this is the issue.

@SparkQA commented Feb 6, 2017

Test build #72434 has finished for PR 16819 at commit 97e5eee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 (Contributor)

I agree. Resource managers already expect applications to request more than what's available, so we don't have to do it again ourselves in Spark.

@wangyum (Member, Author) commented Feb 7, 2017

It will reduce the number of calls to CoarseGrainedSchedulerBackend.requestTotalExecutors() after applying this PR:
(screenshots: before vs. after applying this PR)

Full log can be found here.
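
To illustrate why a lower cap cuts down on those calls, here is a simplified sketch; it is not the actual ExecutorAllocationManager code, just the shape of the capping behavior:

// Simplified sketch: once the target reaches the cap, further "add" rounds
// return 0 and skip requestTotalExecutors instead of ramping the target up.
val maxNumExecutors = 94              // e.g. derived from cluster resources
var numExecutorsTarget = 1
def addExecutors(maxNeeded: Int): Int = {
  val newTarget = math.min(maxNeeded, maxNumExecutors)
  if (newTarget == numExecutorsTarget) {
    0                                 // "Not adding executors ... (limit 94)"
  } else {
    val delta = newTarget - numExecutorsTarget
    numExecutorsTarget = newTarget
    delta                             // this is when requestTotalExecutors fires
  }
}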

@srowen (Member) commented Feb 7, 2017

What problem does this solve though? Calling that function is not a problem. It seems like you get the right behavior in both cases. Are you saying there's some RPC problem? The target goes very high, but, as far as I can see, it's correctly reflecting the fact that the app would use a lot of executors if it could -- that's fine.

@wangyum (Member, Author) commented Feb 20, 2017

@srowen Dynamically setting spark.dynamicAllocation.maxExecutors can avoid some strange problems:

  1. Spark application hang when dynamic allocation is enabled
  2. Report failure reason from Reporter Thread
  3. CLI shows success but the web UI didn't, similar to this

I added a unit test just now.

@SparkQA commented Feb 20, 2017

Test build #73147 has finished for PR 16819 at commit 4f81680.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 20, 2017

Test build #73151 has finished for PR 16819 at commit 8e99701.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val defaultMaxNumExecutors = DYN_ALLOCATION_MAX_EXECUTORS.defaultValue.get
if (defaultMaxNumExecutors == sparkConf.get(DYN_ALLOCATION_MAX_EXECUTORS)) {
  val executorCores = sparkConf.getInt("spark.executor.cores", 1)
  val maxNumExecutors = yarnClient.getNodeReports().asScala.
A Contributor commented on this diff:

Shouldn't we take queue's maxResources amount into account from ResourceManager REST APIs?
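
For context, the REST route referred to here would be the ResourceManager scheduler endpoint; a minimal sketch of fetching it (the RM address is a placeholder, and the returned JSON still has to be parsed for the per-queue limits such as maxResources):

import scala.io.Source

// Query the ResourceManager scheduler REST API; the response describes the
// queue hierarchy and its resource limits.
val rmWebAppAddress = "http://resourcemanager-host:8088"  // placeholder address
val schedulerJson = Source.fromURL(s"$rmWebAppAddress/ws/v1/cluster/scheduler").mkString
println(schedulerJson)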

@wangyum (Member, Author) replied:

Good suggestion. I will try the API first. Pseudo code:

import scala.collection.JavaConverters._

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

val yarnConf = new YarnConfiguration()
val yarnClient = YarnClient.createYarnClient
yarnClient.init(yarnConf)
yarnClient.start()
// Root queues expose their configured capacity limits, which could serve as the cap.
val rootQueues = yarnClient.getRootQueueInfos.asScala
rootQueues.foreach(q => println(s"${q.getQueueName}: maxCapacity=${q.getMaximumCapacity}"))

@SparkQA commented Feb 22, 2017

Test build #73277 has finished for PR 16819 at commit fabe2c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 22, 2017

Test build #73282 has finished for PR 16819 at commit cd306e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Feb 22, 2017

I agree there's room for improvement in the current code; I even asked for SPARK-18769 to be filed to track that work.

But I don't think setting the max to a fixed value at startup is the right approach. Queue configs change, node managers go up and down, new ones are added, old ones are removed. If this value ends up being calculated at the wrong time, the application will suffer. If you want to investigate a more dynamic approach here I'm all for that, but I'm not a big fan of the current solution.

@wangyum (Member, Author) commented Feb 23, 2017

@vanzin We must pull the configuration from the ResourceManager; the ResourceManager can't push it. So should we set the max before each stage? That feels too frequent.

In fact, this approach is suitable for periodic tasks, e.g. ML and SQL jobs. For streaming jobs, it is better to set it manually.

@vanzin (Contributor) commented Feb 23, 2017

Getting the config only at the beginning, to me, is not an acceptable solution.

Getting it every once in a while is better, but it's not the only possible approach. I even suggested something different in the bug I mentioned above.

@SparkQA commented Feb 27, 2017

Test build #73515 has finished for PR 16819 at commit e4b3b0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)

I agree with others, this is not the way to do this. There are different schedulers in YARN, each with different configs that could affect the actual resources you get.

If you want to do something like this, it should look at the available resources after the allocate call to YARN (allocateResponse.getAvailableResources). When YARN responds, it tells you the available resources, which takes the various scheduler settings into account.

MapReduce refers to that as headroom and uses it to determine things like whether it needs to kill a reducer to run a map. We could use this to help with dynamic allocation and do more intelligent things.
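
A minimal sketch of that suggestion (illustrative names, not the actual YarnAllocator code; the response is assumed to come from the AMRMClient allocate call):

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse
import org.apache.hadoop.yarn.api.records.Resource

// Convert the RM-reported headroom into an executor count; the headroom
// already reflects queue limits and other scheduler constraints.
def headroomExecutors(response: AllocateResponse,
                      executorMemoryMb: Int,
                      executorCores: Int): Int = {
  val available: Resource = response.getAvailableResources
  val byMemory = available.getMemory / executorMemoryMb
  val byCores = available.getVirtualCores / executorCores
  math.min(byMemory, byCores)
}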

@wangyum (Member, Author) commented Feb 28, 2017

@vanzin What do you think about the current approach? I have tested it on the same Spark hive-thriftserver; spark.dynamicAllocation.maxExecutors will decrease if I kill 4 NodeManagers:

17/02/27 15:58:08 DEBUG ExecutorAllocationManager: Not adding executors because our current target total is already 94 (limit 94)
17/02/27 15:58:09 DEBUG ExecutorAllocationManager: Not adding executors because our current target total is already 94 (limit 94)
17/02/27 16:05:49 DEBUG ExecutorAllocationManager: Not adding executors because our current target total is already 85 (limit 85)
17/02/27 16:05:49 DEBUG ExecutorAllocationManager: Not adding executors because our current target total is already 85 (limit 85)

@vanzin (Contributor) commented Mar 2, 2017

So your current approach is to have a second connection to the RM, and ask for the RM's available resources every time the scheduler tries to change the number of resources.

Did you look at Tom's suggestion of using AllocateResponse.getAvailableResources() instead? Seems like it would be simpler, cheaper, and could all be handled internally in YarnAllocator.scala.

srowen added a commit to srowen/spark that referenced this pull request Mar 22, 2017
@srowen mentioned this pull request Mar 22, 2017
@asfgit closed this in b70c03a Mar 23, 2017