[SPARK-27024] Executor interface for cluster managers to support GPU and other resources #24406
Conversation
Test build #104710 has finished for PR 24406 at commit

Test failure looks unrelated, going to kick again.

Jenkins, test this please

Test build #104715 has finished for PR 24406 at commit

@squito @srowen @mengxr @jiangxb1987 if anyone has time
srowen
left a comment
Just some more style comments to start, plus a question about units.
Thanks for the comments; I'm at Spark Summit so will likely update next week.

Test build #105035 has finished for PR 24406 at commit

Test build #105056 has finished for PR 24406 at commit

Test build #105057 has finished for PR 24406 at commit
jiangxb1987
left a comment
Looks very good, only some nits.
To clarify, will the driver-side resource discovery be added in a separate follow-up PR?

The driver resource discovery will be done in a separate PR.
```scala
// check that script exists and try to execute
if (scriptFile.exists()) {
  try {
    val output = executeAndGetOutput(Seq(script.get), new File("."))
```
In standalone mode, the worker needs to run the script provided by users. Could you leave a TODO inline?
```scala
// requirements (spark.task.resource.*) and that they match the executor configs
// specified by the user (spark.executor.resource.*) to catch mismatches between what
// the user requested and what resource manager gave or what the discovery script found.
private def checkExecResourcesMeetTaskRequirements(
```
We can have a general function that works for both driver and executor. Essentially, the request is resourcePrefix -> count and the resource is resourcePrefix -> addresses. If they don't match, we just include resourcePrefix in the error message, which makes it clear to users whether it is the driver or the executor.
I don't completely follow what you are asking for here. We are comparing 3 things:
spark.task.resource.* -> count with spark.executor.resource.* -> count, and both with the actual resources found by the script or passed in, which is a Map[resourceName, ResourceInformation]. You can't have the resource prefix on the type or it won't compare properly to the Map[resourceName].
I can certainly make it more generic to handle both executor and driver, and I made some code changes to go that way, but I would prefer to wait until the JIRA that implements the driver side to finish that, to make sure we don't need anything else. This function will likely have to move somewhere anyway.
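As a rough illustration of the three-way comparison discussed here (every name below is a stand-in, not the PR's actual classes or signatures), assuming counts come from the task and executor configs and addresses come from discovery:

```scala
// Illustrative only: stand-ins for Spark's real ResourceInformation class and
// the spark.task.resource.* / spark.executor.resource.* config maps.
case class ResourceInformation(name: String, addresses: Seq[String])

// Compare the three sources per resource name: task requests, executor
// requests, and what discovery actually found (or what was passed in).
def resourceMismatches(
    taskReqs: Map[String, Int],
    execReqs: Map[String, Int],
    actual: Map[String, ResourceInformation]): Seq[String] = {
  taskReqs.keys.toSeq.flatMap { name =>
    val taskCount = taskReqs(name)
    val execCount = execReqs.getOrElse(name, 0)
    val found = actual.get(name).map(_.addresses.size).getOrElse(0)
    if (execCount < taskCount) {
      Some(s"executor config for $name gives $execCount, task needs $taskCount")
    } else if (found < execCount) {
      Some(s"discovery found $found addresses for $name, executor config needs $execCount")
    } else {
      None
    }
  }
}
```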
squito
left a comment
I just took a brief look, but it seems reasonable.
If the script fails, the user would have to get the logs of this executor to figure out what went wrong, right? That's OK, I guess; just checking on the error behavior.
Yes, if the script fails the error only shows up in the executor logs.

@mengxr I'm not sure if it's a GitHub issue or what, but your comment on standalone wouldn't let me reply from this page; I had to go into the changes page, so please go there to see my response. I can add something there, but I'm not exactly sure what you were asking for.
Upmerging to the latest to fix the merge conflict.

Test build #105323 has finished for PR 24406 at commit

Test build #105322 has finished for PR 24406 at commit

Ok to test

Test this please

Jenkins, test this please

Test build #105330 has finished for PR 24406 at commit
```scala
// check that script exists and try to execute
if (scriptFile.exists()) {
  try {
    val output = executeAndGetOutput(Seq(script.get), new File("."))
```
Never mind; I was thinking about whether users might be able to make the manager/executor run arbitrary scripts.
```scala
test("Resource discoverer multiple gpus") {
  val sparkconf = new SparkConf
  assume(!(Utils.isWindows))
```
I don't know what happens if a user tries to run the test on Windows. Does it fail or silently skip? I don't have a Windows machine to verify, but if other tests are doing this, we might just follow.
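For reference, ScalaTest's `assume` throws `TestCanceledException` when its condition is false, so the test is reported as canceled (skipped) rather than failed. A minimal self-contained model of that behavior (this is an illustration, not ScalaTest's real implementation):

```scala
// Model of ScalaTest's assume() semantics: a false assumption cancels the
// test instead of failing it.
class TestCanceledException(msg: String) extends Exception(msg)

def assumeOrCancel(condition: Boolean, clue: String = ""): Unit =
  if (!condition) throw new TestCanceledException(clue)

def runTest(body: => Unit): String =
  try { body; "passed" }
  catch {
    case _: TestCanceledException => "canceled" // reported as skipped, not failed
    case _: Throwable             => "failed"
  }
```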
LGTM. cc: @srowen @squito @jiangxb1987
Test build #105363 has finished for PR 24406 at commit
jiangxb1987
left a comment
LGTM
```scala
// requirements and that they match the configs specified by the user to catch
// mismatches between what the user requested and what resource manager gave or
// what the discovery script found.
private def checkResourcesMeetRequirements(
```
Had an offline discussion with @WeichenXu123. He suggested refactoring this check to make it easier to read. Now the arguments are:
- reqResourcesAndCounts: request per task (not per executor)
- actualResources: resources allocated per executor

That is not easy to tell from the variable names, which makes the code harder to read. Basically we need the following:
- the number allocated per executor cannot be smaller than the requested count for each resource name
- the requested count for the executor cannot be smaller than the requested count for the task for each resource name (note that this doesn't require resource discovery)
- the set of requested resource names for executors should match the set of requested resource names for tasks
It would be nice to refactor the method into those three. We can also do it in a follow-up PR.
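A sketch of what that three-way split might look like (all names and signatures below are hypothetical, not the PR's code):

```scala
// Shared helper: report each resource whose available count falls short.
def countShortfalls(have: Map[String, Int], need: Map[String, Int], what: String): Seq[String] =
  need.toSeq.collect { case (name, n) if have.getOrElse(name, 0) < n =>
    s"$what only has ${have.getOrElse(name, 0)} of resource $name, needs $n"
  }

// 1) addresses allocated per executor must cover the executor request
def checkAllocated(allocated: Map[String, Int], execReqs: Map[String, Int]): Seq[String] =
  countShortfalls(allocated, execReqs, "executor allocation")

// 2) the executor request must cover the per-task request (no discovery involved)
def checkExecVsTask(execReqs: Map[String, Int], taskReqs: Map[String, Int]): Seq[String] =
  countShortfalls(execReqs, taskReqs, "executor config")

// 3) the requested resource-name sets must match
def checkNames(execNames: Set[String], taskNames: Set[String]): Seq[String] =
  if (execNames == taskNames) Nil
  else Seq(s"executor resources $execNames do not match task resources $taskNames")
```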
I added a comment to the driver jira since it needs to be refactored for that anyway:
https://issues.apache.org/jira/browse/SPARK-27488
Test build #105372 has finished for PR 24406 at commit
I merged this, thanks for the reviews.
What changes were proposed in this pull request?
Add GPU and generic resource type allocation to the executors.
Note this is part of a bigger feature for GPU-aware scheduling; this PR covers just how the executor finds the resources.
In this PR I added configs and arguments to the executor so it can discover resources. The argument to the executor is intended to be used by standalone mode or other cluster managers that don't have isolation, so that specific resources can be assigned to specific executors when there are multiple executors on a node. The argument is a file containing a JSON array of ResourceInformation objects.
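Purely as an illustration (the exact schema is defined by the ResourceInformation class in this PR and may carry additional fields), such a resources file could look like:

```json
[
  {"name": "gpu", "addresses": ["0", "1"]}
]
```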
The discovery script is meant to be used in an isolated environment where the executor only sees the resources it should use.
Note that there will be follow-on PRs to add other parts, like the scheduler part. See the epic high-level JIRA: https://issues.apache.org/jira/browse/SPARK-24615
How was this patch tested?
Added unit tests and manually tested.
Please review http://spark.apache.org/contributing.html before opening a pull request.