Conversation

@squito
Contributor

@squito squito commented Mar 29, 2019

What changes were proposed in this pull request?

When you submit a stage on a large cluster, rack resolution takes a long time while initializing the TaskSetManager, because a script is invoked to resolve the rack of each host, one by one.
With the current implementation, it takes 30~40 seconds to resolve the racks on our 5000-node cluster. After applying the patch, it dropped to less than 15 seconds.

YARN-9332 added an interface that handles multiple hosts in one invocation to save time, but until Spark can depend on a new enough Hadoop, we can build the same facility in Spark to fix this issue.
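
For illustration, here is a minimal sketch of the batching idea, assuming Hadoop's standard ScriptBasedMapping; BatchRackResolver and resolveRacks are hypothetical names, not the PR's actual SparkRackResolver:

import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.net.ScriptBasedMapping

// ScriptBasedMapping extends CachedDNSToSwitchMapping, so hosts resolved
// earlier are answered from the cache without re-running the topology script.
class BatchRackResolver(conf: Configuration) {
  private val mapping = new ScriptBasedMapping(conf)

  // Resolve a whole batch of hosts at once; the topology script receives many
  // hostnames per call instead of being forked once per host.
  def resolveRacks(hosts: Seq[String]): Seq[Option[String]] =
    mapping.resolve(hosts.asJava).asScala.map(Option(_))
}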

How was this patch tested?

Unit tests, plus manual testing on a 5000-node cluster.

@squito
Contributor Author

squito commented Mar 29, 2019

This is an update of #23951, to address the last couple of review items. All credit still to @LantaoJin. Note that I also removed the skipRackResolution option entirely, rather than moving it to a new conf, because I think this should improve the time enough to make that unnecessary. But I haven't tested on a 5k node cluster :)

Contributor

@vanzin vanzin left a comment

Looks ok but there's a conflict...

require(registered, "Must register AM before creating allocator.")
new YarnAllocator(driverUrl, driverRef, conf, sparkConf, amClient, appAttemptId, securityMgr,
-  localResources, new SparkRackResolver())
+  localResources, new SparkRackResolver(conf))
Contributor

Do you want the shared instance here instead?

Contributor Author

Oh, good point, I do want the shared instance, to use a shared CachedDnsToSwitchMapping. In cluster mode that will make sure we reuse the cache.
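
Something like this, roughly (a sketch of the shared-instance pattern, not necessarily the exact committed code):

import org.apache.hadoop.conf.Configuration

object SparkRackResolver {
  @volatile private var instance: SparkRackResolver = _

  // Returns the shared resolver; conf is only used the first time, when the
  // instance (and its underlying DNS-to-switch cache) is created.
  def get(conf: Configuration): SparkRackResolver = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = new SparkRackResolver(conf)
        }
      }
    }
    instance
  }
}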

}
override val defaultRackValue: Option[String] = Some(NetworkTopology.DEFAULT_RACK)

private[spark] val resolver = new SparkRackResolver(sc.hadoopConfiguration)
Contributor

Use the shared instance?

Or, if not using the shared instance, then the SparkRackResolver object can go away.

@SparkQA

SparkQA commented Mar 30, 2019

Test build #104096 has finished for PR 24245 at commit fa7daa4.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@heary-cao
Contributor

retest this please

Contributor

@attilapiros attilapiros left a comment

I have found only some really minor things.


// By default, rack is unknown
-def getRackForHost(value: String): Option[String] = None
+def getRackForHost(hosts: String): Option[String] = {
Contributor

Nit: rename hosts to host


test("SPARK-13704 Rack Resolution is done with a batch of de-duped hosts") {
val conf = new SparkConf()
.set(config.LOCALITY_WAIT.key, "0")
Contributor

Nit: .key is not needed; that way, setting the config will be type safe.
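
For example (a sketch; LOCALITY_WAIT is a time entry holding a Long, in milliseconds):

val conf = new SparkConf()
  .set(config.LOCALITY_WAIT, 0L)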

taskSetManagerSpy.handleFailedTask(taskDesc.get.taskId, TaskState.FAILED, e)

-verify(taskSetManagerSpy, times(1)).addPendingTask(anyInt())
+verify(taskSetManagerSpy, times(1)).addPendingTask(anyInt(), anyBoolean())
Contributor

We can match here for exact arguments:

verify(taskSetManagerSpy, times(1))
      .addPendingTask(argEq(0), argEq(false))

Assuming the import:

import org.mockito.ArgumentMatchers.{any, anyBoolean, anyInt, anyString, eq => argEq}

Contributor Author

Good point. While I like your naming more, the standard we've used in Spark is to rename it to meq, e.g. https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/ExecutorAllocationManagerSuite.scala#L22
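
With that convention, the import suggested above becomes:

import org.mockito.ArgumentMatchers.{any, anyBoolean, anyInt, anyString, eq => meq}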

.isDefined)
}
}
assert(FakeRackUtil.numBatchInvocation === 1)
Contributor

As addPendingTasks() is called during the construction of the TaskSetManager instance (and that is the only place where multiple hosts can be passed to getRacksForHosts()), I did not get why numBatchInvocation === 1 is so important that it is emphasised by this assert.

I assume we are guarding against potential future code that might call getRacksForHosts() again (since it is an expensive call), am I right?

val taskSet = FakeTask.createTaskSet(100, locations: _*)
val clock = new ManualClock
// make sure we only do one rack resolution call, for the entire batch of hosts, as this
// can be expensive. the FakeTaskScheduler calls rack resolution more than the real one
Contributor

Nit: "expensive. the " => "expensive. The "

@SparkQA

SparkQA commented Apr 2, 2019

Test build #104170 has finished for PR 24245 at commit e598984.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 2, 2019

Test build #104172 has finished for PR 24245 at commit f5efc74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* It will return the static resolver instance. If there is already an instance, the passed
* conf is entirely ignored. If there is not a shared instance, it will create one with the
* given conf.
Contributor

Should we explain how to instantiate a separate resolver with a separate config here?

Instantiate a separate resolver with a separate config by new SparkRackResolver(conf)

Contributor Author

It's kinda obvious, no?

Contributor

@vanzin vanzin left a comment

Looks good aside from some minor comments. The comments could use some grammar updates too, but not a big deal since you didn't write them.

}
// Resolve the rack for each host. This can be slow, so de-dupe the list of hosts,
// and assign the rack to all relevant task indices.
val racks = sched.getRacksForHosts(pendingTasksForHost.keySet.toSeq)
Contributor

There's an implicit assumption here that map.keySet and map.values iterate in the same order. I'm not sure if that's guaranteed, and at the same time I don't see why that wouldn't be the case, but just wanted to point this out.

Contributor

I had the exact same thought when I reached that line.
I even thought about possible solutions:

  • Creating a new val from racks.entrySet and generating the keys and values from that entry set (as the key and value are bound together within an entry, the ordering is fixed; both the keys and the values can be generated in one iteration).
  • Another, more elegant solution is calling racks.asScala.unzip.

Both solutions have some performance cost.

Contributor

Correction to the last sentence: the first solution can be done without a performance cost, but the code will probably be a bit less elegant.

Contributor Author

Great point, I have updated this to use unzip.
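
Roughly like this (a sketch, assuming pendingTasksForHost maps each host to its pending task indices; not necessarily the exact committed code):

// One traversal yields the hosts and their task-index lists in matching
// order, so the keys and values cannot drift apart.
val (hosts, indicesForHosts) = pendingTasksForHost.toSeq.unzip
val racks = sched.getRacksForHosts(hosts)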

taskSetManagerSpy.handleFailedTask(taskDesc.get.taskId, TaskState.FAILED, e)

-verify(taskSetManagerSpy, times(1)).addPendingTask(anyInt())
+verify(taskSetManagerSpy, times(1)).addPendingTask(meq(0), meq(false))
Contributor

You can actually omit the meq in this case.
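
That is (a sketch): Mockito accepts plain values as long as no matcher appears anywhere in the call, since matchers and raw values cannot be mixed in a single verification.

verify(taskSetManagerSpy, times(1)).addPendingTask(0, false)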

* It will cache the rack for individual hosts to avoid
* repeatedly performing the same expensive lookup.
*
* Its logic follows [[org.apache.hadoop.yarn.util.RackResolver]], with enhancements.
Contributor

This paragraph actually refers to the code in the class, now, not the object anymore.

@SparkQA

SparkQA commented Apr 2, 2019

Test build #104218 has finished for PR 24245 at commit cd97b62.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Contributor

retest this please

Contributor Author

@squito squito left a comment

Thanks @attilapiros @vanzin. I also went through the comments and updated them a bit (I just deleted a few that I thought weren't really adding anything).

@attilapiros
Contributor

I have checked the new changes (the last commit) and it looks good to me.
I even ran the new unit test, as I was a little worried about the matcher-less mockito verify line, but it works just fine.

So pending tests, otherwise LGTM.

@SparkQA

SparkQA commented Apr 3, 2019

Test build #104257 has finished for PR 24245 at commit ad63e15.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 4, 2019

Test build #4683 has started for PR 24245 at commit ad63e15.

@LantaoJin
Contributor

retest this please

@SparkQA

SparkQA commented Apr 4, 2019

Test build #4688 has finished for PR 24245 at commit ad63e15.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 4, 2019

Test build #4689 has started for PR 24245 at commit ad63e15.

@squito
Contributor Author

squito commented Apr 4, 2019

The PySpark failures are very strange; I'm not sure if it's an infra issue or something else. I filed https://issues.apache.org/jira/browse/SPARK-27389

@vanzin
Contributor

vanzin commented Apr 4, 2019

Looks fine. I still think that getRackForHosts could just return a Seq[String] and avoid using Option at all - the code in SparkRackResolver doesn't seem to ever return a null location, but not worth delaying this patch more.

@LantaoJin
Contributor

> Looks fine. I still think that getRackForHosts could just return a Seq[String] and avoid using Option at all - the code in SparkRackResolver doesn't seem to ever return a null location, but not worth delaying this patch more.

That's because if getRacksForHosts returned Seq[String], then getRackForHost would return a String, and the default value for a rack would be difficult to set. None is the best choice.

@LantaoJin
Contributor

This updated PR LGTM now.

@vanzin
Contributor

vanzin commented Apr 5, 2019

> That's because if getRacksForHosts returned Seq[String], then getRackForHost would return a String.

If only you could write code to wrap a String in an Option...
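
For example (a hypothetical shape, assuming the batch API returned a plain Seq[String] with no null entries):

// Single-host convenience wrapper over an Option-free batch API.
def getRackForHost(host: String): Option[String] =
  Option(getRacksForHosts(Seq(host)).head)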

@SparkQA

SparkQA commented Apr 5, 2019

Test build #4693 has finished for PR 24245 at commit ad63e15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@xuanyuanking xuanyuanking left a comment

I want to give a +1 for this; it seems like a great performance optimization on large clusters.

@squito
Contributor Author

squito commented Apr 8, 2019

Merged to master. Thanks @LantaoJin!
