
Conversation

@cloud-fan
Contributor

@cloud-fan cloud-fan commented Mar 22, 2016

What changes were proposed in this pull request?

Track executor information like host and port, cache size, running tasks.

How was this patch tested?

manual test

@cloud-fan
Contributor Author

cc @rxin

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53775 has finished for PR 11888 at commit f76de44.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan changed the title [SPARK-14069][SQL][WIP] Improve SparkStatusTracker to also track executor information [SPARK-14069][SQL] Improve SparkStatusTracker to also track executor information Mar 23, 2016
@cloud-fan
Contributor Author

I found it's difficult to write tests for this. Since it just collects information that is already exposed by the system, is it worth testing again? cc @rxin

@rxin
Contributor

rxin commented Mar 23, 2016

It's probably ok.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53884 has finished for PR 11888 at commit fe80390.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53904 has finished for PR 11888 at commit fe80390.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53921 has finished for PR 11888 at commit fe80390.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53925 has finished for PR 11888 at commit fe80390.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * Returns a list of all known executors, represented as strings in the format "host:port".
 */
def getExecutors(): Array[String] = {
Contributor
It seems kind of arbitrary that getExecutors returns host:port but not IDs. I think it's better to make a SparkExecutorInfo or something and expose the host:port there, along with other things like cache size, numRunningTasks, etc. Then in the future we can add more things we want to expose without tying ourselves to the host:port identifier.

Contributor
If I just want a list of executors, why shouldn't I be able to get them? I think it makes sense to have a more detailed version (maybe replacing the following 2), but having a simple one that returns just the list of executors seems to make sense too.

Contributor
The other thing is I don't know if we want to query the scheduler every time we want a list of executors.

Contributor
we can expose both, but I'd rather call it something more explicit like getExecutorHostPort or something. Elsewhere in Spark I would think getExecutors: Array[String] returns the executor IDs.

Contributor
getExecutorList

Contributor Author
getExecutorList LGTM, I'll rename it to that.

Contributor
wait, how is getExecutorList different from getExecutors? Why not just be more specific what the strings are?

Contributor
actually let me move this to the main thread so it doesn't get collapsed.

@SparkQA

SparkQA commented Mar 30, 2016

Test build #54526 has finished for PR 11888 at commit 0b9400e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

@rxin I don't understand the distinction between getExecutors and getExecutorList. Why not just be more explicit about what the strings represent?

The other problem is the keys to the other maps are also expected to be host:port but I would normally expect them to be executor IDs. Also host can be confusing because it can be in one of many different formats (e.g. IP addr, all the hostname formats in EC2). We already expose the executor IDs in a few places (e.g. UI, autoscaling API) so the user already knows what they are.

@andrewor14
Contributor

My proposal:

def getExecutorIds(): Array[String]
def getExecutorInfo(executorId: String): Option[SparkExecutorInfo]
// If you want we can also expose the following:
def getExecutorHostPorts(): Array[String]

private class SparkExecutorInfoImpl(
    val id: String,
    val hostport: String, // host:port
    ...
    val cacheSize: Long,
    val numRunningTasks: Int)
 extends SparkExecutorInfo

This is more consistent with the existing status API, where we have things like

def getActiveJobIds(): Array[Int]
def getJobInfo(jobId: Int): Option[SparkJobInfo]

private class SparkJobInfoImpl(
    val jobId: Int,
    ...)
  extends SparkJobInfo

@rxin
Contributor

rxin commented Mar 30, 2016

What are executor ids? is that even an external concept?

@rxin
Contributor

rxin commented Mar 30, 2016

Can you paste me what an executor looks like? If you just tell me "executor id" as an end user, I have no clue what you are talking about.

@andrewor14
Contributor

[three screenshots of the Spark UI, taken 2016-03-30]

scala> df.count()
16/03/30 15:21:58 INFO SparkContext: Starting job: count at <console>:27
...
16/03/30 15:21:58 INFO SparkDeploySchedulerBackend: Launching task 7 on executor id: 3 hostname: 192.168.0.209.
16/03/30 15:21:58 INFO SparkDeploySchedulerBackend: Launching task 8 on executor id: 0 hostname: 192.168.0.209.
16/03/30 15:21:58 INFO SparkDeploySchedulerBackend: Launching task 9 on executor id: 5 hostname: 192.168.0.209.
16/03/30 15:21:58 INFO SparkDeploySchedulerBackend: Launching task 10 on executor id: 1 hostname: 192.168.0.209.
16/03/30 15:21:58 INFO SparkDeploySchedulerBackend: Launching task 11 on executor id: 4 hostname: 192.168.0.209.
16/03/30 15:21:58 INFO SparkDeploySchedulerBackend: Launching task 12 on executor id: 2 hostname: 192.168.0.209.

@rxin
Contributor

rxin commented Mar 30, 2016

Yea that integer id is completely useless to users who want to figure out what to do with their clusters.

@andrewor14
Contributor

OK, @rxin and I discussed this more offline. Our proposal is:

def getExecutorInfos: Seq[SparkExecutorInfo]

private class SparkExecutorInfoImpl(
    val host: String,
    val port: Int,
    ...
    val cacheSize: Long,
    val numRunningTasks: Int)
 extends SparkExecutorInfo

Then we don't tie ourselves down to the very specific host:port format, and we don't have to expose the executor IDs, which are just integers that don't mean much.
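A minimal, self-contained sketch of how this proposal could look from a caller's perspective (the trait, the stand-in tracker, and the sample data below are illustrations mirroring the proposal above, not Spark's actual implementation):

```scala
// Public, read-only view of an executor, as in the proposal above.
trait SparkExecutorInfo {
  def host: String
  def port: Int
  def cacheSize: Long
  def numRunningTasks: Int
}

// Implementation class mirroring SparkExecutorInfoImpl from the proposal.
class SparkExecutorInfoImpl(
    val host: String,
    val port: Int,
    val cacheSize: Long,
    val numRunningTasks: Int)
  extends SparkExecutorInfo

// Stand-in status tracker exposing getExecutorInfos.
class StatusTracker(infos: Seq[SparkExecutorInfo]) {
  def getExecutorInfos: Seq[SparkExecutorInfo] = infos
}

object Demo {
  def main(args: Array[String]): Unit = {
    val tracker = new StatusTracker(Seq(
      new SparkExecutorInfoImpl("192.168.0.209", 51234, 1024L, 3)))
    // Callers work with structured fields instead of parsing "host:port" strings.
    for (e <- tracker.getExecutorInfos) {
      println(s"${e.host}:${e.port} cache=${e.cacheSize} tasks=${e.numRunningTasks}")
    }
  }
}
```

Because the host and port are separate fields, adding cacheSize or numRunningTasks later does not change any existing caller's code.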

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54577 has finished for PR 11888 at commit 39dd0ee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

LGTM retest this please

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54588 has finished for PR 11888 at commit 39dd0ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Merged into master, thanks guys.

@asfgit asfgit closed this in 0abee53 Mar 31, 2016
@JoshRosen
Contributor

This looks somewhat dodgy to me from a thread-safety perspective: executorIdToTaskCount isn't thread-safe, so we may iterate over it in SparkStatusTracker while it is concurrently updated during task-update processing. I'm going to fix this in a followup, but wanted to point it out here since ideally this should not have slipped past code review.

// Number of tasks running on each executor
private val executorIdToTaskCount = new HashMap[String, Int]

def runningTasksByExecutors(): Map[String, Int] = executorIdToTaskCount.toMap
Contributor
Adding a synchronized here would resolve the thread-safety issue, I think. I'll do this as part of a patch fixing another bug and also touching this line.
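A minimal, self-contained sketch of the synchronized fix described above (plain Scala; the class and method names here are stand-ins mirroring the quoted snippet, not Spark's actual scheduler code):

```scala
import scala.collection.mutable.HashMap

// Stand-in for the scheduler state in the snippet under review.
class TaskCounts {
  // Number of tasks running on each executor (mutable HashMap is not
  // thread-safe on its own).
  private val executorIdToTaskCount = new HashMap[String, Int]

  // Writers (task updates) synchronize on the map...
  def taskStarted(execId: String): Unit = executorIdToTaskCount.synchronized {
    executorIdToTaskCount(execId) = executorIdToTaskCount.getOrElse(execId, 0) + 1
  }

  // ...and so does the reader, which takes an immutable snapshot under the
  // lock, so iteration never races with a concurrent update.
  def runningTasksByExecutors(): Map[String, Int] = executorIdToTaskCount.synchronized {
    executorIdToTaskCount.toMap
  }
}
```

Returning the snapshot as an immutable Map means callers can iterate freely without holding any lock.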

@rxin
Contributor

rxin commented Nov 23, 2016

Why did we merge this when the description says "N/A"?

@markhamstra
Contributor

@rxin Do you mean the N/A in "How was this patch tested?" Some guy said that the lack of tests was ok. #11888 (comment)

@rxin
Contributor

rxin commented Nov 23, 2016

Yea but "TODO: tests" and tests: N/A ...

We needed to update the description.
