@Astralidea Astralidea commented Oct 21, 2016

The synchronization of driver and executors through the dummy job only ensures that 1 executor has connected to the driver.
In my cluster I need to ensure that each executor runs exactly one receiver.
Consider the following example:
If spark.cores.max=4 and spark.executor.cores=2, Spark launches 2 executor instances.
Spark's first job is the dummy job, which always runs 70 tasks; it takes about 4 seconds.
case 1:
If within these 4 seconds only one executor (E1) connects to the driver and the other (E2) does not:
executor 1 starts both receivers and cannot run tasks, because its 2 cores are fully used (my code creates 2 receiver streams).
executor 2 only runs tasks and no receiver.
Therefore the batches run slowly and there is network data transmission (about 3s).
case 2:
If within these 4 seconds both executors connect to the driver:
executor 1 starts 1 receiver, using 1 core, and can still run tasks.
executor 2 starts 1 receiver, using 1 core, and can still run tasks.
The scheduling is balanced and the batches run fast (about 0.1s).

So I hope I can set maxRegisteredWaitingTime to make sure that even with a slowly starting executor I get a better receiver policy, i.e. every executor runs one receiver.
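For illustration, a minimal sketch of the kind of setup described above (the class names, the custom receiver, and the configuration values are assumptions for this example, not the actual application):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver; stands in for whatever source the real cluster uses.
class DummySource extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  def onStart(): Unit = { /* start a thread that calls store(...) */ }
  def onStop(): Unit = {}
}

object TwoReceiverApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("TwoReceiverApp")
      .set("spark.cores.max", "4")      // 2 executors with 2 cores each
      .set("spark.executor.cores", "2")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Two receiver streams: ideally one receiver lands on each executor,
    // leaving one core per executor free for batch tasks.
    val stream1 = ssc.receiverStream(new DummySource)
    val stream2 = ssc.receiverStream(new DummySource)
    stream1.union(stream2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}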

@AmplabJenkins

Can one of the admins verify this patch?

srowen commented Oct 22, 2016

I don't think it's necessarily true that you want to wait for all receivers to begin processing. This change won't work in any event.

runDummySparkJob()
while ((System.currentTimeMillis() - createTime) < maxRegisteredWaitingTimeMs) {}

You can't spin on a condition like this; it'll waste CPU in millions of system calls. This also forces a delay of this waiting time, which is not OK.

@Astralidea Astralidea Oct 23, 2016

You're right. But I think it only wastes a little time, and it is better because it is configurable.
How can I write this code gracefully?
I would like to improve it but do not know how.
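A non-spinning alternative (a sketch only, not code from this PR; allExecutorsRegistered is a hypothetical predicate) would sleep between checks instead of busy-waiting, and would also stop early once the condition holds rather than always forcing the full delay:

// Sketch: wait up to maxRegisteredWaitingTimeMs without burning CPU.
// `createTime` and `maxRegisteredWaitingTimeMs` come from the patched code;
// `allExecutorsRegistered` is a hypothetical check for this example.
private def waitForExecutors(createTime: Long, maxRegisteredWaitingTimeMs: Long)
                            (allExecutorsRegistered: => Boolean): Unit = {
  val deadline = createTime + maxRegisteredWaitingTimeMs
  while (!allExecutorsRegistered && System.currentTimeMillis() < deadline) {
    Thread.sleep(50)  // yield the CPU between checks instead of spinning
  }
}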

Astralidea commented Oct 23, 2016

@srowen But in my cluster, before this change it failed every time (1 executor started both receivers and ran no tasks).
After the change I tested 10 times: 9 succeeded, 1 failed.
Why is waiting not necessary? Balanced receiver scheduling affects performance.
If a new executor registers with the driver late, the receivers will not be scheduled again. Is there any other solution?

@Astralidea Astralidea changed the title from "[SPARK-18039][Scheduler] fix bug maxRegisteredWaitingTime does not work" to "[SPARK-18039][Scheduler] fix bug maxRegisteredWaitingTime does not work for receiver scheduler" Oct 23, 2016
lw-lin commented Oct 23, 2016

Spark Streaming runs a very simple dummy job to ensure that all slaves have registered before receiver scheduling; please see https://github.com/apache/spark/blob/v2.0.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L436-L447.
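For reference, the linked method is roughly the following (paraphrased from the linked v2.0.0 source; the 50 map tasks plus 20 reduce tasks also explain the 70 tasks observed above):

private def runDummySparkJob(): Unit = {
  if (!ssc.sparkContext.isLocal) {
    // Forces a shuffle across the cluster so that registered executors
    // show up before receiver scheduling begins.
    ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
  }
  assert(getExecutors.nonEmpty)
}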

@Astralidea, spark.scheduler.minRegisteredResourcesRatio is the minimum ratio of registered resources to wait for before the dummy job begins. In our private clusters, configuring it to 0.9 or even 1.0 helps a lot to balance our 100+ receivers. Maybe you could also give it a try.
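For example (the values are illustrative only, not from this thread), the ratio can be combined with the registration timeout via SparkConf:

import org.apache.spark.SparkConf

// Illustrative values: wait for all executors before scheduling starts,
// but give up after 30s so a dead node cannot block the job forever.
val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "30s")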

@Astralidea

@lw-lin
Thanks for your reply. Running Spark in my private cluster is a little different (I start the driver and executors myself).
I have tried maxRegisteredWaitingTime, but I have not tried minRegisteredResourcesRatio.
I assumed minRegisteredResourcesRatio would not work if maxRegisteredWaitingTime does not, but maybe it does.
I will try spark.scheduler.minRegisteredResourcesRatio tomorrow.

Astralidea commented Oct 24, 2016

@lw-lin
spark.scheduler.minRegisteredResourcesRatio does not work for me.
The reason may be that I use Mesos coarse-grained mode, which launches executors not through the driver,
but I still need to make sure sufficient resources are registered.

@jerryshao

I think this fix cannot really handle the imbalanced receiver allocation problem, and it also blindly wastes CPU time.

What @lw-lin mentioned is a feasible solution for waiting until executors are registered; ReceiverSchedulingPolicy should also handle this problem well, but a strictly even distribution is hard to guarantee and very costly, especially when a cluster has intensive resource contention.

@Astralidea

@jerryshao I agree that the waiting wastes CPU time, and I have tested the solution @lw-lin mentioned; it does not work in my environment.
OK, if there is no better solution or advice, I will close this PR next week.

srowen added a commit to srowen/spark that referenced this pull request Oct 31, 2016
@asfgit asfgit closed this in 26b07f1 Oct 31, 2016
@Astralidea Astralidea deleted the SPARK-18039 branch October 31, 2016 11:17