@Astralidea Astralidea commented Oct 21, 2016

The synchronization of driver and executors through the dummy job only ensures that 1 executor has connected to the driver.
In my cluster I need to ensure that each executor runs exactly one receiver.
Consider the following example:
If spark.cores.max=4 and spark.executor.cores=2, Spark launches 2 executor instances.
Spark's first job is the dummy job, which always runs 70 tasks; it takes about 4 seconds.
case 1:
If within these 4 seconds only one executor (E1) connects to the driver and the other (E2) does not:
executor 1 starts both receivers and cannot run tasks, because its 2 cores are fully used (my code creates 2 receiver streams).
executor 2 only runs tasks and no receiver.
Therefore the batches run slowly and there is network data transmission (about 3s).
case 2:
If within these 4 seconds both executors connect to the driver:
executor 1 starts 1 receiver, using 1 core, and can still run tasks.
executor 2 starts 1 receiver, using 1 core, and can still run tasks.
The scheduling is balanced and the batches run fast (about 0.1s).

So I hope I can set maxRegisteredWaitingTime to make sure that even with a slowly starting executor I get a better receiver policy, i.e. every executor runs one receiver.
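For illustration, a minimal sketch of the kind of setup described above (the class names, the custom receiver, and the configuration values are assumptions for this example, not the actual application):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver; stands in for whatever source the real cluster uses.
class DummySource extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  def onStart(): Unit = { /* start a thread that calls store(...) */ }
  def onStop(): Unit = {}
}

object TwoReceiverApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("TwoReceiverApp")
      .set("spark.cores.max", "4")      // 2 executors with 2 cores each
      .set("spark.executor.cores", "2")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Two receiver streams: ideally one receiver lands on each executor,
    // leaving one core per executor free for batch tasks.
    val stream1 = ssc.receiverStream(new DummySource)
    val stream2 = ssc.receiverStream(new DummySource)
    stream1.union(stream2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}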

@AmplabJenkins

Can one of the admins verify this patch?

srowen commented Oct 22, 2016

I don't think it's necessarily true that you want to wait for all receivers to begin processing. This change won't work in any event.

runDummySparkJob()
while ((System.currentTimeMillis() - createTime) < maxRegisteredWaitingTimeMs) {}

You can't spin on a condition like this; it'll waste CPU in millions of system calls. This also forces a delay of this waiting time, which is not OK.

@Astralidea Astralidea Oct 23, 2016

You're right. But I think it only wastes a little time, and it is better because it is configurable.
How can I write this code gracefully?
I would like to improve it but do not know how.
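A non-spinning alternative (a sketch only, not code from this PR; allExecutorsRegistered is a hypothetical predicate) would sleep between checks instead of busy-waiting, and would also stop early once the condition holds rather than always forcing the full delay:

// Sketch: wait up to maxRegisteredWaitingTimeMs without burning CPU.
// `createTime` and `maxRegisteredWaitingTimeMs` come from the patched code;
// `allExecutorsRegistered` is a hypothetical check for this example.
private def waitForExecutors(createTime: Long, maxRegisteredWaitingTimeMs: Long)
                            (allExecutorsRegistered: => Boolean): Unit = {
  val deadline = createTime + maxRegisteredWaitingTimeMs
  while (!allExecutorsRegistered && System.currentTimeMillis() < deadline) {
    Thread.sleep(50)  // yield the CPU between checks instead of spinning
  }
}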

Astralidea commented Oct 23, 2016

@srowen But in my cluster, before this change it failed every time (1 executor started both receivers and ran no tasks).
After the change I tested 10 times: 9 succeeded, 1 failed.
Why is waiting not necessary? Balanced receiver scheduling affects performance.
If a new executor registers with the driver late, the receivers will not be scheduled again. Is there any other solution?

@Astralidea Astralidea changed the title from "[SPARK-18039][Scheduler] fix bug maxRegisteredWaitingTime does not work" to "[SPARK-18039][Scheduler] fix bug maxRegisteredWaitingTime does not work for receiver scheduler" Oct 23, 2016
lw-lin commented Oct 23, 2016

Spark Streaming runs a very simple dummy job to ensure that all slaves have registered before receiver scheduling; please see https://github.com/apache/spark/blob/v2.0.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L436-L447.
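For reference, the linked method is roughly the following (paraphrased from the linked v2.0.0 source; the 50 map tasks plus 20 reduce tasks also explain the 70 tasks observed above):

private def runDummySparkJob(): Unit = {
  if (!ssc.sparkContext.isLocal) {
    // Forces a shuffle across the cluster so that registered executors
    // show up before receiver scheduling begins.
    ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
  }
  assert(getExecutors.nonEmpty)
}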

@Astralidea, spark.scheduler.minRegisteredResourcesRatio is the minimum ratio of registered resources to wait for before the dummy job begins. In our private clusters, configuring it to 0.9 or even 1.0 helps a lot to balance our 100+ receivers. Maybe you could also give it a try.
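For example (the values are illustrative only, not from this thread), the ratio can be combined with the registration timeout via SparkConf:

import org.apache.spark.SparkConf

// Illustrative values: wait for all executors before scheduling starts,
// but give up after 30s so a dead node cannot block the job forever.
val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "30s")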

@Astralidea

@lw-lin
Thanks for your reply. Running Spark in my private cluster is a little different (I start the driver and executors myself).
I have tried maxRegisteredWaitingTime, but I have not tried minRegisteredResourcesRatio.
I assumed minRegisteredResourcesRatio would not work if maxRegisteredWaitingTime does not, but maybe it does.
I will try spark.scheduler.minRegisteredResourcesRatio tomorrow.

Astralidea commented Oct 24, 2016

@lw-lin
spark.scheduler.minRegisteredResourcesRatio does not work for me.
The reason may be that I use Mesos coarse-grained mode, which launches executors not through the driver,
but I still need to make sure sufficient resources are registered.

@jerryshao

I think this fix cannot really handle the imbalanced receiver allocation problem, and it also blindly wastes CPU time.

What @lw-lin mentioned is a feasible solution for waiting until executors are registered; ReceiverSchedulingPolicy should also handle this problem well, but a strictly even distribution is hard to guarantee and very costly, especially when a cluster has intensive resource contention.

@Astralidea

@jerryshao I agree that the waiting wastes CPU time, and I have tested the solution @lw-lin mentioned; it does not work in my environment.
OK, if there is no better solution or advice, I will close this PR next week.

srowen added a commit to srowen/spark that referenced this pull request Oct 31, 2016
@asfgit asfgit closed this in 26b07f1 Oct 31, 2016
@Astralidea Astralidea deleted the SPARK-18039 branch October 31, 2016 11:17