felixb commented Sep 7, 2015

This change rejects offers for slaves with unmet constraints for 120s to mitigate offer starvation.
This prevents Mesos from sending us these offers again and again.
In return, we get more offers for slaves that might meet our constraints,
and it lets Mesos send the rejected offers to other frameworks.
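
To make the idea concrete, here is a minimal, library-free sketch of the triage described above. It is not the patch itself and does not use the Mesos or Spark APIs: `Offer`, `meets_constraints`, and `triage_offers` are hypothetical names, and the constraint format (attribute name mapped to a set of allowed values) is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for a Mesos resource offer; not the real Mesos type.
@dataclass
class Offer:
    offer_id: str
    attributes: dict = field(default_factory=dict)


def meets_constraints(offer, constraints):
    """Return True if every constrained attribute has one of its allowed values."""
    return all(
        offer.attributes.get(name) in allowed
        for name, allowed in constraints.items()
    )


def triage_offers(offers, constraints, refuse_seconds=120):
    """Split offers into usable ones and (offer_id, refuse_seconds) declines.

    Declining with a long refuse period asks the master not to re-offer
    those resources to this framework until the period has elapsed.
    """
    usable, declines = [], []
    for offer in offers:
        if meets_constraints(offer, constraints):
            usable.append(offer)
        else:
            declines.append((offer.offer_id, refuse_seconds))
    return usable, declines
```

With constraints `{"zone": {"a"}}`, an offer from a slave in zone `a` is kept, while offers from other zones (or with no `zone` attribute) are declined for 120 seconds and become available to other frameworks in the meantime.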

dragos commented Sep 8, 2015

You should probably modify the fine-grained scheduler in the same way.

Contributor

Can we make this configurable? Please also add a comment stating the unit.

tnachen commented Sep 9, 2015

I think the change makes sense. We're planning to add dynamic attribute changes on the slave, but that's not merged yet in Mesos. As @dragos mentioned, please add this to coarse-grained mode too.
And once you make this configurable, please also update the Spark on Mesos docs in the docs folder.

@felixb force-pushed the decline_offers_constraint_mismatch branch from c1efb1f to bb79444 on September 10, 2015, 06:17

felixb commented Sep 10, 2015

I made the duration configurable. I still need to add it to the fine-grained scheduler.

@andrewor14
Contributor

ok to test

Contributor

There is also a configurations.md that you should add this to.

Contributor

@tnachen, none of the YARN-specific or Mesos-specific settings are listed in there.

Contributor

Also, typo: for for.

SparkQA commented Sep 10, 2015

Test build #42277 has finished for PR 8639 at commit bb79444.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixb force-pushed the decline_offers_constraint_mismatch branch from bb79444 to 5acfd65 on September 11, 2015, 08:22

felixb commented Sep 11, 2015

Added the same logic to the fine-grained scheduler.
All Mesos configuration is documented only in running-on-mesos.md, so we skipped adding it to configurations.md.

SparkQA commented Sep 11, 2015

Test build #42320 has finished for PR 8639 at commit 5acfd65.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixb force-pushed the decline_offers_constraint_mismatch branch from 5acfd65 to 7626d45 on September 11, 2015, 08:55

SparkQA commented Sep 11, 2015

Test build #42321 has finished for PR 8639 at commit 7626d45.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixb force-pushed the decline_offers_constraint_mismatch branch from 7626d45 to 66a1a73 on September 11, 2015, 09:31

SparkQA commented Sep 11, 2015

Test build #42324 has finished for PR 8639 at commit 66a1a73.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixb force-pushed the decline_offers_constraint_mismatch branch from 66a1a73 to ce84b1a on September 15, 2015, 10:16

felixb commented Sep 15, 2015

I just rebased to the current upstream/master.
Any hint as to why the tests keep failing?

SparkQA commented Sep 15, 2015

Test build #42483 has finished for PR 8639 at commit ce84b1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dragos commented Sep 15, 2015

They're probably just flaky.

@felixb force-pushed the decline_offers_constraint_mismatch branch from ce84b1a to 9e00071 on September 16, 2015, 07:07

felixb commented Sep 16, 2015

fixed typo.

SparkQA commented Sep 16, 2015

Test build #42527 has finished for PR 8639 at commit 9e00071.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SleepyThread
Contributor

@tnachen @andrewor14 friendly reminder..

tnachen commented Sep 17, 2015

There is a big HTML table at the bottom of this file. Can you also add it to that list?

@felixb force-pushed the decline_offers_constraint_mismatch branch from 9e00071 to 58aaa79 on September 18, 2015, 05:22

felixb commented Sep 18, 2015

added to table of parameters.

SparkQA commented Sep 18, 2015

Test build #42646 has finished for PR 8639 at commit 58aaa79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SleepyThread
Contributor

@tnachen @andrewor14 friendly reminder..

Contributor

Please use `conf.getTimeAsSeconds` instead, in which case the default value would be "120s".
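
For readers unfamiliar with that style of setting, the following sketch illustrates what reading a duration as a time string with a "120s" default amounts to. It is a simplified stand-in, not Spark's implementation: `time_as_seconds` is a hypothetical helper, only whole-second-or-larger suffixes are handled, and a bare number is assumed to mean seconds.

```python
import re

# Seconds per supported unit suffix; sub-second units are deliberately omitted.
_UNIT_SECONDS = {"s": 1, "m": 60, "min": 60, "h": 3600, "d": 86400}


def time_as_seconds(value, default="120s"):
    """Parse a duration string such as '120s' or '2m' into whole seconds.

    Falls back to `default` when `value` is None, mirroring a config lookup
    for a key that has not been set.
    """
    text = (value if value is not None else default).strip().lower()
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", text)
    if match is None:
        raise ValueError(f"not a duration: {text!r}")
    number, unit = match.groups()
    unit = unit or "s"  # assumption: a bare number means seconds
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unsupported time unit: {unit!r}")
    return int(number) * _UNIT_SECONDS[unit]
```

Expressing the default as a string with a unit keeps the unit visible at the call site, which also addresses the earlier request to document it.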

@andrewor14
Contributor

@felixb sorry for slipping. This looks pretty good. Thanks for taking the time to fix this. Once you address the comments I will merge this.

@felixb force-pushed the decline_offers_constraint_mismatch branch from 58aaa79 to 69c3e52 on October 19, 2015, 05:20

felixb commented Oct 19, 2015

I addressed all your comments.

SparkQA commented Oct 19, 2015

Test build #43912 has finished for PR 8639 at commit 69c3e52.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixb force-pushed the decline_offers_constraint_mismatch branch from 69c3e52 to 72a2855 on October 19, 2015, 05:42

SparkQA commented Oct 19, 2015

Test build #43914 has finished for PR 8639 at commit 72a2855.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

please keep this protected

…ints

This change rejects offers for slaves with unmet constraints for 120s to mitigate offer starvation.
This prevents Mesos from sending us these offers again and again.
In return, we get more offers for slaves that might meet our constraints,
and it lets Mesos send the rejected offers to other frameworks.

@felixb force-pushed the decline_offers_constraint_mismatch branch from 72a2855 to 785e4ae on October 20, 2015, 08:29

SparkQA commented Oct 20, 2015

Test build #43971 has finished for PR 8639 at commit 785e4ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

felixb commented Oct 25, 2015

Is there anything else I can do?

@andrewor14
Contributor

retest this please

@andrewor14
Contributor

LGTM merging into master and 1.6. Thanks for your work and patience!

asfgit pushed a commit that referenced this pull request Nov 9, 2015
This change rejects offers for slaves with unmet constraints for 120s to mitigate offer starvation.
This prevents Mesos from sending us these offers again and again.
In return, we get more offers for slaves that might meet our constraints,
and it lets Mesos send the rejected offers to other frameworks.

Author: Felix Bechstein <[email protected]>

Closes #8639 from felixb/decline_offers_constraint_mismatch.

(cherry picked from commit 5039a49)
Signed-off-by: Andrew Or <[email protected]>

@asfgit closed this in 5039a49 on Nov 9, 2015

SparkQA commented Nov 10, 2015

Test build #45416 has finished for PR 8639 at commit 785e4ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 4, 2016
Similar to #8639

This change rejects offers for 120s once `spark.cores.max` has been reached in coarse-grained mode, to mitigate offer starvation. This prevents Mesos from sending us offers again and again, which starves other frameworks. This is especially problematic when running many small frameworks on the same Mesos cluster, e.g. many small Spark streaming jobs, and it causes the bigger Spark jobs to stop receiving offers. By rejecting the offers for a long period of time, they become available to those other frameworks.

Author: Sebastien Rainville <[email protected]>

Closes #10924 from sebastienrainville/master.
asfgit pushed a commit that referenced this pull request May 4, 2016
Similar to #8639

This change rejects offers for 120s once `spark.cores.max` has been reached in coarse-grained mode, to mitigate offer starvation. This prevents Mesos from sending us offers again and again, which starves other frameworks. This is especially problematic when running many small frameworks on the same Mesos cluster, e.g. many small Spark streaming jobs, and it causes the bigger Spark jobs to stop receiving offers. By rejecting the offers for a long period of time, they become available to those other frameworks.

Author: Sebastien Rainville <[email protected]>

Closes #10924 from sebastienrainville/master.

(cherry picked from commit eb019af)
Signed-off-by: Andrew Or <[email protected]>
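
The follow-up change referenced above (#10924) applies the same long-decline idea to a different trigger: the framework already holds `spark.cores.max` cores. A rough, library-free sketch of that decision is shown here. `handle_offers` and its parameters are hypothetical names chosen for illustration, and the real scheduler logic is more involved than this.

```python
def handle_offers(offer_ids, cores_acquired, max_cores, refuse_seconds=120):
    """Return (offers_to_consider, declines) for one batch of offers.

    Once the framework holds its maximum number of cores it cannot use any
    further offer, so every offer is declined with a long refuse period
    instead of being handed back only to be re-offered again shortly after.
    """
    if cores_acquired >= max_cores:
        return [], [(offer_id, refuse_seconds) for offer_id in offer_ids]
    return list(offer_ids), []
```

While the framework is below its core limit, the batch is passed through untouched and the usual per-offer checks, such as the constraint matching from this pull request, would still apply.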