
Conversation

@nezihyigitbasi
Contributor

When dynamic resource allocation is enabled, fetching broadcast variables from removed executors was causing job failures, and SPARK-9591 fixed this problem by trying all locations of a block before giving up. However, the locations of a block are retrieved only once from the driver in this process, and the entries in this list can become stale due to dynamic resource allocation. The situation gets worse on a large cluster, where the location list can contain several hundred entries, tens of which may be stale. What we have observed is that with the default settings of 3 max retries and 5s between retries (that's 15s per location), the time it takes to read a broadcast variable can be as high as ~17m (70 failed attempts * 15s/attempt).
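To make the idea concrete, here is a minimal sketch of the approach (all identifiers are illustrative and simplified, not the actual BlockManager code): keep a failure counter, and once enough fetches from the initially reported locations have failed, ask the driver for a fresh location list instead of grinding through every stale entry. A separate cap on total failures keeps the loop bounded.

```scala
// Minimal sketch with hypothetical names; the real logic lives in BlockManager
// and uses the actual driver/transfer-service APIs.
object LocationRefreshSketch {
  def fetchWithRefresh(
      blockId: String,
      getLocations: String => Seq[String],                 // stand-in for asking the driver
      fetchFrom: (String, String) => Option[Array[Byte]],  // (location, blockId); None on failure
      failuresBeforeRefresh: Int,
      maxTotalFailures: Int): Option[Array[Byte]] = {
    var totalFailures = 0
    var failuresSinceRefresh = 0
    var remaining = getLocations(blockId)
    while (remaining.nonEmpty && totalFailures < maxTotalFailures) {
      val loc = remaining.head
      remaining = remaining.tail
      fetchFrom(loc, blockId) match {
        case Some(bytes) => return Some(bytes)
        case None =>
          totalFailures += 1
          failuresSinceRefresh += 1
          // After enough failures, refresh the (possibly stale) location list
          // instead of walking the rest of the original snapshot.
          if (failuresSinceRefresh >= failuresBeforeRefresh) {
            remaining = getLocations(blockId)
            failuresSinceRefresh = 0
          }
      }
    }
    None // bounded by maxTotalFailures, so a bad location list cannot loop forever
  }
}
```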

Contributor Author

Set the default value to Int.MaxValue so that locations will not get refreshed by default, which I think is OK for small clusters. What do you think?
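For illustration, the suggestion amounts to something like the following; the config key name is hypothetical, and only the Int.MaxValue default is the point:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Hypothetical key name. With Int.MaxValue as the default, the threshold is
// effectively never reached, so locations are not refreshed unless a user
// explicitly sets a lower value.
val failuresBeforeLocationRefresh =
  conf.getInt("spark.block.failures.beforeLocationRefresh", Int.MaxValue)
```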

@andrewor14
Contributor

ok to test

@SparkQA

SparkQA commented Feb 19, 2016

Test build #51510 has finished for PR 11241 at commit 45bdec6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nezihyigitbasi
Contributor Author

Updated to fix the style problems.

@SparkQA

SparkQA commented Feb 19, 2016

Test build #51575 has finished for PR 11241 at commit f6fdfee.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

This is a pretty brittle way to test this; the test may be flaky and it will take a long time to run. Can you rewrite it so that it's more of a unit test (e.g., by mocking)?

Contributor Author

I also don't like depending on timing, but I couldn't really find a decent way to trigger this code path (a case where a previously failing block fetch succeeds after a refresh). Which component do you propose to mock?

Contributor

Well, one thing you could do is pass in your own custom BlockTransferService that overrides fetchBlockSync to throw exceptions for the first N block managers. Then you can use Mockito verify to check how many times BlockManager#getLocations was called. It's a bit more work, but the long-term advantage is significant.
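A rough sketch of that testing pattern, using simplified stand-in traits rather than the real BlockTransferService and BlockManager signatures (which take more parameters); the names here are only for illustration:

```scala
import org.mockito.ArgumentMatchers.anyString
import org.mockito.Mockito.{mock, times, verify, when}

object MockedFetchSketch {
  // Simplified stand-ins for BlockTransferService#fetchBlockSync and the
  // driver-side location lookup.
  trait TransferService { def fetchBlockSync(executorId: String, blockId: String): Array[Byte] }
  trait LocationSource { def getLocations(blockId: String): Seq[String] }

  def main(args: Array[String]): Unit = {
    val transfer = mock(classOf[TransferService])
    val locations = mock(classOf[LocationSource])

    // The first two "executors" fail as if they had been removed; the third succeeds.
    when(transfer.fetchBlockSync(anyString(), anyString()))
      .thenThrow(new RuntimeException("stale executor"))
      .thenThrow(new RuntimeException("stale executor"))
      .thenReturn(Array[Byte](1, 2, 3))
    when(locations.getLocations(anyString())).thenReturn(Seq("exec-1", "exec-2", "exec-3"))

    // Drive the mocks the way the fetch path would: get locations once, then
    // try executors in order until one returns the block.
    val blockId = "broadcast_0_piece0"
    val bytes = locations.getLocations(blockId).view
      .map(exec => scala.util.Try(transfer.fetchBlockSync(exec, blockId)).toOption)
      .collectFirst { case Some(b) => b }
    assert(bytes.isDefined)

    // Mockito can then verify how many times the location list was consulted,
    // which is what a refresh-counting assertion would check.
    verify(locations, times(1)).getLocations(blockId)
  }
}
```

In the real test, the code being driven would be the block-fetch path itself with such a transfer service plugged in, rather than the inline loop above.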

@andrewor14
Contributor

@nezihyigitbasi thanks for explaining the issue concisely in the description. I can see how this patch fixes it, but as I mentioned in my comments, I think we should just make the refresh threshold a constant instead of allowing the user to set it. Another concern I have is that whatever solution we come up with here, we need to make sure we never go into an infinite loop. It's hard to prove that this patch in its current state cannot introduce one.

@nezihyigitbasi
Contributor Author

@andrewor14 thanks for taking a look. We can introduce a global failure threshold to break out, but do we really want that global threshold to be a constant? With the same settings, one run can succeed while another fails (hits the threshold), depending on the order of the live/removed executors in the location list (tl;dr from a user's point of view, a job can fail arbitrarily from run to run).

@SparkQA

SparkQA commented Feb 19, 2016

Test build #51576 has finished for PR 11241 at commit 6a5e7f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

nezihyigitbasi force-pushed the SPARK-13328 branch 3 times, most recently from 37fb00d to 2412504 on February 22, 2016 at 18:54
@nezihyigitbasi
Contributor Author

@andrewor14 addressed your comments, can you please take a look?

@SparkQA

SparkQA commented Feb 22, 2016

Test build #51675 has finished for PR 11241 at commit 2412504.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2016

Test build #51686 has finished for PR 11241 at commit e444072.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

nezihyigitbasi force-pushed the SPARK-13328 branch 2 times, most recently from 07f731b to 44ec18b on February 22, 2016 at 22:33
@SparkQA

SparkQA commented Feb 22, 2016

Test build #51690 has finished for PR 11241 at commit 44ec18b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2016

Test build #51697 has finished for PR 11241 at commit b67bf56.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 23, 2016

Test build #51700 has finished for PR 11241 at commit 663e387.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nezihyigitbasi
Contributor Author

@andrewor14 addressed your comments && tests have passed, can you please take a look?

@nezihyigitbasi
Contributor Author

@andrewor14 do you have any other comments for this PR?

@tgravescs
Contributor

Sorry, but I disagree on this limit not being configurable. Depending on how big your job, cluster, and broadcast are, a user may want to set this differently. I think we should make this configurable; we can leave it as an undocumented internal config for now, but I would like an out if my users start hitting this. @andrewor14 thoughts?

Note that I recently ran into this with dynamic allocation, and it took forever for those tasks to fail. I'm in the process of testing this patch but haven't run into that condition again yet.

Contributor

You can also make this fully private if you just get it from the conf in tests. In general, it's best to minimize the number of things we expose.

@andrewor14
Contributor

Looks great. My remaining comments are relatively minor. About making it configurable, it's probably OK as long as we don't also document it. I just don't want the user to have to think about their applications at this level of detail. We want Spark to be easy to use without a ton of tweaking. Maybe that's not really the case today but it's a goal we're striving towards.

(TL;DR keep the config but don't document it)

@nezihyigitbasi
Contributor Author

@andrewor14 comments addressed.

@SparkQA

SparkQA commented Mar 10, 2016

Test build #52851 has finished for PR 11241 at commit bba6d4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nezihyigitbasi
Contributor Author

@andrewor14 @tgravescs @squito guys I believe this is ready to get in. Do you have any other comments?

@andrewor14
Contributor

Have you seen my latest comments about exposing fewer things for tests?

nezihyigitbasi force-pushed the SPARK-13328 branch 2 times, most recently from b418e13 to 5bcf323 on March 10, 2016 at 23:28
@nezihyigitbasi
Contributor Author

@andrewor14 just saw it and also rebased (seems like some changes have been pushed to master).

Contributor

you have an extra space here

@andrewor14
Contributor

LGTM. Once this passes tests, I'll go ahead and merge it. Thanks everyone for your input.

@nezihyigitbasi
Contributor Author

@andrewor14 got rid of the extra whitespace. Thanks everyone for the reviews.

@SparkQA

SparkQA commented Mar 10, 2016

Test build #52868 has finished for PR 11241 at commit 5bcf323.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 10, 2016

Test build #52869 has finished for PR 11241 at commit 7ba025f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

nezihyigitbasi force-pushed the SPARK-13328 branch 3 times, most recently from c8f2557 to 0875b24 on March 11, 2016 at 00:20
@SparkQA

SparkQA commented Mar 11, 2016

Test build #52867 has finished for PR 11241 at commit b418e13.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52871 has finished for PR 11241 at commit 0875b24.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52901 has finished for PR 11241 at commit 0875b24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Merging into master, thanks!

@andrewor14
Contributor

Note to self: remember to close the issue once JIRA is back up

asfgit closed this in ff776b2 on Mar 11, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
…h dynamic resource allocation

When dynamic resource allocation is enabled, fetching broadcast variables from removed executors was causing job failures, and SPARK-9591 fixed this problem by trying all locations of a block before giving up. However, the locations of a block are retrieved only once from the driver in this process, and the entries in this list can become stale due to dynamic resource allocation. The situation gets worse on a large cluster, where the location list can contain several hundred entries, tens of which may be stale. What we have observed is that with the default settings of 3 max retries and 5s between retries (that's 15s per location), the time it takes to read a broadcast variable can be as high as ~17m (70 failed attempts * 15s/attempt).

Author: Nezih Yigitbasi <[email protected]>

Closes apache#11241 from nezihyigitbasi/SPARK-13328.