
Conversation

@nezihyigitbasi
Contributor

When dynamic resource allocation is enabled, fetching broadcast variables from removed executors was causing job failures, and SPARK-9591 fixed this problem by trying all locations of a block before giving up. However, the locations of a block are retrieved only once from the driver in this process, and the entries in this list can become stale due to dynamic resource allocation. The situation gets worse on a large cluster, where the location list can contain several hundred entries, tens of which may be stale. What we have observed is that with the default settings of 3 max retries and 5s between retries (that's 15s per location), the time it takes to read a broadcast variable can be as high as ~17m (70 failed attempts * 15s/attempt).
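To make the idea concrete, here is a minimal sketch of the approach (all identifiers are illustrative and simplified, not the actual BlockManager code): keep a failure counter, and once enough fetches from the initially reported locations have failed, ask the driver for a fresh location list instead of grinding through every stale entry. A separate cap on total failures keeps the loop bounded.

```scala
// Minimal sketch with hypothetical names; the real logic lives in BlockManager
// and uses the actual driver/transfer-service APIs.
object LocationRefreshSketch {
  def fetchWithRefresh(
      blockId: String,
      getLocations: String => Seq[String],                 // stand-in for asking the driver
      fetchFrom: (String, String) => Option[Array[Byte]],  // (location, blockId); None on failure
      failuresBeforeRefresh: Int,
      maxTotalFailures: Int): Option[Array[Byte]] = {
    var totalFailures = 0
    var failuresSinceRefresh = 0
    var remaining = getLocations(blockId)
    while (remaining.nonEmpty && totalFailures < maxTotalFailures) {
      val loc = remaining.head
      remaining = remaining.tail
      fetchFrom(loc, blockId) match {
        case Some(bytes) => return Some(bytes)
        case None =>
          totalFailures += 1
          failuresSinceRefresh += 1
          // After enough failures, refresh the (possibly stale) location list
          // instead of walking the rest of the original snapshot.
          if (failuresSinceRefresh >= failuresBeforeRefresh) {
            remaining = getLocations(blockId)
            failuresSinceRefresh = 0
          }
      }
    }
    None // bounded by maxTotalFailures, so a bad location list cannot loop forever
  }
}
```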

Contributor Author

Set the default value to Int.MaxValue so that locations will not get refreshed by default, which I think is OK for small clusters. What do you think?
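For illustration, the suggestion amounts to something like the following; the config key name is hypothetical, and only the Int.MaxValue default is the point:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Hypothetical key name. With Int.MaxValue as the default, the threshold is
// effectively never reached, so locations are not refreshed unless a user
// explicitly sets a lower value.
val failuresBeforeLocationRefresh =
  conf.getInt("spark.block.failures.beforeLocationRefresh", Int.MaxValue)
```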

@andrewor14
Contributor

ok to test

@SparkQA

SparkQA commented Feb 19, 2016

Test build #51510 has finished for PR 11241 at commit 45bdec6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nezihyigitbasi
Contributor Author

Updated to fix the style problems.

@SparkQA

SparkQA commented Feb 19, 2016

Test build #51575 has finished for PR 11241 at commit f6fdfee.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

This is a pretty brittle way to test this; the test may be flaky and it will take a long time to run. Can you rewrite it so that it's more of a unit test (e.g., by mocking)?

Contributor Author

I also don't like depending on timing, but I couldn't really find a decent way to trigger this code path (a case where a previously failing block fetch succeeds after a refresh). Which component do you propose to mock?

Contributor

Well, one thing you could do is pass in your own custom BlockTransferService that overrides fetchBlockSync to throw exceptions for the first N block managers. Then you can use Mockito verify to check how many times BlockManager#getLocations was called. It's a bit more work, but the long-term advantage is significant.
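A rough sketch of that testing pattern, using simplified stand-in traits rather than the real BlockTransferService and BlockManager signatures (which take more parameters); the names here are only for illustration:

```scala
import org.mockito.ArgumentMatchers.anyString
import org.mockito.Mockito.{mock, times, verify, when}

object MockedFetchSketch {
  // Simplified stand-ins for BlockTransferService#fetchBlockSync and the
  // driver-side location lookup.
  trait TransferService { def fetchBlockSync(executorId: String, blockId: String): Array[Byte] }
  trait LocationSource { def getLocations(blockId: String): Seq[String] }

  def main(args: Array[String]): Unit = {
    val transfer = mock(classOf[TransferService])
    val locations = mock(classOf[LocationSource])

    // The first two "executors" fail as if they had been removed; the third succeeds.
    when(transfer.fetchBlockSync(anyString(), anyString()))
      .thenThrow(new RuntimeException("stale executor"))
      .thenThrow(new RuntimeException("stale executor"))
      .thenReturn(Array[Byte](1, 2, 3))
    when(locations.getLocations(anyString())).thenReturn(Seq("exec-1", "exec-2", "exec-3"))

    // Drive the mocks the way the fetch path would: get locations once, then
    // try executors in order until one returns the block.
    val blockId = "broadcast_0_piece0"
    val bytes = locations.getLocations(blockId).view
      .map(exec => scala.util.Try(transfer.fetchBlockSync(exec, blockId)).toOption)
      .collectFirst { case Some(b) => b }
    assert(bytes.isDefined)

    // Mockito can then verify how many times the location list was consulted,
    // which is what a refresh-counting assertion would check.
    verify(locations, times(1)).getLocations(blockId)
  }
}
```

In the real test, the code being driven would be the block-fetch path itself with such a transfer service plugged in, rather than the inline loop above.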

@andrewor14
Contributor

@nezihyigitbasi thanks for explaining the issue concisely in the description. I can see how this patch fixes it, but as I mentioned in my comments, I think we should just make the refresh threshold a constant instead of allowing the user to set it. Another concern I have is that whatever solution we come up with here, we need to make sure we never go into an infinite loop. It's hard to prove that this patch in its current state cannot introduce one.

@nezihyigitbasi
Contributor Author

@andrewor14 thanks for taking a look. We can introduce a global failure threshold to break out, but do we really want that global threshold to be a constant? With the same settings, one run can succeed while another fails (hits the threshold), depending on the order of the live/removed executors in the location list (tl;dr from a user's point of view, a job can fail arbitrarily from run to run).

@SparkQA

SparkQA commented Feb 19, 2016

Test build #51576 has finished for PR 11241 at commit 6a5e7f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

nezihyigitbasi force-pushed the SPARK-13328 branch 3 times, most recently from 37fb00d to 2412504 on February 22, 2016 at 18:54
@nezihyigitbasi
Contributor Author

@andrewor14 addressed your comments, can you please take a look?

@SparkQA

SparkQA commented Feb 22, 2016

Test build #51675 has finished for PR 11241 at commit 2412504.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2016

Test build #51686 has finished for PR 11241 at commit e444072.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

nezihyigitbasi force-pushed the SPARK-13328 branch 2 times, most recently from 07f731b to 44ec18b on February 22, 2016 at 22:33
@SparkQA

SparkQA commented Feb 22, 2016

Test build #51690 has finished for PR 11241 at commit 44ec18b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2016

Test build #51697 has finished for PR 11241 at commit b67bf56.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 23, 2016

Test build #51700 has finished for PR 11241 at commit 663e387.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nezihyigitbasi
Contributor Author

@andrewor14 addressed your comments && tests have passed, can you please take a look?

@nezihyigitbasi
Contributor Author

@andrewor14 do you have any other comments for this PR?

@tgravescs
Contributor

Sorry, but I disagree on this limit not being configurable. Depending on how big your job, cluster, and broadcast are, a user may want to set this differently. I think we should make this configurable; we can leave it as an undocumented internal config for now, but I would like an out if my users start hitting this. @andrewor14 thoughts?

Note that I recently ran into this with dynamic allocation, and it took forever for those tasks to fail. I'm in the process of testing this patch but haven't run into that condition again yet.

Contributor

You can also make this fully private if you just get it from the conf in tests. In general, it's best to minimize the number of things we expose.

@andrewor14
Contributor

Looks great. My remaining comments are relatively minor. About making it configurable, it's probably OK as long as we don't also document it. I just don't want the user to have to think about their applications at this level of detail. We want Spark to be easy to use without a ton of tweaking. Maybe that's not really the case today but it's a goal we're striving towards.

(TL;DR keep the config but don't document it)

@nezihyigitbasi
Contributor Author

@andrewor14 comments addressed.

@SparkQA

SparkQA commented Mar 10, 2016

Test build #52851 has finished for PR 11241 at commit bba6d4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nezihyigitbasi
Contributor Author

@andrewor14 @tgravescs @squito guys I believe this is ready to get in. Do you have any other comments?

@andrewor14
Contributor

Have you seen my latest comments about exposing fewer things for tests?

nezihyigitbasi force-pushed the SPARK-13328 branch 2 times, most recently from b418e13 to 5bcf323 on March 10, 2016 at 23:28
@nezihyigitbasi
Contributor Author

@andrewor14 just saw it and also rebased (seems like some changes have been pushed to master).

Contributor

you have an extra space here

@andrewor14
Contributor

LGTM. Once this passes tests, I'll go ahead and merge it. Thanks everyone for your input.

@nezihyigitbasi
Contributor Author

@andrewor14 got rid of the extra whitespace. Thanks everyone for the reviews.

@SparkQA

SparkQA commented Mar 10, 2016

Test build #52868 has finished for PR 11241 at commit 5bcf323.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 10, 2016

Test build #52869 has finished for PR 11241 at commit 7ba025f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

nezihyigitbasi force-pushed the SPARK-13328 branch 3 times, most recently from c8f2557 to 0875b24 on March 11, 2016 at 00:20
@SparkQA

SparkQA commented Mar 11, 2016

Test build #52867 has finished for PR 11241 at commit b418e13.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52871 has finished for PR 11241 at commit 0875b24.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52901 has finished for PR 11241 at commit 0875b24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Merging into master, thanks!

@andrewor14
Contributor

Note to self: remember to close the issue once JIRA is back up

asfgit closed this in ff776b2 on Mar 11, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
…h dynamic resource allocation

When dynamic resource allocation is enabled, fetching broadcast variables from removed executors was causing job failures, and SPARK-9591 fixed this problem by trying all locations of a block before giving up. However, the locations of a block are retrieved only once from the driver in this process, and the entries in this list can become stale due to dynamic resource allocation. The situation gets worse on a large cluster, where the location list can contain several hundred entries, tens of which may be stale. What we have observed is that with the default settings of 3 max retries and 5s between retries (that's 15s per location), the time it takes to read a broadcast variable can be as high as ~17m (70 failed attempts * 15s/attempt).

Author: Nezih Yigitbasi <[email protected]>

Closes apache#11241 from nezihyigitbasi/SPARK-13328.