[SPARK-13328][Core]: Poor read performance for broadcast variables with dynamic resource allocation #11241
Conversation
Set the default value to Int.MaxValue so that locations will not get refreshed by default, which I think is OK for small clusters. What do you think?
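For illustration only, that suggestion might look something like the snippet below (the config key is hypothetical here, and this is a sketch rather than the patch itself):

```scala
import org.apache.spark.SparkConf

// With Int.MaxValue as the default, the failure counter never reaches the
// threshold in practice, so block locations are not re-fetched unless the
// user opts in. The config key is shown purely for illustration.
val conf = new SparkConf()
val maxFailuresBeforeLocationRefresh =
  conf.getInt("spark.block.failures.beforeLocationRefresh", Int.MaxValue)
```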
ok to test
Test build #51510 has finished for PR 11241 at commit
Force-pushed from 45bdec6 to f6fdfee
Updated to fix the style problems.
Test build #51575 has finished for PR 11241 at commit
Force-pushed from f6fdfee to 6a5e7f5
this is a pretty brittle way to test this; the test may be flaky and it will take a long time to run. Can you rewrite this in a way that's more of a unit test (e.g. by mocking)?
I also don't like depending on timing, but I couldn't really find a decent way to trigger this code path (a case where a previously failing block fetch succeeds after a refresh). Which component do you propose to mock?
well, one thing you could do is pass in your own custom BlockTransferService that overrides fetchBlockSync to throw exceptions for the first N block managers. Then you can use Mockito verify to check how many times BlockManager#getLocations was called. It's a bit more work, but the long-term advantage is significant.
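A rough sketch of that pattern is below. It uses standalone Scala with simplified stand-in traits rather than Spark's real BlockTransferService/BlockManager classes, so every name and signature here is illustrative only:

```scala
import org.mockito.Mockito.{mock, times, verify, when}

// Simplified stand-ins for Spark's BlockTransferService and BlockManagerMaster,
// used only to illustrate the suggested testing pattern; the real interfaces differ.
trait FakeTransferService { def fetchBlockSync(execId: String, blockId: String): Array[Byte] }
trait FakeMaster { def getLocations(blockId: String): Seq[String] }

object LocationRefreshTestSketch {
  def main(args: Array[String]): Unit = {
    // A transfer service that fails for the first two locations and then succeeds.
    var fetchAttempts = 0
    val transfer = new FakeTransferService {
      override def fetchBlockSync(execId: String, blockId: String): Array[Byte] = {
        fetchAttempts += 1
        if (fetchAttempts <= 2) throw new RuntimeException(s"fetch of $blockId from $execId failed")
        else Array[Byte](1)
      }
    }

    // A Mockito mock so the number of getLocations calls can be verified.
    val master = mock(classOf[FakeMaster])
    when(master.getLocations("broadcast_0")).thenReturn(Seq("exec-1", "exec-2", "exec-3"))

    // Minimal fetch loop standing in for the code under test: refresh the
    // location list after every failed fetch (a refresh threshold of 1 here).
    def fetch(blockId: String): Array[Byte] = {
      var result: Array[Byte] = null
      while (result == null) {
        val locations = master.getLocations(blockId)
        try result = transfer.fetchBlockSync(locations.head, blockId)
        catch { case _: RuntimeException => () } // retry with refreshed locations
      }
      result
    }

    fetch("broadcast_0")
    // Two failures plus the final successful attempt => three location lookups.
    verify(master, times(3)).getLocations("broadcast_0")
  }
}
```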
@nezihyigitbasi thanks for explaining the issue concisely in the description. I can see how this patch fixes it, but as I mentioned in my comments I think we should just make the refresh threshold a constant instead of allowing the user to set it. Another concern I have is that whatever solution we come up with here, we need to make sure we never go into an infinite loop. It's hard to prove that this patch in its current state does not potentially introduce one.
@andrewor14 thanks for taking a look. We can introduce a global failure threshold to break out, but do we really want that global threshold to be a constant? It's possible that, from run to run with the same settings, one run can succeed and another can fail (hit the threshold) depending on the order of the live/removed executors in the location list (tl;dr from a user's point of view, a job can arbitrarily fail from run to run).
Test build #51576 has finished for PR 11241 at commit
Force-pushed from 37fb00d to 2412504
@andrewor14 addressed your comments, can you please take a look?
Test build #51675 has finished for PR 11241 at commit
Force-pushed from 2412504 to e444072
Test build #51686 has finished for PR 11241 at commit
Force-pushed from 07f731b to 44ec18b
Test build #51690 has finished for PR 11241 at commit
Force-pushed from 44ec18b to b67bf56
Test build #51697 has finished for PR 11241 at commit
Force-pushed from b67bf56 to 663e387
Test build #51700 has finished for PR 11241 at commit
@andrewor14 addressed your comments && tests have passed, can you please take a look?
@andrewor14 do you have any other comments for this PR?
Sorry, but I disagree on this limit not being configurable. Depending on how big your job, cluster, and broadcast are, a user may want to set this differently. I think we should make this configurable; we can leave it as an undocumented internal config for now, but I would like an out if my users start hitting this. @andrewor14 thoughts? Note I recently ran into this with dynamic allocation and it took forever for those tasks to fail. I'm in the process of testing this but haven't run into that condition again yet.
You can also make this fully private if you just get it from the conf in tests. In general it's best to minimize the number of things we expose.
Looks great. My remaining comments are relatively minor. About making it configurable, it's probably OK as long as we don't also document it. I just don't want the user to have to think about their applications at this level of detail. We want Spark to be easy to use without a ton of tweaking. Maybe that's not really the case today but it's a goal we're striving towards. (TL;DR keep the config but don't document it)
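A small sketch of that compromise, combined with the "fully private" suggestion above: the threshold stays private to the component, and a test derives its expectation from the SparkConf it passed in rather than from an exposed field. The class name and config key below are illustrative, not necessarily what the patch uses:

```scala
import org.apache.spark.SparkConf

// The threshold is private to the component; callers and tests never touch the field.
// The config key below is illustrative and intentionally left undocumented.
class BlockFetcherSketch(conf: SparkConf) {
  private val maxFailuresBeforeLocationRefresh =
    conf.getInt("spark.block.failures.beforeLocationRefresh", 5)
  require(maxFailuresBeforeLocationRefresh > 0, "refresh threshold must be positive")
}

object BlockFetcherSketchTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().set("spark.block.failures.beforeLocationRefresh", "2")
    val fetcher = new BlockFetcherSketch(conf)
    // The test reads its expectation straight from the conf, not from a field on the component.
    assert(conf.getInt("spark.block.failures.beforeLocationRefresh", 5) == 2)
    println(s"constructed $fetcher with an overridden refresh threshold")
  }
}
```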
Force-pushed from e3045dd to bba6d4c
@andrewor14 comments addressed.
Test build #52851 has finished for PR 11241 at commit
@andrewor14 @tgravescs @squito guys I believe this is ready to get in. Do you have any other comments?
Have you seen my latest comments about exposing fewer things for tests?
Force-pushed from b418e13 to 5bcf323
@andrewor14 just saw it and also rebased (seems like some changes have been pushed to master).
you have an extra space here
LGTM. Once this passes tests I'll go ahead and merge it. Thanks, everyone, for your input.
Force-pushed from 5bcf323 to 7ba025f
@andrewor14 got rid of the whitespace. Thanks, everyone, for the reviews.
Test build #52868 has finished for PR 11241 at commit
Test build #52869 has finished for PR 11241 at commit
Force-pushed from c8f2557 to 0875b24
Test build #52867 has finished for PR 11241 at commit
Test build #52871 has finished for PR 11241 at commit
retest this please
Test build #52901 has finished for PR 11241 at commit
Merging into master, thanks!
Note to self: remember to close the issue once JIRA is back up |
…h dynamic resource allocation

When dynamic resource allocation is enabled, fetching broadcast variables from removed executors was causing job failures, and SPARK-9591 fixed this problem by trying all locations of a block before giving up. However, the locations of a block are retrieved only once from the driver in this process, and the entries in this list can be stale due to dynamic resource allocation. This situation gets worse on a large cluster, where the location list can contain several hundred entries, tens of which may be stale. What we have observed is that with the default settings of 3 max retries and 5s between retries (that's 15s per location), the time it takes to read a broadcast variable can be as high as ~17m (70 failed attempts * 15s/attempt).

Author: Nezih Yigitbasi <[email protected]>

Closes apache#11241 from nezihyigitbasi/SPARK-13328.
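To make the idea in the description concrete: with the defaults quoted above, 70 stale attempts at ~15 s per attempt is roughly 17.5 minutes, which matches the ~17m observed. Below is a minimal standalone sketch of the refresh-on-failure approach, including a global cap so the loop can never run forever; it is illustrative only, and the names, threshold, and cap value are assumptions rather than the actual Spark implementation:

```scala
// Standalone sketch of the refresh-on-failure idea: instead of walking one
// possibly stale location list until every entry has failed, re-fetch the list
// from the driver after a bounded number of consecutive failures, and cap the
// total number of failures so the loop can never run forever.
object RefreshOnFailureSketch {
  def fetchBlock(
      getLocations: () => Seq[String],     // e.g. ask the driver for current locations
      fetchFrom: String => Array[Byte],    // e.g. a synchronous block fetch
      maxFailuresBeforeRefresh: Int = 5,   // assumed threshold, for illustration
      maxTotalFailures: Int = 100          // assumed global cap against infinite loops
  ): Option[Array[Byte]] = {
    var locations = getLocations()
    var failuresSinceRefresh = 0
    var totalFailures = 0
    while (locations.nonEmpty && totalFailures < maxTotalFailures) {
      try {
        return Some(fetchFrom(locations.head))
      } catch {
        case _: Exception =>
          totalFailures += 1
          failuresSinceRefresh += 1
          locations = locations.tail
          if (failuresSinceRefresh >= maxFailuresBeforeRefresh) {
            // Removed executors are dropped by the driver, so a refreshed list
            // avoids spending ~15s of retries on each stale location.
            locations = getLocations()
            failuresSinceRefresh = 0
          }
      }
    }
    None
  }
}
```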