Reorder the BP loop to make manager persistence more reliable #1436
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main    #1436      +/-   ##
==========================================
+ Coverage   90.86%   91.40%   +0.54%
==========================================
  Files          75       75
  Lines       41420    44598    +3178
  Branches    41420    44598    +3178
==========================================
+ Hits        37636    40766    +3130
- Misses       3784     3832      +48
Continue to review full report at Codecov.
vincenzopalazzo
left a comment
LGTM
arik-so
left a comment
Are we sure this will make a big difference, considering a) it's a loop and b) the other persistence calls are still happening after process_events?
I mean, no, but most "normal" nodes aren't doing a lot of stuff regularly - in the case that led to this PR the node was sending a payment and otherwise not doing anything. In that case, where we're starting from all threads doing nothing, I'd think this will change the "received payment failure case" to retry after persisting the ChannelManager, which is exactly the important change for this case.
which "other persistence calls"? Nothing happens after the event processing now. |
|
I was mostly referring to this line: https://github.com/lightningdevkit/rust-lightning/pull/1436/files#diff-3c70a8bcbb58522dbe589e774a85bbb611adab7142bb1eacec4fb7828cb3ee06R224 But if we're reasonably confident that it was on the first iteration, I'm ACKing
// hence it comes last here. When the ChannelManager finishes whatever its doing,
// we want to ensure we get into `persist_manager` as quickly as we can, especially
// without running the normal event processing above and handing events to users.
Hmm, not clear to me what "normal event processing" refers to. Also, within chanman.process_pending_events, it seems like we do hand events to users without calling persist_manager first? Missing something..
I added more docs, let me know if it's a bit clearer.
9dfd5ee
The main loop of the background processor has this line: `peer_manager.process_events(); // Note that this may block on ChannelManager's locking` which does, indeed, sometimes block waiting on the `ChannelManager` to finish whatever it's doing. Specifically, it's the only place in the background processor loop where we block waiting on the `ChannelManager`, so if the `ChannelManager` is relatively busy, we may end up being blocked there most of the time.

This should be fine, except today we had a user whose node was particularly slow in processing some channel updates, resulting in the background processor being blocked there (as expected). Then, when the channel updates were completed (and persisted), the next thing the background processor did was hand the user events to process, creating yet more channel updates. Ultimately, the user's node crashed before finishing the event processing. This left us with an updated monitor on disk and an outdated manager, and they lost the channel on startup.

Here we simply move the above quoted line to after the normal event processing, ensuring the next thing we do after blocking on `ChannelManager` locks is persist the manager, prior to event handling.
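For readers less familiar with the background processor, here is a minimal, self-contained sketch of the ordering the commit message describes. All type and helper names (`Manager`, `Peers`, `Persister`, `needs_persistence`, `one_iteration`) are illustrative stand-ins, not rust-lightning's actual APIs; only the relative order of event handling, the potentially-blocking `process_events` call, and manager persistence is meant to mirror the PR.

// Toy model of one background-processor iteration after this PR. All types
// are stand-ins for illustration, not rust-lightning's actual structs.
struct Manager;
struct Peers;
struct Persister;

impl Manager {
    // Stand-in for handing pending events to the user's event handler; in the
    // failure described above, handling these events created more channel updates.
    fn process_pending_events(&self, handler: &dyn Fn(&str)) { handler("event"); }
    // Stand-in for "the ChannelManager has changes that need to hit disk".
    fn needs_persistence(&self) -> bool { true }
}

impl Peers {
    // Stand-in for `peer_manager.process_events()`, the one call in the loop
    // that may block while the ChannelManager finishes in-flight work.
    fn process_events(&self) {}
}

impl Persister {
    fn persist_manager(&self, _manager: &Manager) -> Result<(), std::io::Error> { Ok(()) }
}

fn one_iteration(
    manager: &Manager, peers: &Peers, persister: &Persister, handler: &dyn Fn(&str),
) -> Result<(), std::io::Error> {
    // Normal event processing happens first...
    manager.process_pending_events(handler);
    // ...so the potentially-blocking call now comes last in the iteration. When it
    // returns (i.e. the ChannelManager just finished a batch of updates)...
    peers.process_events();
    // ...the very next thing we do is persist the manager, before any new events
    // are handed to the user at the top of the next iteration.
    if manager.needs_persistence() {
        persister.persist_manager(manager)?;
    }
    Ok(())
}

fn main() -> Result<(), std::io::Error> {
    let (manager, peers, persister) = (Manager, Peers, Persister);
    one_iteration(&manager, &peers, &persister, &|_event| { /* user event handler */ })
}

The only point of the sketch is the ordering: with persistence following immediately after the blocking call, a crash during subsequent event handling can no longer leave a freshly persisted monitor on disk alongside a stale manager.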
Force-pushed from 9dfd5ee to 050b19c
Squashed fixup.
arik-so
left a comment
Re-ACKing