Reorder the BP loop to make manager persistence more reliable #1436
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main    #1436      +/-   ##
==========================================
+ Coverage   90.86%   91.40%   +0.54%
==========================================
  Files          75       75
  Lines       41420    44598    +3178
  Branches    41420    44598    +3178
==========================================
+ Hits        37636    40766    +3130
- Misses       3784     3832      +48
Continue to review full report at Codecov.
vincenzopalazzo
left a comment
LGTM
arik-so
left a comment
Are we sure this will make a big difference, considering a) it's a loop and b) the other persistence calls are still happening after process_events?
I mean, no, but most "normal" nodes aren't doing a lot of stuff regularly - in the case that led to this PR the node was sending a payment and otherwise not doing anything. In that case, where we're starting from all threads doing nothing, I'd think this will change the "received payment failure case" to retry after persisting the ChannelManager, which is exactly the important change for this case.
which "other persistence calls"? Nothing happens after the event processing now. |
|
I was mostly referring to this line: https://github.com/lightningdevkit/rust-lightning/pull/1436/files#diff-3c70a8bcbb58522dbe589e774a85bbb611adab7142bb1eacec4fb7828cb3ee06R224 But if we're reasonably confident that it was on the first iteration, I'm ACKing
// hence it comes last here. When the ChannelManager finishes whatever its doing,
// we want to ensure we get into `persist_manager` as quickly as we can, especially
// without running the normal event processing above and handing events to users.
Hmm, not clear to me what "normal event processing" refers to. Also, within chanman.process_pending_events, it seems like we do hand events to users without calling persist_manager first? Missing something..
I added more docs, let me know if it's a bit clearer.
9dfd5ee
The main loop of the background processor has this line: `peer_manager.process_events(); // Note that this may block on ChannelManager's locking` which does, indeed, sometimes block waiting on the `ChannelManager` to finish whatever it's doing. Specifically, it's the only place in the background processor loop where we block waiting on the `ChannelManager`, so if the `ChannelManager` is relatively busy, we may end up being blocked there most of the time.

This should be fine, except today we had a user whose node was particularly slow in processing some channel updates, resulting in the background processor being blocked there (as expected). Then, when the channel updates were completed (and persisted), the next thing the background processor did was hand the user events to process, creating yet more channel updates. Ultimately, the user's node crashed before finishing the event processing. This left us with an updated monitor on disk and an outdated manager, and they lost the channel on startup.

Here we simply move the above quoted line to after the normal event processing, ensuring the next thing we do after blocking on `ChannelManager` locks is persist the manager, prior to event handling.
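For readers less familiar with the background processor, here is a minimal, self-contained sketch of the ordering the commit message describes. All type and helper names (`Manager`, `Peers`, `Persister`, `needs_persistence`, `one_iteration`) are illustrative stand-ins, not rust-lightning's actual APIs; only the relative order of event handling, the potentially-blocking `process_events` call, and manager persistence is meant to mirror the PR.

// Toy model of one background-processor iteration after this PR. All types
// are stand-ins for illustration, not rust-lightning's actual structs.
struct Manager;
struct Peers;
struct Persister;

impl Manager {
    // Stand-in for handing pending events to the user's event handler; in the
    // failure described above, handling these events created more channel updates.
    fn process_pending_events(&self, handler: &dyn Fn(&str)) { handler("event"); }
    // Stand-in for "the ChannelManager has changes that need to hit disk".
    fn needs_persistence(&self) -> bool { true }
}

impl Peers {
    // Stand-in for `peer_manager.process_events()`, the one call in the loop
    // that may block while the ChannelManager finishes in-flight work.
    fn process_events(&self) {}
}

impl Persister {
    fn persist_manager(&self, _manager: &Manager) -> Result<(), std::io::Error> { Ok(()) }
}

fn one_iteration(
    manager: &Manager, peers: &Peers, persister: &Persister, handler: &dyn Fn(&str),
) -> Result<(), std::io::Error> {
    // Normal event processing happens first...
    manager.process_pending_events(handler);
    // ...so the potentially-blocking call now comes last in the iteration. When it
    // returns (i.e. the ChannelManager just finished a batch of updates)...
    peers.process_events();
    // ...the very next thing we do is persist the manager, before any new events
    // are handed to the user at the top of the next iteration.
    if manager.needs_persistence() {
        persister.persist_manager(manager)?;
    }
    Ok(())
}

fn main() -> Result<(), std::io::Error> {
    let (manager, peers, persister) = (Manager, Peers, Persister);
    one_iteration(&manager, &peers, &persister, &|_event| { /* user event handler */ })
}

The only point of the sketch is the ordering: with persistence following immediately after the blocking call, a crash during subsequent event handling can no longer leave a freshly persisted monitor on disk alongside a stale manager.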
Force-pushed from 9dfd5ee to 050b19c
Squashed fixup.
arik-so
left a comment
Re-ACKing