Conversation

@TheBlueMatt
Collaborator

The main loop of the background processor has this line:
peer_manager.process_events(); // Note that this may block on ChannelManager's locking
which does, indeed, sometimes block waiting on the ChannelManager
to finish whatever it's doing. Specifically, it's the only place in
the background processor loop where we block waiting on the
ChannelManager, so if the ChannelManager is relatively busy, we
may end up being blocked there most of the time.

This should be fine, except today we had a user whose node was
particularly slow in processing some channel updates, resulting in
the background processor being blocked there (as expected). Then,
when the channel updates were completed (and persisted), the next
thing the background processor did was hand the user events to
process, creating yet more channel updates. Ultimately, the user's
node crashed before finishing the event processing. This left us
with an updated monitor on disk and an outdated manager, and they
lost the channel on startup.

Here we simply move the above-quoted line to after the normal event
processing, ensuring the next thing we do after blocking on
ChannelManager locks is persist the manager, prior to event
handling.
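To make the failure mode concrete, here is a hedged, self-contained simulation of the crash window. All names (`iteration`, `ManagerOnDisk`, the flags) are illustrative placeholders, not LDK APIs: if the node dies while handling events, the old ordering leaves a stale ChannelManager on disk next to a fresh monitor, while the new ordering has already persisted the manager.

```rust
// Illustrative-only model of the crash window described above; not rust-lightning code.
#[derive(Debug, PartialEq)]
enum ManagerOnDisk { Stale, Fresh }

// One background-processor pass after ChannelManager state has advanced
// (and the channel monitor is already persisted). `persist_before_events`
// selects between the old and new orderings; `crash_during_events` models
// the node dying while the user's event handler runs.
fn iteration(crash_during_events: bool, persist_before_events: bool) -> ManagerOnDisk {
    let mut manager = ManagerOnDisk::Stale;
    if persist_before_events {
        // New ordering: persist right after the potentially-blocking call.
        manager = ManagerOnDisk::Fresh;
    }
    if crash_during_events {
        // Whatever made it to disk before the crash is what startup sees.
        return manager;
    }
    // No crash: both orderings eventually persist the manager.
    ManagerOnDisk::Fresh
}

fn main() {
    // Old ordering + crash: stale manager beside an updated monitor -> lost channel.
    assert_eq!(iteration(true, false), ManagerOnDisk::Stale);
    // New ordering + crash: manager was persisted first, so startup is consistent.
    assert_eq!(iteration(true, true), ManagerOnDisk::Fresh);
    println!("ok");
}
```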

@codecov-commenter

codecov-commenter commented Apr 21, 2022

Codecov Report

Merging #1436 (9dfd5ee) into main (d0f69f7) will increase coverage by 0.54%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1436      +/-   ##
==========================================
+ Coverage   90.86%   91.40%   +0.54%     
==========================================
  Files          75       75              
  Lines       41420    44598    +3178     
  Branches    41420    44598    +3178     
==========================================
+ Hits        37636    40766    +3130     
- Misses       3784     3832      +48     
Impacted Files                              Coverage Δ
lightning-background-processor/src/lib.rs   95.66% <100.00%> (+0.44%) ⬆️
lightning/src/ln/functional_tests.rs        97.12% <0.00%> (+0.04%) ⬆️
lightning-invoice/src/utils.rs              97.75% <0.00%> (+1.03%) ⬆️
lightning/src/routing/router.rs             94.03% <0.00%> (+1.45%) ⬆️
lightning/src/ln/channel.rs                 90.42% <0.00%> (+2.05%) ⬆️
lightning-persister/src/util.rs             98.87% <0.00%> (+2.82%) ⬆️
lightning/src/ln/channelmanager.rs          87.94% <0.00%> (+3.24%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d0f69f7...9dfd5ee.

@vincenzopalazzo (Contributor) left a comment

LGTM

@arik-so (Contributor) left a comment

Are we sure this will make a big difference, considering a) it's a loop and b) the other persistence calls are still happening after process_events?

@TheBlueMatt
Collaborator Author

> Are we sure this will make a big difference

I mean, no, but most "normal" nodes aren't doing a lot of stuff regularly - in the case that led to this PR the node was sending a payment and otherwise not doing anything. In that case, where we're starting from all threads doing nothing, I'd think this will change the "received payment failure case" to retry after persisting the ChannelManager, which is exactly the important change for this case.

> the other persistence calls are still happening after process_events?

which "other persistence calls"? Nothing happens after the event processing now.

@arik-so
Contributor

arik-so commented Apr 25, 2022

I was mostly referring to this line: https://github.com/lightningdevkit/rust-lightning/pull/1436/files#diff-3c70a8bcbb58522dbe589e774a85bbb611adab7142bb1eacec4fb7828cb3ee06R224

But if we're reasonably confident that it was on the first iteration, I'm ACKing

arik-so previously approved these changes Apr 25, 2022
Comment on lines 210 to 212
// hence it comes last here. When the ChannelManager finishes whatever its doing,
// we want to ensure we get into `persist_manager` as quickly as we can, especially
// without running the normal event processing above and handing events to users.
Contributor


Hmm, not clear to me what "normal event processing" refers to. Also, within chanman.process_pending_events, it seems like we do hand events to users without calling persist_manager first? Missing something..

Collaborator Author


I added more docs, let me know if it's a bit clearer.

@TheBlueMatt TheBlueMatt dismissed stale reviews from arik-so and vincenzopalazzo via 9dfd5ee April 25, 2022 19:31
@TheBlueMatt
Collaborator Author

Squashed fixup.

@arik-so (Contributor) left a comment

Re-ACKing

@valentinewallace valentinewallace merged commit 72069bf into lightningdevkit:main Apr 26, 2022

5 participants