-
Notifications
You must be signed in to change notification settings - Fork 421
fix: peer is disconnected for not responding to ping #1051
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: peer is disconnected for not responding to ping #1051
Conversation
|
Thanks for tracking this down! This does sound like an lnd bug if it cannot respond to pings even though there are gossip messages in its outbound queue - how large does it allow the outbound queue to grow (presumably it has to pretty tightly limit it to prevent DoS vulns where you can crash a node by having lnd fill its outbound message buffer?). In either case, we can't just consider "messages pending" like this, we need to at least track when we last received a message from the peer. More broadly, if a peer cannot respond to messages within five seconds, we really need to consider them "disconnected" from an HTLC-relay perspective as the peer might as well be useless, so maybe we should still tell the |
Codecov Report
@@ Coverage Diff @@
## main #1051 +/- ##
==========================================
- Coverage 90.82% 90.81% -0.02%
==========================================
Files 65 65
Lines 32801 32801
==========================================
- Hits 29793 29789 -4
- Misses 3008 3012 +4
Continue to review full report at Codecov.
|
|
I checked with the lnd folks and they indicated that ping should be prioritized over gossip messages, so I'm not entirely sure how you got a 5 second lag - they should be responding to the ping pretty quick. Can you share what version of lnd you saw this on, a bit more about your setup (remote node? over what kind of internet connection? is bufferbloat likely? etc), as well as the local LDK setup (sample node? custom code? rust client or another language? built with |
|
Running LND 0.13.0-beta on a RPi 4 which is on the same LAN as me, connecting over WiFi. Also running a modified version of lightning-ldk in an int-test so debug mode I think. I can see from the logs that LND receives the ping a couple seconds after its been sent from ldk, immediately responds, but then its not received by ldk before being disconnected. Yes I think its quite likely that my network is saturated during this sync process. |
|
I just tried running my modified version of ldk in release mode and the problem appears resolved. I can see ping/pong messages being responded to immediately. So debug mode in rust appears to be too slow to respond in time. Sorry for the false alarm, I'll probably modify my int-test to use release binaries. |
Ah! Ok, so its entirely possible the issue here is just that the signature verification of the gossip messages is too slow when compiled with zero optimizations. I think the right fix would just be changing our timeout when compiled without optimization/in debug mode. In the mean time, you may try adding this to your Cargo.toml to run basic optimizations which may resolve it, if not, try opt-level=2 |
|
|
|
I updated this PR to increase the timeout to 30 seconds (I observe ~15s delay) when running in debug mode. This works without needing to increase the opt-level. |
| #[cfg(not(debug_assertions))] | ||
| const PING_TIMER: u64 = 5; | ||
| #[cfg(debug_assertions)] | ||
| const PING_TIMER: u64 = 30; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment here noting that 30 seconds on some devices results in signature operations taking long enough that we disconnect?
jkczyz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you squash into one commit?
| /// Signature operations take a lot longer without compiler optimisations. | ||
| /// Increasing the ping timer allows for this but slower devices will be disconnected if the timeout is reached. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: keep lines to 100 characters when possible
6957e0d to
af4eb23
Compare
I'm testing syncing the mainnet graph from LND. Found that the peer_handler disconnects a peer after PING_TIMER (5 seconds) if it has not responded to a ping message. It does this even in the middle of syncing when there are many pending messages to process before getting to the pong.
I checked LND, it adds the response to its outgoing queue as soon as it receives the ping https://github.com/lightningnetwork/lnd/blob/dee9b9a9b13737309bc8c1b57e0ee6a68bfee492/peer/brontide.go#L1371
I've fixed it with an extra check to see if there are currently messages from the peer being processed. Can see that both peers are sending and receiving pings/pongs during sync now.
Not sure if its correct to use pending_read_is_header for this though? Could maybe keep an Instant of the last message of any type received and check that is less than the PING_TIMER.