Skip to content

Conversation

@JosephGoulden
Copy link
Contributor

I'm testing syncing the mainnet graph from LND. Found that the peer_handler disconnects a peer after PING_TIMER (5 seconds) if it has not responded to a ping message. It does this even in the middle of syncing when there are many pending messages to process before getting to the pong.

I checked LND, it adds the response to its outgoing queue as soon as it receives the ping https://github.com/lightningnetwork/lnd/blob/dee9b9a9b13737309bc8c1b57e0ee6a68bfee492/peer/brontide.go#L1371

I've fixed it with an extra check to see if there are currently messages from the peer being processed. Can see that both peers are sending and receiving pings/pongs during sync now.

Not sure if its correct to use pending_read_is_header for this though? Could maybe keep an Instant of the last message of any type received and check that is less than the PING_TIMER.

@TheBlueMatt
Copy link
Collaborator

Thanks for tracking this down!

This does sound like an lnd bug if it cannot respond to pings even though there are gossip messages in its outbound queue - how large does it allow the outbound queue to grow (presumably it has to pretty tightly limit it to prevent DoS vulns where you can crash a node by having lnd fill its outbound message buffer?).

In either case, we can't just consider "messages pending" like this, we need to at least track when we last received a message from the peer. More broadly, if a peer cannot respond to messages within five seconds, we really need to consider them "disconnected" from an HTLC-relay perspective as the peer might as well be useless, so maybe we should still tell the ChannelManager that the peer is disconnected but keep the socket connection alive.

@codecov
Copy link

codecov bot commented Aug 18, 2021

Codecov Report

Merging #1051 (af4eb23) into main (803d804) will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1051      +/-   ##
==========================================
- Coverage   90.82%   90.81%   -0.02%     
==========================================
  Files          65       65              
  Lines       32801    32801              
==========================================
- Hits        29793    29789       -4     
- Misses       3008     3012       +4     
Impacted Files Coverage Δ
lightning-background-processor/src/lib.rs 93.93% <ø> (ø)
lightning/src/ln/functional_tests.rs 97.44% <0.00%> (-0.07%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 803d804...af4eb23. Read the comment docs.

@TheBlueMatt
Copy link
Collaborator

I checked with the lnd folks and they indicated that ping should be prioritized over gossip messages, so I'm not entirely sure how you got a 5 second lag - they should be responding to the ping pretty quick. Can you share what version of lnd you saw this on, a bit more about your setup (remote node? over what kind of internet connection? is bufferbloat likely? etc), as well as the local LDK setup (sample node? custom code? rust client or another language? built with --release or not?)?

@TheBlueMatt
Copy link
Collaborator

cc @alexbosworth

@JosephGoulden
Copy link
Contributor Author

Running LND 0.13.0-beta on a RPi 4 which is on the same LAN as me, connecting over WiFi. Also running a modified version of lightning-ldk in an int-test so debug mode I think.

I can see from the logs that LND receives the ping a couple seconds after its been sent from ldk, immediately responds, but then its not received by ldk before being disconnected.

Yes I think its quite likely that my network is saturated during this sync process.

@JosephGoulden
Copy link
Contributor Author

I just tried running my modified version of ldk in release mode and the problem appears resolved. I can see ping/pong messages being responded to immediately. So debug mode in rust appears to be too slow to respond in time.

Sorry for the false alarm, I'll probably modify my int-test to use release binaries.

@TheBlueMatt
Copy link
Collaborator

I just tried running my modified version of ldk in release mode and the problem appears resolved. I can see ping/pong messages being responded to immediately. So debug mode in rust appears to be too slow to respond in time.

Ah! Ok, so its entirely possible the issue here is just that the signature verification of the gossip messages is too slow when compiled with zero optimizations. I think the right fix would just be changing our timeout when compiled without optimization/in debug mode. In the mean time, you may try adding this to your Cargo.toml to run basic optimizations which may resolve it, if not, try opt-level=2

[profile.dev]
opt-level = 1

@JosephGoulden
Copy link
Contributor Author

opt-level = 1 works great.

@JosephGoulden
Copy link
Contributor Author

I updated this PR to increase the timeout to 30 seconds (I observe ~15s delay) when running in debug mode. This works without needing to increase the opt-level.

#[cfg(not(debug_assertions))]
const PING_TIMER: u64 = 5;
#[cfg(debug_assertions)]
const PING_TIMER: u64 = 30;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here noting that 30 seconds on some devices results in signature operations taking long enough that we disconnect?

Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you squash into one commit?

Comment on lines 55 to 56
/// Signature operations take a lot longer without compiler optimisations.
/// Increasing the ping timer allows for this but slower devices will be disconnected if the timeout is reached.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: keep lines to 100 characters when possible

@TheBlueMatt TheBlueMatt merged commit 2ced708 into lightningdevkit:main Aug 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants