
Peer Storage (Part 3): Identifying Lost Channel States #3897


Open

adi2011 wants to merge 10 commits into main from peer-storage/serialise-deserialise

Conversation

Contributor

@adi2011 adi2011 commented Jun 28, 2025

In this PR, we begin serializing the ChannelMonitors and sending them to our peers so that, upon retrieval, we can determine whether any channel states were lost.

The next PR will be the final one, where we use FundRecoverer to initiate a force close and potentially go on-chain using a penalty transaction.

Sorry for the delay!


ldk-reviews-bot commented Jun 28, 2025

👋 Thanks for assigning @TheBlueMatt as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@tnull tnull requested review from tnull and removed request for joostjager June 28, 2025 11:17
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch 2 times, most recently from a35566a to 4c9f3c3 on June 29, 2025 05:04

codecov bot commented Jun 29, 2025

Codecov Report

❌ Patch coverage is 65.23810% with 73 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.73%. Comparing base (ff279d6) to head (b2cbe73).
⚠️ Report is 107 commits behind head on main.

Files with missing lines | Patch % | Lines
lightning/src/chain/channelmonitor.rs | 50.35% | 12 Missing and 57 partials ⚠️
lightning/src/ln/channelmanager.rs | 94.02% | 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3897      +/-   ##
==========================================
- Coverage   88.91%   88.73%   -0.18%     
==========================================
  Files         173      173              
  Lines      123393   124079     +686     
  Branches   123393   124079     +686     
==========================================
+ Hits       109717   110106     +389     
- Misses      11216    11579     +363     
+ Partials     2460     2394      -66     
Flag | Coverage Δ
fuzzing | ?
tests | 88.73% <65.23%> (-0.01%) ⬇️


///
/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
#[rustfmt::skip]
pub(crate) fn write_util<Signer: EcdsaChannelSigner, W: Writer>(channel_monitor: &ChannelMonitorImpl<Signer>, is_stub: bool, writer: &mut W) -> Result<(), Error> {
Collaborator

@wpaulino what do you think we could reasonably cut here to reduce the size of a ChannelMonitor, without making the emergency-case ChannelMonitors so different from the regular ones that we induce more code changes across channelmonitor.rs? Obviously we should avoid counterparty_claimable_outpoints, but how much code is gonna break in doing so?

Contributor

Not too familiar with the goals here, but if the idea is for the emergency-case ChannelMonitor to be able to recover funds, wouldn't it need to handle a commitment confirmation from either party? That means we need to track most things, even counterparty_claimable_outpoints (without the sources though) since the counterparty could broadcast a revoked commitment.

Collaborator

Basically. I think ideally we find a way to store everything (required) but counterparty_claimable_outpoints so that we can punish the counterparty on their balance+reserve if they broadcast a stale state, even if not HTLCs (though of course they can't claim the HTLCs without us being able to punish them on the next stage). Not sure how practical that is today without counterparty_claimable_outpoints but I think that's the goal.

@adi2011 maybe for now let's just write the full monitors, but leave a TODO to strip out what we can later. For larger nodes that means all our monitors will be too large and we'll never back any up but that's okay.

Contributor Author

Yes, eventually we'll need to figure out what can be stripped from ChannelMonitors. In CLN, we use a separate struct that holds only the minimal data needed to reconstruct a ChannelMonitor with just the essential information to grab funds or penalise the peer.

@ldk-reviews-bot

🔔 1st Reminder

Hey @tnull! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

Contributor

@tnull tnull left a comment

Took a first look, but I'll hold off on going into more detail until we've decided which way to go with the ChannelMonitor stub.

},

Err(e) => {
panic!("Wrong serialisation of PeerStorageMonitorHolderList: {}", e);
Contributor

I don't think we should ever panic in any of this code. Yes, something might be wrong if we have peer storage data we can't read anymore, but that's no reason to refuse to at least keep any other channels operational.

Contributor Author

Yes, that makes sense; I think we should only panic if we have determined that we have lost some channel state.

Collaborator

@TheBlueMatt TheBlueMatt left a comment

Few more comments, let's move forward without blocking on the ChannelMonitor serialization stuff.


let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();

// TODO(aditya): Choose n random channels so that peer storage does not exceed 64k.
Collaborator

This should be pretty easy? We have random bytes, just make an outer loop that selects a random monitor (by doing monitors.iter().skip(random_usize % monitors.len()).next())
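As a rough illustration of that loop, here is a minimal, self-contained sketch; a plain HashMap and a caller-supplied random value stand in for the real monitors map and get_secure_random_bytes(), so all names are illustrative rather than LDK API:

use std::collections::HashMap;

// Sketch only: `monitors` stands in for the ChainMonitor's monitor map and
// `random_usize` for a value derived from secure random bytes.
fn select_monitors(
	monitors: &HashMap<u64, Vec<u8>>, random_usize: usize, max_size: usize,
) -> Vec<u64> {
	let mut selected = Vec::new();
	let mut total = 0;
	// Visit each skip-offset exactly once, starting at a random point, so no
	// monitor is picked twice while the map is unchanged.
	for step in 0..monitors.len() {
		if total >= max_size {
			break;
		}
		let idx = random_usize.wrapping_add(step) % monitors.len();
		if let Some((id, serialized)) = monitors.iter().skip(idx).next() {
			total += serialized.len();
			selected.push(*id);
		}
	}
	selected
}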

Contributor

@tnull tnull left a comment

Did a ~first pass.

This needs a rebase now, in particular now that #3922 landed.

@@ -810,10 +813,53 @@ where
}

fn send_peer_storage(&self, their_node_id: PublicKey) {
// TODO: Serialize `ChannelMonitor`s inside `our_peer_storage`.

static MAX_PEER_STORAGE_SIZE: usize = 65000;
Contributor

This should be a const rather than a static, I think? Also, it would probably make sense to add this at the module level, with some docs.

Collaborator

Also isn't the max size 64 KiB, not 65K?

Contributor Author

Oh, my bad, thanks for pointing this out. It should be 65531.
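For reference, a sketch of what the module-level constant could look like, assuming the 65531-byte figure above; the exact value and doc wording are up to the implementation:

/// Maximum number of bytes of serialized channel data to pack into a single
/// `peer_storage` message (the 64 KiB lightning message limit minus overhead).
const MAX_PEER_STORAGE_SIZE: usize = 65531;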

let mut curr_size = 0;

// Randomising Keys in the HashMap to fetch monitors without repetition.
let mut keys: Vec<&ChannelId> = monitors.keys().collect();
Contributor

Can we make this a bit cleaner by using the proposed iterator skipping approach in the loop below, maybe while simply keeping track of which monitors we already wrote?

Contributor Author

I will address this in the next fixup.

Contributor

Seems this is still unaddressed?

Contributor

Gentle ping here.

Contributor Author

Created a BTreeSet to keep track of the stored monitors and ran a while loop until we either ran out of peer storage capacity or all monitors were stored.

monitors_list.monitors.push(peer_storage_monitor);
},
Err(_) => {
panic!("Can not write monitor for {}", mon.monitor.channel_id())
Contributor

Really, please avoid these explicit panics in any of this code.

Contributor Author

This would only panic if there is some issue with write_util, which would suggest we are unable to serialise the ChannelMonitor. Should we just log the error here instead?

Contributor

Mhh, while I'd still prefer to never panic in any of the peer storage code, indeed this would likely only get hit if we run out of memory or similar. If we keep it, let's just expect on the write_internal call, which avoids all this matching.
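As a sketch of that suggestion (assuming write_util_internal returns a Result as in the surrounding code), the whole match collapses to a single expect:

// Treat a serialization failure as a bug rather than matching on it.
write_util_internal(&chan_mon, true, &mut ser_chan)
	.expect("can not write Channel Monitor for peer storage message");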

@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from 676afbc to 1111f05 on July 16, 2025 20:41
let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();
let random_usize = usize::from_le_bytes(random_bytes[0..core::mem::size_of::<usize>()].try_into().unwrap());
Contributor

nit: Let's keep the length in a separate const, to avoid making this line overly long/noisy.

Contributor Author

Fixed, thanks!


match write_util_internal(&chan_mon, true, &mut ser_chan) {
Ok(_) => {
// Adding size of peer_storage_monitor.
curr_size += ser_chan.0.serialized_length()
Contributor

If we move this check below constructing PeerStorageMonitorHolder, can we just use its serialized_length implementation instead of tallying up the individual fields here?

Contributor Author

My bad, thanks for pointing this out. Fixed.

},
None => {
// TODO: Figure out if this channel is so old that we have forgotten about it.
panic!("Lost a channel {}", &mon_holder.channel_id);
Contributor

Why panic here? This should be the common case, no?

Contributor Author

We need this when we’ve completely forgotten about a channel but it’s still active. The only false positives occur if we’ve closed a channel and then forgotten about it. Any ideas on how we could detect those?

Contributor

We need this when we’ve completely forgotten about a channel but it’s still active.

I don't think this is accurate? Wouldn't we end up here simply in this scenario:

  1. We have a channel, store an encrypted backup with a peer
  2. The peer goes offline
  3. We close the channel, drop it from the per_peer_state
  4. The peer comes back online
  5. We retrieve and decrypt a peer backup with a channel state that we no longer know anything about.
  6. We panic and crash the node

This can happen very easily / is the common case, no? Also note this same flow could intentionally get abused by a malicious peer, too. The counterparty would simply need to guess that we include a certain channel in the stored blob, remember it, wait until the channel is closed, and then simply replay the old backup to crash our node.

I still maintain we should never panic in any of this PeerStorage related code, maybe mod the above case where we're positive that our local state is outdated compared to the backed-up version.

That said, we'll indeed need to discern between the two cases (unexpected data loss / expected data loss). One way to do this would be to first check whether we see a funding transaction spend, although note that this would be an async flow that we can't just directly initiate here. Or we need to maintain a list of all closed channels and remember it forever. The latter might indeed be the simplest and safest way, if we're fine with keeping the additional data. In any case, we need to find a solution here before moving on, as panicking here is unacceptable.

Contributor Author

Yes, I agree, this is a false positive where the node crashes, but it shouldn’t. The happy path here is:

  1. We have a channel with a peer.
  2. Something happens and we lose the channel (but it is unclosed); the node cannot identify it either.
  3. We retrieve the ChannelMonitor from peer storage.
  4. We realise that this is an unknown, old channel, and the node crashes because it no longer has any record of an active channel.

There should be a way to determine whether a channel is closed, so that if we receive an unknown channel and its funding TXO is unspent, we crash the node for recovery. But if the funding is already spent, we shouldn’t worry unless they have published an old state. Right?

Collaborator

Yes, I agree, this is a false positive where the node crashes

We cannot tolerate a false positive if we're going to panic.

There should be a way to determine whether a channel is closed, so that if we receive an unknown channel and its funding TXO is unspent, we crash the node for recovery. But if the funding is already spent, we shouldn’t worry unless they have published an old state. Right?

There is no way to do this in LDK today, and I don't think we'll ever get one. We don't have some global way to query if an arbitrary UTXO is unspent (well, we do in gossip but it's not always available), and once a channel is closed and the funds swept we may delete the ChannelMonitor because there's no reason to keep it around. Keeping around information about every past channel we've ever had just to detect a lost channel here doesn't seem worth it to me.

It does mean we won't detect if we lose all of our state, but I think that's okay - a user will notice that their entire balance is missing :)

Contributor Author

Yes, it should not be there. I put up the TODO so that we can have a discussion around this.
Changing the panic to log_debug.


impl_writeable_tlv_based!(PeerStorageMonitorHolder, {
(0, channel_id, required),
(1, counterparty_node_id, required),
Contributor

For new objects, you should be able to make all fields even-numbered as they are required from the getgo, basically.
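A sketch of the even-only layout being suggested, assuming the field set discussed elsewhere in this PR (channel_id, counterparty_node_id, min_seen_secret, monitor_bytes); the exact TLV variants are illustrative:

impl_writeable_tlv_based!(PeerStorageMonitorHolder, {
	(0, channel_id, required),
	(2, counterparty_node_id, required),
	(4, min_seen_secret, required),
	(6, monitor_bytes, required_vec),
});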

Contributor Author

Fixed.

/// wrapped inside [`PeerStorage`].
///
/// [`PeerStorage`]: crate::ln::msgs::PeerStorage
pub(crate) struct PeerStorageMonitorHolderList {
Contributor

I think we should be able to avoid this new type wrapper if we add an impl_writeable_for_vec entry for PeerStorageMonitorHolder in lightning/src/util/ser.rs.

Contributor

Seems this is still unaddressed?

Contributor Author

Yes, sorry, I will address this in the next fixup.

Contributor Author

Fixed

@adi2011 adi2011 requested a review from TheBlueMatt July 17, 2025 15:12
@ldk-reviews-bot

🔔 1st Reminder

Hey @TheBlueMatt! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

Collaborator

@TheBlueMatt TheBlueMatt left a comment

Looks like you forgot to push after addressing @tnull's last comments? Also needs a rebase.

@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from 6dede0e to c9baefb on July 21, 2025 13:12
@adi2011 adi2011 requested a review from TheBlueMatt July 21, 2025 13:12
Contributor Author

adi2011 commented Jul 21, 2025

Oh sorry, my bad, I have pushed the fixups and rebased.

Collaborator

@TheBlueMatt TheBlueMatt left a comment

I think if we just remove the is_stub check and the spurious panic we can land this (feel free to squash fixups, IMO). Sadly we will need to add a cfg flag before 0.2 (unless we can nail down the serialization format we want), but it doesn't have to hold up this PR or progress.


self.lockdown_from_offchain.write(writer)?;
self.holder_tx_signed.write(writer)?;
if !is_stub {
Collaborator

For now let's not do anything differently here; I don't think skipping the onchain_tx_handler is safe, but we're not quite sure what the right answer is yet. Because we won't (yet) finalize the serialization format, we should also cfg-tag out the actual send logic in send_peer_storage, sadly.

Contributor Author

Should I not use is_stub inside the function, or should I remove the parameter from the function definition?

Contributor Author

Also, should I make a new cfg tag for peer-storage or use an existing one?

Collaborator

Should I not use is_stub inside the function, or should I remove the parameter from the function definition?

I think it's fine to add it with a TODO: figure out which fields should go here and which do not.

Also, should I make a new cfg tag for peer-storage or use an existing one?

I believe a new one (you'll have to add it in ci/ci-tests.sh as well).

Contributor Author

Cool, I will cfg-tag the sending logic. In the next PR we can discuss when to send peer storage along with what to omit from ChannelMonitors.
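For illustration, a minimal sketch of the gating pattern discussed, with a toy free function standing in for the real ChainMonitor::send_peer_storage; the flag would be enabled at build time, e.g. via RUSTFLAGS="--cfg peer_storage":

// Compiled in only when built with `--cfg peer_storage`.
#[cfg(peer_storage)]
fn send_peer_storage(data: &[u8]) {
	// ... serialize the selected ChannelMonitors and queue the message ...
	let _ = data;
}

// No-op stand-in so callers keep compiling when the flag is absent.
#[cfg(not(peer_storage))]
fn send_peer_storage(_data: &[u8]) {}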


@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch 5 times, most recently from 4304853 to e76f538 on July 24, 2025 11:20
@adi2011 adi2011 requested review from tnull and TheBlueMatt July 24, 2025 11:20
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from e76f538 to 75a8fd6 on July 24, 2025 11:24
Collaborator

@TheBlueMatt TheBlueMatt left a comment

I think this is fine now, after the notes here are fixed and rustfmt is fixed.



Comment on lines 834 to 838
while curr_size < MAX_PEER_STORAGE_SIZE
&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
{
let idx = random_usize % monitors.len();
stored_chanmon_idx.insert(idx + 1);
Contributor

The loop condition has a potential issue where it might not terminate if the same random index is repeatedly selected. Since stored_chanmon_idx is a BTreeSet, inserting the same value multiple times doesn't increase its size. If random_usize % monitors.len() keeps generating the same index, the condition *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len() will remain true indefinitely.

Consider modifying the approach to ensure termination, such as:

  1. Using a counter to limit iterations
  2. Tracking already-processed indices in a separate set
  3. Using a deterministic sequence instead of random selection

This would prevent potential infinite loops while still achieving the goal of selecting monitors for peer storage.

Suggested change
while curr_size < MAX_PEER_STORAGE_SIZE
&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
{
let idx = random_usize % monitors.len();
stored_chanmon_idx.insert(idx + 1);
while curr_size < MAX_PEER_STORAGE_SIZE
&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
&& stored_chanmon_idx.len() < monitors.len()
{
let idx = random_usize % monitors.len();
let inserted = stored_chanmon_idx.insert(idx + 1);
if inserted {
curr_size += 1;
} else {
// If we couldn't insert, try a different index next time
random_usize = random_usize.wrapping_add(1);
}


Contributor Author

@tnull, this is the same concern I had in mind as well. Wasn't the previous approach more efficient and secure?

Contributor

How about something along these lines?:

diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index 4a40ba872..dde117ca9 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -39,6 +39,7 @@ use crate::chain::{ChannelMonitorUpdateStatus, Filter, WatchedOutput};
 use crate::events::{self, Event, EventHandler, ReplayEvent};
 use crate::ln::channel_state::ChannelDetails;
 use crate::ln::msgs::{self, BaseMessageHandler, Init, MessageSendEvent, SendOnlyMessageHandler};
+#[cfg(peer_storage)]
 use crate::ln::our_peer_storage::{DecryptedOurPeerStorage, PeerStorageMonitorHolder};
 use crate::ln::types::ChannelId;
 use crate::prelude::*;
@@ -53,6 +54,8 @@ use crate::util::persist::MonitorName;
 use crate::util::ser::{VecWriter, Writeable};
 use crate::util::wakers::{Future, Notifier};
 use bitcoin::secp256k1::PublicKey;
+#[cfg(peer_storage)]
+use core::iter::Cycle;
 use core::ops::Deref;
 use core::sync::atomic::{AtomicUsize, Ordering};
 
@@ -808,6 +811,7 @@ where
 
 	/// This function collects the counterparty node IDs from all monitors into a `HashSet`,
 	/// ensuring unique IDs are returned.
+	#[cfg(peer_storage)]
 	fn all_counterparty_node_ids(&self) -> HashSet<PublicKey> {
 		let mon = self.monitors.read().unwrap();
 		mon.values().map(|monitor| monitor.monitor.get_counterparty_node_id()).collect()
@@ -815,51 +819,71 @@ where
 
 	#[cfg(peer_storage)]
 	fn send_peer_storage(&self, their_node_id: PublicKey) {
-		#[allow(unused_mut)]
 		let mut monitors_list: Vec<PeerStorageMonitorHolder> = Vec::new();
 		let random_bytes = self.entropy_source.get_secure_random_bytes();
 
 		const MAX_PEER_STORAGE_SIZE: usize = 65531;
 		const USIZE_LEN: usize = core::mem::size_of::<usize>();
-		let mut usize_bytes = [0u8; USIZE_LEN];
-		usize_bytes.copy_from_slice(&random_bytes[0..USIZE_LEN]);
-		let random_usize = usize::from_le_bytes(usize_bytes);
+		let mut random_bytes_cycle_iter = random_bytes.iter().cycle();
+
+		let mut current_size = 0;
+		let monitors_lock = self.monitors.read().unwrap();
+		let mut channel_ids = monitors_lock.keys().copied().collect();
+
+		fn next_random_id(
+			channel_ids: &mut Vec<ChannelId>,
+			random_bytes_cycle_iter: &mut Cycle<core::slice::Iter<u8>>,
+		) -> Option<ChannelId> {
+			if channel_ids.is_empty() {
+				return None;
+			}
 
-		let mut curr_size = 0;
-		let monitors = self.monitors.read().unwrap();
-		let mut stored_chanmon_idx = alloc::collections::BTreeSet::<usize>::new();
-		// Used as a fallback reference if the set is empty
-		let zero = 0;
+			let random_idx = {
+				let mut usize_bytes = [0u8; USIZE_LEN];
+				usize_bytes.iter_mut().for_each(|b| {
+					*b = *random_bytes_cycle_iter.next().expect("A cycle never ends")
+				});
+				// Take one more to introduce a slight misalignment.
+				random_bytes_cycle_iter.next().expect("A cycle never ends");
+				usize::from_le_bytes(usize_bytes) % channel_ids.len()
+			};
+
+			Some(channel_ids.swap_remove(random_idx))
+		}
 
-		while curr_size < MAX_PEER_STORAGE_SIZE
-			&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
+		while let Some(channel_id) = next_random_id(&mut channel_ids, &mut random_bytes_cycle_iter)
 		{
-			let idx = random_usize % monitors.len();
-			stored_chanmon_idx.insert(idx + 1);
-			let (cid, mon) = monitors.iter().skip(idx).next().unwrap();
+			let monitor_holder = if let Some(monitor_holder) = monitors_lock.get(&channel_id) {
+				monitor_holder
+			} else {
+				debug_assert!(
+					false,
+					"Tried to access non-existing monitor, this should never happen"
+				);
+				break;
+			};
 
-			let mut ser_chan = VecWriter(Vec::new());
-			let min_seen_secret = mon.monitor.get_min_seen_secret();
-			let counterparty_node_id = mon.monitor.get_counterparty_node_id();
+			let mut serialized_channel = VecWriter(Vec::new());
+			let min_seen_secret = monitor_holder.monitor.get_min_seen_secret();
+			let counterparty_node_id = monitor_holder.monitor.get_counterparty_node_id();
 			{
-				let chan_mon = mon.monitor.inner.lock().unwrap();
+				let inner_lock = monitor_holder.monitor.inner.lock().unwrap();
 
-				write_chanmon_internal(&chan_mon, true, &mut ser_chan)
+				write_chanmon_internal(&inner_lock, true, &mut serialized_channel)
 					.expect("can not write Channel Monitor for peer storage message");
 			}
 			let peer_storage_monitor = PeerStorageMonitorHolder {
-				channel_id: *cid,
+				channel_id,
 				min_seen_secret,
 				counterparty_node_id,
-				monitor_bytes: ser_chan.0,
+				monitor_bytes: serialized_channel.0,
 			};
 
-			// Adding size of peer_storage_monitor.
-			curr_size += peer_storage_monitor.serialized_length();
-
-			if curr_size > MAX_PEER_STORAGE_SIZE {
-				break;
+			current_size += peer_storage_monitor.serialized_length();
+			if current_size > MAX_PEER_STORAGE_SIZE {
+				continue;
 			}
+
 			monitors_list.push(peer_storage_monitor);
 		}

There might still be some room for improvement, but IMO something like this would be a lot more readable.

Contributor Author

+			if current_size > MAX_PEER_STORAGE_SIZE {
+				continue;
 			}

Why not break here?

Contributor

@tnull tnull Aug 1, 2025

Say we have one huge monitor that doesn't fit in the backup and a few smaller ones. If we'd always abort whenever we draw the large monitor, we might unnecessarily skip backups for the smaller ones. Yes, we could get fancier around how/which monitors to include to make the best use out of the space given, but for now it seems fine to just walk the entire list?

Ah, but now looking at it again I spot a bug there, it should be:

let serialized_length = peer_storage_monitor.serialized_length();
if current_size + serialized_length > MAX_PEER_STORAGE_SIZE {
     continue;
 } else {
    current_size += serialized_length;
    monitors_list.push(peer_storage_monitor);
 }

@adi2011 adi2011 requested a review from tnull July 29, 2025 16:17

@@ -809,11 +813,57 @@ where
mon.values().map(|monitor| monitor.monitor.get_counterparty_node_id()).collect()
}

#[cfg(peer_storage)]
Contributor

I think you'll also need to introduce the cfg-gate in a few other places to get rid of the warning, e.g., above for all_counterparty_node_ids and on some imports.

Contributor Author

should I also cfg-gate entropy_source inside ChainMonitor?
If yes, I would have to cfg-gate a lot of things in other files as well...
I think it would be better to allow unused on it, since we are going to remove the peer-storage cfg gate in the next PR...

Contributor

Well, we need to add the cfg-gate wherever we'd otherwise get a warning. I'd prefer to avoid allow_unused, especially as it might easily slip through (i.e., we might not remember to remove it again going forward), while for the cfg guard we will be forced to do the cleanup.

Contributor Author

Okay, I have cfg-gated almost all the relevant places in chainmonitor.rs, but I decided to add PhantomData in ChainMonitor to avoid redeclaring the whole struct or the new function.

Contributor

@tnull tnull Aug 5, 2025

As mentioned elsewhere, it should be fine to just rename the entropy source _entropy_source for now, no need for PhantomData then.

@adi2011 adi2011 requested a review from tnull August 5, 2025 07:42
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch 3 times, most recently from e7dddc9 to 1de28b8 on August 5, 2025 08:31
Contributor

@tnull tnull left a comment

The PhantomData changes unfortunately make CI fail. I think it might be easiest to revert them and just use _entropy_source.

@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from 1de28b8 to b2cbe73 on August 5, 2025 18:23
@adi2011 adi2011 requested a review from tnull August 6, 2025 09:08
Contributor

@tnull tnull left a comment

The fuzz CI is failing due to:

thread 'full_stack::tests::test_no_existing_test_breakage' panicked at 'assertion failed: `(left == right)`
  left: `None`,
 right: `Some(1)`', src/full_stack.rs:1707:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I suspect this is likely related to the cfg-gate?

'PeerStorageMonitorHolder' is used to wrap a single ChannelMonitor; here we add some fields separately so that we do not need to read the whole ChannelMonitor to identify whether we have lost some states.

`PeerStorageMonitorHolderList` is used to keep the list of all the channels which would
be sent over the wire inside Peer Storage.
Fixed formatting for write() in ChannelMonitorImpl. This would make the next
commit cleaner by ensuring it only contains direct code shifts, without
unrelated formatting changes.
Create a utility function to prevent code duplication while writing ChannelMonitors.
Serialise them inside ChainMonitor::send_peer_storage and send them over.

Cfg-tag the sending logic because we are unsure of what to omit from ChannelMonitors stored inside
peer-storage.
Deserialise the ChannelMonitors and compare the data to determine if we have
lost some states.
Node should now determine lost states using retrieved peer storage.
Contributor Author

adi2011 commented Aug 11, 2025

Yes, entropy source is not being called while sending peer storage because we have cfg-gated it. Reverting the values in the fuzz test.

This needs a rebase due to conflict.

@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from 185bc52 to a588961 on August 11, 2025 07:04
@adi2011 adi2011 requested a review from tnull August 11, 2025 07:04
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from a588961 to 3e5d902 on August 11, 2025 07:06
@adi2011 adi2011 requested a review from TheBlueMatt August 11, 2025 07:19
Contributor

tnull commented Aug 11, 2025

Yes, entropy source is not being called while sending peer storage because we have cfg-gated it. Reverting the values in the fuzz test.

Probably best to also switch the values based on the cfg value, so we don't miss them and keep the capability to fuzz with peer storage?

Contributor

@tnull tnull left a comment

Fixups look good to me, feel free to squash. Some minor comments, but generally this seems close.

@@ -1187,7 +1187,7 @@ fn two_peer_forwarding_seed() -> Vec<u8> {
// inbound read from peer id 1 of len 255
ext_from_hex("0301ff", &mut test);
// beginning of accept_channel
ext_from_hex("0021 0000000000000000000000000000000000000000000000000000000000000e12 0000000000000162 00000000004c4b40 00000000000003e8 00000000000003e8 00000002 03f0 0005 030000000000000000000000000000000000000000000000000000000000000100 030000000000000000000000000000000000000000000000000000000000000200 030000000000000000000000000000000000000000000000000000000000000300 030000000000000000000000000000000000000000000000000000000000000400 030000000000000000000000000000000000000000000000000000000000000500 02660000000000000000000000000000", &mut test);
Contributor

@@ -266,7 +275,8 @@ pub struct ChainMonitor<
logger: L,
fee_estimator: F,
persister: P,
entropy_source: ES,

Contributor

nit: Drop spurious newline.

let mut cursor = io::Cursor::new(decrypted);
let mon_list = <Vec<PeerStorageMonitorHolder> as Readable>::read(&mut cursor)
.unwrap_or_else(|e| {
// This should NEVER happen.
Contributor

Should we debug_assert here?
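A sketch of what that could look like, assuming the closure currently just logs and falls back to an empty list; names mirror the surrounding code:

let mon_list = <Vec<PeerStorageMonitorHolder> as Readable>::read(&mut cursor)
	.unwrap_or_else(|e| {
		// Fail loudly on the (supposedly impossible) decode failure in debug
		// builds, but stay non-fatal in release builds.
		debug_assert!(false, "Failed to read PeerStorageMonitorHolder list: {:?}", e);
		log_debug!(logger, "Failed to read PeerStorageMonitorHolder list: {:?}", e);
		Vec::new()
	});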

None => {
log_debug!(
logger,
"Not able to find peer_state for the counterparty {}, channelId {}",
Contributor

nit: channel_id

@@ -17428,38 +17476,62 @@ mod tests {

#[test]
#[rustfmt::skip]
Contributor

Mind adding a commit removing the skip?


create_announced_chan_between_nodes(&nodes, 0, 1);
let (_, _, cid, _) = create_announced_chan_between_nodes(&nodes, 0, 1);
send_payment(&nodes[0], &vec!(&nodes[1])[..], 1000);
Contributor

Please just use &[]. If you pass a slice anyway, there is no reason to allocate a Vec first (here and elsewhere).
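For example, a hypothetical call mirroring the test above:

send_payment(&nodes[0], &[&nodes[1]], 1000);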

fn test_peer_storage() {
let chanmon_cfgs = create_chanmon_cfgs(2);
let (persister, chain_monitor);
Contributor

This line seems incomplete? Why did we add it here?

@@ -17117,38 +17165,62 @@ mod tests {

#[test]
#[rustfmt::skip]
#[cfg(peer_storage)]
#[should_panic(expected = "Lost channel state for channel ae3367da2c13bc1ceb86bf56418f62828f7ce9d6bfb15a46af5ba1f1ed8b124f.\n\
Contributor

The issue is, if I remove the should_panic the test panics during teardown (in the drop), not in the main test logic.

Not sure I'm following, you can still do this, no?:

diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index ec7014b42..261fd80ef 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -17477,9 +17477,6 @@ mod tests {
        #[test]
        #[rustfmt::skip]
        #[cfg(peer_storage)]
-       #[should_panic(expected = "Lost channel state for channel ae3367da2c13bc1ceb86bf56418f62828f7ce9d6bfb15a46af5ba1f1ed8b124f.\n\
-                               Received peer storage with a more recent state than what our node had.\n\
-                               Use the FundRecoverer to initiate a force close and sweep the funds.")]
        fn test_peer_storage() {
                let chanmon_cfgs = create_chanmon_cfgs(2);
                let (persister, chain_monitor);
@@ -17559,13 +17556,18 @@ mod tests {
                                nodes[0].node.handle_channel_reestablish(nodes[1].node.get_our_node_id(), msg);
                                assert_eq!(*node_id, nodes[0].node.get_our_node_id());
                        } else if let MessageSendEvent::SendPeerStorageRetrieval { ref node_id, ref msg } = msg {
-                               // Should Panic here!
-                               nodes[0].node.handle_peer_storage_retrieval(nodes[1].node.get_our_node_id(), msg.clone());
-                               assert_eq!(*node_id, nodes[0].node.get_our_node_id());
+                               let res = std::panic::catch_unwind(||
+                                       nodes[0].node.handle_peer_storage_retrieval(nodes[1].node.get_our_node_id(), msg.clone())
+                               );
+                               assert!(res.is_err());
+                               break;
                        } else {
                                panic!("Unexpected event")
                        }
                }
+               // When we panic'd, we expect to panic on `Drop`.
+               let res = std::panic::catch_unwind(|| drop(nodes));
+               assert!(res.is_err());
        }

        #[test]

(by downcasting the error you should even be able to assert the exact message if you prefer)

Collaborator

@TheBlueMatt TheBlueMatt left a comment

I think this looks good, but I'll wait until @tnull signs off and the fuzzer is fixed.

@TheBlueMatt
Collaborator

Probably best to also switch the values based on the cfg value, so we don't miss them and keep the capability to fuzz with peer storage?

Actually, don't bother, just drop the fuzz changes here, I'm gonna open a PR to make fuzz changes more robust against changes in LDK.
