
Peer Storage (Part 3): Identifying Lost Channel States #3897


Open

adi2011 wants to merge 10 commits into main from peer-storage/serialise-deserialise

Conversation

Contributor

@adi2011 adi2011 commented Jun 28, 2025

In this PR, we begin serializing the ChannelMonitors and sending them to our peers so that, upon retrieval, we can determine whether any channel states were lost.

The next PR will be the final one, where we use FundRecoverer to initiate a force close and potentially go on-chain using a penalty transaction.

Sorry for the delay!


ldk-reviews-bot commented Jun 28, 2025

👋 Thanks for assigning @TheBlueMatt as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@tnull tnull requested review from tnull and removed request for joostjager June 28, 2025 11:17
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch 2 times, most recently from a35566a to 4c9f3c3 on June 29, 2025 05:04

codecov bot commented Jun 29, 2025

Codecov Report

❌ Patch coverage is 65.23810% with 73 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.73%. Comparing base (ff279d6) to head (b2cbe73).
⚠️ Report is 107 commits behind head on main.

Files with missing lines | Patch % | Lines
lightning/src/chain/channelmonitor.rs | 50.35% | 12 Missing and 57 partials ⚠️
lightning/src/ln/channelmanager.rs | 94.02% | 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3897      +/-   ##
==========================================
- Coverage   88.91%   88.73%   -0.18%     
==========================================
  Files         173      173              
  Lines      123393   124079     +686     
  Branches   123393   124079     +686     
==========================================
+ Hits       109717   110106     +389     
- Misses      11216    11579     +363     
+ Partials     2460     2394      -66     
Flag | Coverage Δ
fuzzing | ?
tests | 88.73% <65.23%> (-0.01%) ⬇️


///
/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
#[rustfmt::skip]
pub(crate) fn write_util<Signer: EcdsaChannelSigner, W: Writer>(channel_monitor: &ChannelMonitorImpl<Signer>, is_stub: bool, writer: &mut W) -> Result<(), Error> {
Collaborator

@wpaulino what do you think we could reasonably cut here to reduce the size of a ChannelMonitor, without making the emergency-case ChannelMonitors so different from the regular ones that we induce more code changes across channelmonitor.rs? Obviously we should avoid counterparty_claimable_outpoints, but how much code is gonna break in doing so?

Contributor

Not too familiar with the goals here, but if the idea is for the emergency-case ChannelMonitor to be able to recover funds, wouldn't it need to handle a commitment confirmation from either party? That means we need to track most things, even counterparty_claimable_outpoints (without the sources though) since the counterparty could broadcast a revoked commitment.

Collaborator

Basically. I think ideally we find a way to store everything (required) but counterparty_claimable_outpoints so that we can punish the counterparty on their balance+reserve if they broadcast a stale state, even if not HTLCs (though of course they can't claim the HTLCs without us being able to punish them on the next stage). Not sure how practical that is today without counterparty_claimable_outpoints but I think that's the goal.

@adi2011 maybe for now let's just write the full monitors, but leave a TODO to strip out what we can later. For larger nodes that means all our monitors will be too large and we'll never back any up but that's okay.

Contributor Author

Yes, eventually we'll need to figure out what can be stripped from ChannelMonitors. In CLN, we use a separate struct that holds only the minimal data needed to reconstruct a ChannelMonitor with just the essential information to grab funds or penalise the peer.

@ldk-reviews-bot

🔔 1st Reminder

Hey @tnull! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

Contributor

@tnull tnull left a comment

Took a first look, but I'll hold off on going into more detail until we've decided which way to go with the ChannelMonitor stub.

},

Err(e) => {
panic!("Wrong serialisation of PeerStorageMonitorHolderList: {}", e);
Contributor

I don't think we should ever panic in any of this code. Yes, something might be wrong if we have peer storage data we can't read anymore, but that's no reason to refuse to at least keep any other channels operational.

Contributor Author

Yes, that makes sense; I think we should only panic if we have determined that we have lost some channel state.

Collaborator

@TheBlueMatt TheBlueMatt left a comment

Few more comments, let's move forward without blocking on the ChannelMonitor serialization stuff.


let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();

// TODO(aditya): Choose n random channels so that peer storage does not exceed 64k.
Collaborator

This should be pretty easy? We have random bytes, just make an outer loop that selects a random monitor (by doing monitors.iter().skip(random_usize % monitors.len()).next())
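As a rough illustration of that loop, here is a minimal, self-contained sketch; a plain HashMap and a caller-supplied random value stand in for the real monitors map and get_secure_random_bytes(), so all names are illustrative rather than LDK API:

use std::collections::HashMap;

// Sketch only: `monitors` stands in for the ChainMonitor's monitor map and
// `random_usize` for a value derived from secure random bytes.
fn select_monitors(
	monitors: &HashMap<u64, Vec<u8>>, random_usize: usize, max_size: usize,
) -> Vec<u64> {
	let mut selected = Vec::new();
	let mut total = 0;
	// Visit each skip-offset exactly once, starting at a random point, so no
	// monitor is picked twice while the map is unchanged.
	for step in 0..monitors.len() {
		if total >= max_size {
			break;
		}
		let idx = random_usize.wrapping_add(step) % monitors.len();
		if let Some((id, serialized)) = monitors.iter().skip(idx).next() {
			total += serialized.len();
			selected.push(*id);
		}
	}
	selected
}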

Contributor

@tnull tnull left a comment

Did a ~first pass.

This needs a rebase now, in particular now that #3922 landed.

@@ -810,10 +813,53 @@ where
}

fn send_peer_storage(&self, their_node_id: PublicKey) {
// TODO: Serialize `ChannelMonitor`s inside `our_peer_storage`.

static MAX_PEER_STORAGE_SIZE: usize = 65000;
Contributor

This should be a const rather than a static, I think? Also, it would probably make sense to add this at the module level, with some docs.

Collaborator

Also isn't the max size 64 KiB, not 65K?

Contributor Author

Oh, my bad, thanks for pointing this out. It should be 65531.
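For reference, a sketch of what the module-level constant could look like, assuming the 65531-byte figure above; the exact value and doc wording are up to the implementation:

/// Maximum number of bytes of serialized channel data to pack into a single
/// `peer_storage` message (the 64 KiB lightning message limit minus overhead).
const MAX_PEER_STORAGE_SIZE: usize = 65531;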

let mut curr_size = 0;

// Randomising Keys in the HashMap to fetch monitors without repetition.
let mut keys: Vec<&ChannelId> = monitors.keys().collect();
Contributor

Can we make this a bit cleaner by using the proposed iterator skipping approach in the loop below, maybe while simply keeping track of which monitors we already wrote?

Contributor Author

I will address this in the next fixup.

Contributor

Seems this is still unaddressed?

Contributor

Gentle ping here.

Contributor Author

Created a BTreeSet to keep track of the stored monitors and ran a while loop until we either ran out of peer storage capacity or all monitors were stored.

monitors_list.monitors.push(peer_storage_monitor);
},
Err(_) => {
panic!("Can not write monitor for {}", mon.monitor.channel_id())
Contributor

Really, please avoid these explicit panics in any of this code.

Contributor Author

This would only panic if there is some issue with write_util, which would suggest we are unable to serialise the ChannelMonitor. Should we just log the error here instead?

Contributor

Mhh, while I'd still prefer to never panic in any of the peer storage code, indeed this would likely only get hit if we run out of memory or similar. If we keep it, let's just expect on the write_internal call, which avoids all this matching.
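As a sketch of that suggestion (assuming write_util_internal returns a Result as in the surrounding code), the whole match collapses to a single expect:

// Treat a serialization failure as a bug rather than matching on it.
write_util_internal(&chan_mon, true, &mut ser_chan)
	.expect("can not write Channel Monitor for peer storage message");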

@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from 676afbc to 1111f05 on July 16, 2025 20:41
let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();
let random_usize = usize::from_le_bytes(random_bytes[0..core::mem::size_of::<usize>()].try_into().unwrap());
Contributor

nit: Let's keep the length in a separate const, to avoid making this line overly long/noisy.

Contributor Author

Fixed, thanks!


match write_util_internal(&chan_mon, true, &mut ser_chan) {
Ok(_) => {
// Adding size of peer_storage_monitor.
curr_size += ser_chan.0.serialized_length()
Contributor

If we move this check below constructing PeerStorageMonitorHolder, can we just use its serialized_length implementation instead of tallying up the individual fields here?

Contributor Author

My bad, thanks for pointing this out. Fixed.

},
None => {
// TODO: Figure out if this channel is so old that we have forgotten about it.
panic!("Lost a channel {}", &mon_holder.channel_id);
Contributor

Why panic here? This should be the common case, no?

Contributor Author

We need this when we’ve completely forgotten about a channel but it’s still active. The only false positives occur if we’ve closed a channel and then forgotten about it. Any ideas on how we could detect those?

Contributor

We need this when we’ve completely forgotten about a channel but it’s still active.

I don't think this is accurate? Wouldn't we end up here simply in this scenario:

  1. We have a channel, store an encrypted backup with a peer
  2. The peer goes offline
  3. We close the channel, drop it from the per_peer_state
  4. The peer comes back online
  5. We retrieve and decrypt a peer backup with a channel state that we no longer know anything about.
  6. We panic and crash the node

This can happen very easily / is the common case, no? Also note this same flow could intentionally get abused by a malicious peer, too. The counterparty would simply need to guess that we include a certain channel in the stored blob, remember it, wait until the channel is closed, and then simply replay the old backup to crash our node.

I still maintain we should never panic in any of this PeerStorage related code, maybe mod the above case where we're positive that our local state is outdated compared to the backed-up version.

That said, we'll indeed need to discern between the two cases (unexpected data loss / expected data loss). One way to do this would be to first check whether we see a funding transaction spend, although note that this would be an async flow that we can't just directly initiate here. Or we need to maintain a list of all closed channels and remember it forever. The latter might indeed be the simplest and safest way, if we're fine with keeping the additional data. In any case, we need to find a solution here before moving on, as panicking here is unacceptable.

Contributor Author

Yes, I agree, this is a false positive where the node crashes, but it shouldn’t. The happy path here is:

  1. We have a channel with a peer.
  2. Something happens and we lose the channel (but it is unclosed); the node cannot identify it either.
  3. We retrieve the ChannelMonitor from peer storage.
  4. We realise that this is an unknown, old channel, and the node crashes because it no longer has any record of an active channel.

There should be a way to determine whether a channel is closed, so that if we receive an unknown channel and its funding TXO is unspent, we crash the node for recovery. But if the funding is already spent, we shouldn’t worry unless they have published an old state. Right?

Collaborator

Yes, I agree, this is a false positive where the node crashes

We cannot tolerate a false positive if we're going to panic.

There should be a way to determine whether a channel is closed, so that if we receive an unknown channel and its funding TXO is unspent, we crash the node for recovery. But if the funding is already spent, we shouldn’t worry unless they have published an old state. Right?

There is no way to do this in LDK today, and I don't think we'll ever get one. We don't have some global way to query if an arbitrary UTXO is unspent (well, we do in gossip but it's not always available), and once a channel is closed and the funds swept we may delete the ChannelMonitor because there's no reason to keep it around. Keeping around information about every past channel we've ever had just to detect a lost channel here doesn't seem worth it to me.

It does mean we won't detect if we lose all of our state, but I think that's okay - a user will notice that their entire balance is missing :)

Contributor Author

Yes, it should not be there. I put up the TODO so that we can have a discussion around this.
Changing the panic to log_debug.


impl_writeable_tlv_based!(PeerStorageMonitorHolder, {
(0, channel_id, required),
(1, counterparty_node_id, required),
Contributor

For new objects, you should be able to make all fields even-numbered as they are required from the getgo, basically.
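A sketch of the even-only layout being suggested, assuming the field set discussed elsewhere in this PR (channel_id, counterparty_node_id, min_seen_secret, monitor_bytes); the exact TLV variants are illustrative:

impl_writeable_tlv_based!(PeerStorageMonitorHolder, {
	(0, channel_id, required),
	(2, counterparty_node_id, required),
	(4, min_seen_secret, required),
	(6, monitor_bytes, required_vec),
});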

Contributor Author

Fixed.

/// wrapped inside [`PeerStorage`].
///
/// [`PeerStorage`]: crate::ln::msgs::PeerStorage
pub(crate) struct PeerStorageMonitorHolderList {
Contributor

I think we should be able to avoid this new type wrapper if we add an impl_writeable_for_vec entry for PeerStorageMonitorHolder in lightning/src/util/ser.rs.

Contributor

Seems this is still unaddressed?

Contributor Author

Yes, sorry, I will address this in the next fixup.

Contributor Author

Fixed

@adi2011 adi2011 requested a review from TheBlueMatt July 17, 2025 15:12
@ldk-reviews-bot

🔔 1st Reminder

Hey @TheBlueMatt! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

Collaborator

@TheBlueMatt TheBlueMatt left a comment

Looks like you forgot to push after addressing @tnull's last comments? Also needs a rebase.

@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from 6dede0e to c9baefb on July 21, 2025 13:12
@adi2011 adi2011 requested a review from TheBlueMatt July 21, 2025 13:12
Contributor Author

adi2011 commented Jul 21, 2025

Oh sorry, my bad, I have pushed the fixups and rebased.

Collaborator

@TheBlueMatt TheBlueMatt left a comment

I think if we just remove the is_stub check and the spurious panic we can land this (feel free to squash fixups, IMO). Sadly we will need to add a cfg flag before 0.2 (unless we can nail down the serialization format we want), but it doesn't have to hold up this PR or progress.


self.lockdown_from_offchain.write(writer)?;
self.holder_tx_signed.write(writer)?;
if !is_stub {
Collaborator

For now let's not do anything differently here; I don't think skipping the onchain_tx_handler is safe, but we're not quite sure what the right answer is yet. Because we won't (yet) finalize the serialization format, we should also cfg-tag out the actual send logic in send_peer_storage, sadly.

Contributor Author

Should I not use is_stub inside the function, or should I remove the parameter from the function definition?

Contributor Author

Also, should I make a new cfg tag for peer-storage or use an existing one?

Collaborator

Should I not use is_stub inside the function, or should I remove the parameter from the function definition?

I think it's fine to add it with a TODO: figure out which fields should go here and which do not.

Also, should I make a new cfg tag for peer-storage or use an existing one?

I believe a new one (you'll have to add it in ci/ci-tests.sh as well).

Contributor Author

Cool, I will cfg-tag the sending logic. In the next PR we can discuss when to send peer storage along with what to omit from ChannelMonitors.
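For illustration, a minimal sketch of the gating pattern discussed, with a toy free function standing in for the real ChainMonitor::send_peer_storage; the flag would be enabled at build time, e.g. via RUSTFLAGS="--cfg peer_storage":

// Compiled in only when built with `--cfg peer_storage`.
#[cfg(peer_storage)]
fn send_peer_storage(data: &[u8]) {
	// ... serialize the selected ChannelMonitors and queue the message ...
	let _ = data;
}

// No-op stand-in so callers keep compiling when the flag is absent.
#[cfg(not(peer_storage))]
fn send_peer_storage(_data: &[u8]) {}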


@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch 5 times, most recently from 4304853 to e76f538 on July 24, 2025 11:20
@adi2011 adi2011 requested review from tnull and TheBlueMatt July 24, 2025 11:20
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from e76f538 to 75a8fd6 on July 24, 2025 11:24
Collaborator

@TheBlueMatt TheBlueMatt left a comment

I think this is fine now, after the notes here are fixed and rustfmt is fixed.



Comment on lines 834 to 838
while curr_size < MAX_PEER_STORAGE_SIZE
&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
{
let idx = random_usize % monitors.len();
stored_chanmon_idx.insert(idx + 1);
Contributor

The loop condition has a potential issue where it might not terminate if the same random index is repeatedly selected. Since stored_chanmon_idx is a BTreeSet, inserting the same value multiple times doesn't increase its size. If random_usize % monitors.len() keeps generating the same index, the condition *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len() will remain true indefinitely.

Consider modifying the approach to ensure termination, such as:

  1. Using a counter to limit iterations
  2. Tracking already-processed indices in a separate set
  3. Using a deterministic sequence instead of random selection

This would prevent potential infinite loops while still achieving the goal of selecting monitors for peer storage.

Suggested change
while curr_size < MAX_PEER_STORAGE_SIZE
&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
{
let idx = random_usize % monitors.len();
stored_chanmon_idx.insert(idx + 1);
while curr_size < MAX_PEER_STORAGE_SIZE
&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
&& stored_chanmon_idx.len() < monitors.len()
{
let idx = random_usize % monitors.len();
let inserted = stored_chanmon_idx.insert(idx + 1);
if inserted {
curr_size += 1;
} else {
// If we couldn't insert, try a different index next time
random_usize = random_usize.wrapping_add(1);
}


Contributor Author

@tnull, this is the same concern I had in mind as well. Wasn't the previous approach more efficient and secure?

Contributor

How about something along these lines?:

diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index 4a40ba872..dde117ca9 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -39,6 +39,7 @@ use crate::chain::{ChannelMonitorUpdateStatus, Filter, WatchedOutput};
 use crate::events::{self, Event, EventHandler, ReplayEvent};
 use crate::ln::channel_state::ChannelDetails;
 use crate::ln::msgs::{self, BaseMessageHandler, Init, MessageSendEvent, SendOnlyMessageHandler};
+#[cfg(peer_storage)]
 use crate::ln::our_peer_storage::{DecryptedOurPeerStorage, PeerStorageMonitorHolder};
 use crate::ln::types::ChannelId;
 use crate::prelude::*;
@@ -53,6 +54,8 @@ use crate::util::persist::MonitorName;
 use crate::util::ser::{VecWriter, Writeable};
 use crate::util::wakers::{Future, Notifier};
 use bitcoin::secp256k1::PublicKey;
+#[cfg(peer_storage)]
+use core::iter::Cycle;
 use core::ops::Deref;
 use core::sync::atomic::{AtomicUsize, Ordering};
 
@@ -808,6 +811,7 @@ where
 
 	/// This function collects the counterparty node IDs from all monitors into a `HashSet`,
 	/// ensuring unique IDs are returned.
+	#[cfg(peer_storage)]
 	fn all_counterparty_node_ids(&self) -> HashSet<PublicKey> {
 		let mon = self.monitors.read().unwrap();
 		mon.values().map(|monitor| monitor.monitor.get_counterparty_node_id()).collect()
@@ -815,51 +819,71 @@ where
 
 	#[cfg(peer_storage)]
 	fn send_peer_storage(&self, their_node_id: PublicKey) {
-		#[allow(unused_mut)]
 		let mut monitors_list: Vec<PeerStorageMonitorHolder> = Vec::new();
 		let random_bytes = self.entropy_source.get_secure_random_bytes();
 
 		const MAX_PEER_STORAGE_SIZE: usize = 65531;
 		const USIZE_LEN: usize = core::mem::size_of::<usize>();
-		let mut usize_bytes = [0u8; USIZE_LEN];
-		usize_bytes.copy_from_slice(&random_bytes[0..USIZE_LEN]);
-		let random_usize = usize::from_le_bytes(usize_bytes);
+		let mut random_bytes_cycle_iter = random_bytes.iter().cycle();
+
+		let mut current_size = 0;
+		let monitors_lock = self.monitors.read().unwrap();
+		let mut channel_ids = monitors_lock.keys().copied().collect();
+
+		fn next_random_id(
+			channel_ids: &mut Vec<ChannelId>,
+			random_bytes_cycle_iter: &mut Cycle<core::slice::Iter<u8>>,
+		) -> Option<ChannelId> {
+			if channel_ids.is_empty() {
+				return None;
+			}
 
-		let mut curr_size = 0;
-		let monitors = self.monitors.read().unwrap();
-		let mut stored_chanmon_idx = alloc::collections::BTreeSet::<usize>::new();
-		// Used as a fallback reference if the set is empty
-		let zero = 0;
+			let random_idx = {
+				let mut usize_bytes = [0u8; USIZE_LEN];
+				usize_bytes.iter_mut().for_each(|b| {
+					*b = *random_bytes_cycle_iter.next().expect("A cycle never ends")
+				});
+				// Take one more to introduce a slight misalignment.
+				random_bytes_cycle_iter.next().expect("A cycle never ends");
+				usize::from_le_bytes(usize_bytes) % channel_ids.len()
+			};
+
+			Some(channel_ids.swap_remove(random_idx))
+		}
 
-		while curr_size < MAX_PEER_STORAGE_SIZE
-			&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
+		while let Some(channel_id) = next_random_id(&mut channel_ids, &mut random_bytes_cycle_iter)
 		{
-			let idx = random_usize % monitors.len();
-			stored_chanmon_idx.insert(idx + 1);
-			let (cid, mon) = monitors.iter().skip(idx).next().unwrap();
+			let monitor_holder = if let Some(monitor_holder) = monitors_lock.get(&channel_id) {
+				monitor_holder
+			} else {
+				debug_assert!(
+					false,
+					"Tried to access non-existing monitor, this should never happen"
+				);
+				break;
+			};
 
-			let mut ser_chan = VecWriter(Vec::new());
-			let min_seen_secret = mon.monitor.get_min_seen_secret();
-			let counterparty_node_id = mon.monitor.get_counterparty_node_id();
+			let mut serialized_channel = VecWriter(Vec::new());
+			let min_seen_secret = monitor_holder.monitor.get_min_seen_secret();
+			let counterparty_node_id = monitor_holder.monitor.get_counterparty_node_id();
 			{
-				let chan_mon = mon.monitor.inner.lock().unwrap();
+				let inner_lock = monitor_holder.monitor.inner.lock().unwrap();
 
-				write_chanmon_internal(&chan_mon, true, &mut ser_chan)
+				write_chanmon_internal(&inner_lock, true, &mut serialized_channel)
 					.expect("can not write Channel Monitor for peer storage message");
 			}
 			let peer_storage_monitor = PeerStorageMonitorHolder {
-				channel_id: *cid,
+				channel_id,
 				min_seen_secret,
 				counterparty_node_id,
-				monitor_bytes: ser_chan.0,
+				monitor_bytes: serialized_channel.0,
 			};
 
-			// Adding size of peer_storage_monitor.
-			curr_size += peer_storage_monitor.serialized_length();
-
-			if curr_size > MAX_PEER_STORAGE_SIZE {
-				break;
+			current_size += peer_storage_monitor.serialized_length();
+			if current_size > MAX_PEER_STORAGE_SIZE {
+				continue;
 			}
+
 			monitors_list.push(peer_storage_monitor);
 		}

There might still be some room for improvement, but IMO something like this would be a lot more readable.

Contributor Author

+			if current_size > MAX_PEER_STORAGE_SIZE {
+				continue;
 			}

Why not break here?

Contributor

@tnull tnull Aug 1, 2025

Say we have one huge monitor that doesn't fit in the backup and a few smaller ones. If we'd always abort whenever we draw the large monitor, we might unnecessarily skip backups for the smaller ones. Yes, we could get fancier around how/which monitors to include to make the best use out of the space given, but for now it seems fine to just walk the entire list?

Ah, but now looking at it again I spot a bug there, it should be:

let serialized_length = peer_storage_monitor.serialized_length();
if current_size + serialized_length > MAX_PEER_STORAGE_SIZE {
     continue;
 } else {
    current_size += serialized_length;
    monitors_list.push(peer_storage_monitor);
 }

@adi2011 adi2011 requested a review from tnull July 29, 2025 16:17

@@ -809,11 +813,57 @@ where
mon.values().map(|monitor| monitor.monitor.get_counterparty_node_id()).collect()
}

#[cfg(peer_storage)]
Contributor

I think you'll also need to introduce the cfg-gate in a few other places to get rid of the warning, e.g., above for all_counterparty_node_ids and on some imports.

Contributor Author

should I also cfg-gate entropy_source inside ChainMonitor?
If yes, I would have to cfg-gate a lot of things in other files as well...
I think it would be better to allow unused on it, since we are going to remove the peer-storage cfg gate in the next PR...

Contributor

Well, we need to add the cfg-gate wherever we'd otherwise get a warning. I'd prefer to avoid allow_unused, especially as it might easily slip through (i.e., we might not remember to remove it again going forward), while for the cfg guard we will be forced to do the cleanup.

Contributor Author

Okay, I have cfg-gated almost all the relevant places in chainmonitor.rs, but I decided to add PhantomData in ChainMonitor to avoid redeclaring the whole struct or the new function.

Contributor

@tnull tnull Aug 5, 2025

As mentioned elsewhere, it should be fine to just rename the entropy source _entropy_source for now, no need for PhantomData then.

@adi2011 adi2011 requested a review from tnull August 5, 2025 07:42
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch 3 times, most recently from e7dddc9 to 1de28b8 on August 5, 2025 08:31
Contributor

@tnull tnull left a comment

The PhantomData changes unfortunately make CI fail. I think it might be easiest to revert them and just use _entropy_source.

@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from 1de28b8 to b2cbe73 on August 5, 2025 18:23
@adi2011 adi2011 requested a review from tnull August 6, 2025 09:08
Contributor

@tnull tnull left a comment

The fuzz CI is failing due to:

thread 'full_stack::tests::test_no_existing_test_breakage' panicked at 'assertion failed: `(left == right)`
  left: `None`,
 right: `Some(1)`', src/full_stack.rs:1707:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I suspect this is likely related to the cfg-gate?

'PeerStorageMonitorHolder' is used to wrap a single ChannelMonitor; here we add some fields separately so that we do not need to read the whole ChannelMonitor to identify whether we have lost some states.

`PeerStorageMonitorHolderList` is used to keep the list of all the channels which would
be sent over the wire inside Peer Storage.
Fixed formatting for write() in ChannelMonitorImpl. This would make the next
commit cleaner by ensuring it only contains direct code shifts, without
unrelated formatting changes.
Create a utility function to prevent code duplication while writing ChannelMonitors.
Serialise them inside ChainMonitor::send_peer_storage and send them over.

Cfg-tag the sending logic because we are unsure of what to omit from ChannelMonitors stored inside
peer-storage.
Deserialise the ChannelMonitors and compare the data to determine if we have
lost some states.
Node should now determine lost states using retrieved peer storage.
Contributor Author

adi2011 commented Aug 11, 2025

Yes, entropy source is not being called while sending peer storage because we have cfg-gated it. Reverting the values in the fuzz test.

This needs a rebase due to conflict.

@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from 185bc52 to a588961 on August 11, 2025 07:04
@adi2011 adi2011 requested a review from tnull August 11, 2025 07:04
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from a588961 to 3e5d902 on August 11, 2025 07:06
@adi2011 adi2011 requested a review from TheBlueMatt August 11, 2025 07:19
Contributor

tnull commented Aug 11, 2025

Yes, entropy source is not being called while sending peer storage because we have cfg-gated it. Reverting the values in the fuzz test.

Probably best to also switch the values based on the cfg value, so we don't miss them and keep the capability to fuzz with peer storage?

Contributor

@tnull tnull left a comment

Fixups look good to me, feel free to squash. Some minor comments, but generally this seems close.

@@ -1187,7 +1187,7 @@ fn two_peer_forwarding_seed() -> Vec<u8> {
// inbound read from peer id 1 of len 255
ext_from_hex("0301ff", &mut test);
// beginning of accept_channel
ext_from_hex("0021 0000000000000000000000000000000000000000000000000000000000000e12 0000000000000162 00000000004c4b40 00000000000003e8 00000000000003e8 00000002 03f0 0005 030000000000000000000000000000000000000000000000000000000000000100 030000000000000000000000000000000000000000000000000000000000000200 030000000000000000000000000000000000000000000000000000000000000300 030000000000000000000000000000000000000000000000000000000000000400 030000000000000000000000000000000000000000000000000000000000000500 02660000000000000000000000000000", &mut test);
Contributor

@@ -266,7 +275,8 @@ pub struct ChainMonitor<
logger: L,
fee_estimator: F,
persister: P,
entropy_source: ES,

Contributor

nit: Drop spurious newline.

let mut cursor = io::Cursor::new(decrypted);
let mon_list = <Vec<PeerStorageMonitorHolder> as Readable>::read(&mut cursor)
.unwrap_or_else(|e| {
// This should NEVER happen.
Contributor

Should we debug_assert here?
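A sketch of what that could look like, assuming the closure currently just logs and falls back to an empty list; names mirror the surrounding code:

let mon_list = <Vec<PeerStorageMonitorHolder> as Readable>::read(&mut cursor)
	.unwrap_or_else(|e| {
		// Fail loudly on the (supposedly impossible) decode failure in debug
		// builds, but stay non-fatal in release builds.
		debug_assert!(false, "Failed to read PeerStorageMonitorHolder list: {:?}", e);
		log_debug!(logger, "Failed to read PeerStorageMonitorHolder list: {:?}", e);
		Vec::new()
	});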

None => {
log_debug!(
logger,
"Not able to find peer_state for the counterparty {}, channelId {}",
Contributor

nit: channel_id

@@ -17428,38 +17476,62 @@ mod tests {

#[test]
#[rustfmt::skip]
Contributor

Mind adding a commit removing the skip?


create_announced_chan_between_nodes(&nodes, 0, 1);
let (_, _, cid, _) = create_announced_chan_between_nodes(&nodes, 0, 1);
send_payment(&nodes[0], &vec!(&nodes[1])[..], 1000);
Contributor

Please just use &[]. If you pass a slice anyway, there is no reason to allocate a Vec first (here and elsewhere).
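For example, a hypothetical call mirroring the test above:

send_payment(&nodes[0], &[&nodes[1]], 1000);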

fn test_peer_storage() {
let chanmon_cfgs = create_chanmon_cfgs(2);
let (persister, chain_monitor);
Contributor

This line seems incomplete? Why did we add it here?

@@ -17117,38 +17165,62 @@ mod tests {

#[test]
#[rustfmt::skip]
#[cfg(peer_storage)]
#[should_panic(expected = "Lost channel state for channel ae3367da2c13bc1ceb86bf56418f62828f7ce9d6bfb15a46af5ba1f1ed8b124f.\n\
Contributor

The issue is, if I remove the should_panic the test panics during teardown (in the drop), not in the main test logic.

Not sure I'm following, you can still do this, no?:

diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index ec7014b42..261fd80ef 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -17477,9 +17477,6 @@ mod tests {
        #[test]
        #[rustfmt::skip]
        #[cfg(peer_storage)]
-       #[should_panic(expected = "Lost channel state for channel ae3367da2c13bc1ceb86bf56418f62828f7ce9d6bfb15a46af5ba1f1ed8b124f.\n\
-                               Received peer storage with a more recent state than what our node had.\n\
-                               Use the FundRecoverer to initiate a force close and sweep the funds.")]
        fn test_peer_storage() {
                let chanmon_cfgs = create_chanmon_cfgs(2);
                let (persister, chain_monitor);
@@ -17559,13 +17556,18 @@ mod tests {
                                nodes[0].node.handle_channel_reestablish(nodes[1].node.get_our_node_id(), msg);
                                assert_eq!(*node_id, nodes[0].node.get_our_node_id());
                        } else if let MessageSendEvent::SendPeerStorageRetrieval { ref node_id, ref msg } = msg {
-                               // Should Panic here!
-                               nodes[0].node.handle_peer_storage_retrieval(nodes[1].node.get_our_node_id(), msg.clone());
-                               assert_eq!(*node_id, nodes[0].node.get_our_node_id());
+                               let res = std::panic::catch_unwind(||
+                                       nodes[0].node.handle_peer_storage_retrieval(nodes[1].node.get_our_node_id(), msg.clone())
+                               );
+                               assert!(res.is_err());
+                               break;
                        } else {
                                panic!("Unexpected event")
                        }
                }
+               // When we panic'd, we expect to panic on `Drop`.
+               let res = std::panic::catch_unwind(|| drop(nodes));
+               assert!(res.is_err());
        }

        #[test]

(by downcasting the error you should even be able to assert the exact message if you prefer)

Collaborator

@TheBlueMatt TheBlueMatt left a comment

I think this looks good, but I'll wait until @tnull signs off and the fuzzer is fixed.

@TheBlueMatt
Collaborator

Probably best to also switch the values based on the cfg value, so we don't miss them and keep the capability to fuzz with peer storage?

Actually, don't bother, just drop the fuzz changes here, I'm gonna open a PR to make fuzz changes more robust against changes in LDK.
