Fix payment retry races and inform users when a payment fails #1202

TheBlueMatt · 2021-12-05T00:28:14Z

As described in #1164, our payment retry logic is race-y and may retry payments after we should have exhausted our retry count. Worse, when using our InvoicePayer, there is no clear way for a user to detect at all when a payment is fully and irrevocably failed. This PR addresses both with a new event - PaymentFailed which is generated only after all HTLCs for a payment are failed and the payment has timed out or the user marks it as no longer retryable.

valentinewallace

Went through a first pass, mostly just nits!

lightning/src/ln/channelmanager.rs

lightning/src/util/events.rs

valentinewallace · 2021-12-07T20:35:45Z

lightning/src/util/events.rs

 		/// larger MPP payment were still in flight when this event was generated.
+		///
+		/// Note that if you are retrying individual MPP parts using this value to determine if a
+		/// payment has fully failed is race-y. Because multiple failure can happen prior to events


Sadness... Is it off to say that this field seems broken? Not sure if would be better to remove it

Yea, I mean basically - I guess we can deprecate it? If you are doing retries across the full payment and not on a per-path basis then the field is "correct" and works like you'd kinda expect. There isn't really a direct replacement absent this field, though - either you do path-level retries and call no_further_payment_retries or you do payment-level retries and then you need this bool. Obviously we'd ideally prefer users eventually move towards path-level retries, but...?

lightning-invoice/src/payment.rs

jkczyz

Largely looks good. Just need to dig into the new test but will do so after comments are addressed.

lightning/src/ln/functional_test_utils.rs

lightning/src/ln/payment_tests.rs

jkczyz · 2021-12-07T22:21:56Z

lightning/src/ln/channelmanager.rs

+							full_failure_ev = Some(events::Event::PaymentFailed {
+								payment_id,
+								payment_hash: payment.get().payment_hash().expect("RetriesExceeded PendintOutboundPayments always have a payment hash set"),
+							});


I think you can avoid creating this here by conditioning on all_paths_failed when needing to add the event later.

We also have to test that retries_exceeded is set, so I don't think we can after here where we drop the payment object itself.

Ah, I was thinking you could still create the event without the entry since the payment hash is available below. But I guess you would need another variable regardless, so probably not worth it.

lightning/src/ln/channelmanager.rs

lightning/src/util/events.rs

jkczyz · 2021-12-07T22:35:49Z

lightning-invoice/src/payment.rs

+					self.payer.no_further_payment_retries(payment_id.unwrap());
 				} else if retry.is_none() {
 					log_trace!(self.logger, "Payment {} missing retry params; not retrying", log_bytes!(payment_hash.0));
+					self.payer.no_further_payment_retries(payment_id.unwrap());


Could these two cases be handled directly inside ChannelManager such that no_further_payment_retries only needs to be called when retry_payment returns an error?

Hmm, retry.is_none is just a backwards compatibility thing - we'll remove it sooner or later - but I guess in this context no_further_payment_retries is actually a noop, just feels nice to have it for consistency? As for rejected_by_dest, I don't think we want to do that - in the case of someone implementing payment probes users may want to retry other failing paths even if some paths fail with rejected_by_dest. We don't really support that here, I guess, but its totally reasonable for a user to want to have their own InvoicePayer that uses it.

Good point on the probing use case. I guess the change here would break anyone wanting to use it with InvoicePayer's retry logic?

You can't use InvoicePayer with per-path retries and probing anyway (well, mostly cause InvoicePayer's per-path retries are broken entirely), so I don't think anything new broke here?

Not 100% sure I follow. Prior to this change, using the retry logic for the probe use case would let you retry other paths after a successful path probe (i.e., when rejected_by_dest is set). That is because the payment would be in state Retryable when calling ChannelManager::retry_payment. But now it won't work since the payment will be in state RetriesExceeded after ChannelManager::no_further_payment_retries is called for the rejected_by_dest case.

Right, my point was more that, prior to this PR, retries for individual paths were broken in general (ie cause they get retried more after retries are up). For a non-probe payment, it feels "obvious" that a rejected_by_dest failure should cause us to stop the payment - we won't be able to complete the payment anyway, so we should stop. For a probe, indeed, indeed, we want different behavior, but I'm not sure how to expose that here...the new ChannelManager API suggested is focused on informing us either when we get a single success or when all payments have failed after we tell it we're done. What we need for probes the ability to learn when all paths have failed even though we don't want to mark the payment as done until all paths have failed with some condition.

Trying to accomplish this without explicit ChannelManager support for it seems to keep bumping up against #1164 again - we can't know just by looking at the event in front of us if its the last one for a given payment or not. If its tagged rejected_by_dest we can skip retrying and expose it to users, but then we've just pushed the "is the payment complete" detection to our users, where there isn't a clear way to do it. We could try some fuzzy heuristic on the amount of funds which have "successfully failed", but that fails to account for fees in some cases (plus doesn't work across restarts), so I'm kinda hesitant to do it. Maybe if we don't care about the fees issue (since we're just collecting paths) and consider it acceptable that it fails across restarts (cause its just a probe), then we can add some new API in the InvoicePayer to explicitly probe with some path-collection and then generate a new event there that says "collected enough paths, here is a route"? We'll need to be really careful about re-using liquidity twice in that case, though, which may make things even more complicated.

lightning/src/ln/channelmanager.rs

TheBlueMatt · 2021-12-10T00:42:41Z

Reworked the expect_payment_failed macros to dry them up a lot.

jkczyz

Largely looks good outside some naming ideas.

jkczyz · 2021-12-10T21:28:22Z

lightning-invoice/src/payment.rs

+					self.payer.no_further_payment_retries(payment_id.unwrap());
 				} else if retry.is_none() {
 					log_trace!(self.logger, "Payment {} missing retry params; not retrying", log_bytes!(payment_hash.0));
+					self.payer.no_further_payment_retries(payment_id.unwrap());


Not 100% sure I follow. Prior to this change, using the retry logic for the probe use case would let you retry other paths after a successful path probe (i.e., when rejected_by_dest is set). That is because the payment would be in state Retryable when calling ChannelManager::retry_payment. But now it won't work since the payment will be in state RetriesExceeded after ChannelManager::no_further_payment_retries is called for the rejected_by_dest case.

lightning/src/ln/channelmanager.rs

jkczyz · 2021-12-10T21:43:13Z

lightning/src/ln/functional_test_utils.rs

+		$(
+			conditions = conditions.expected_htlc_error_data($expected_error_code, &$expected_error_data);
+		)*


Is the repetition only expected to be zero or one here?

Yes, it always is. IIRC we still support versions of rustc that don't support the ? macro repetition option.

lightning-invoice/src/payment.rs

jkczyz · 2021-12-11T03:32:40Z

lightning-invoice/src/payment.rs

+		let htlc_updates = SendEvent::from_node(&nodes[0]);
+		check_added_monitors!(nodes[0], 1);
+		assert_eq!(htlc_updates.msgs.len(), 1);
+
+		nodes[1].node.handle_update_add_htlc(&nodes[0].node.get_our_node_id(), &htlc_updates.msgs[0]);
+		nodes[1].node.handle_commitment_signed(&nodes[0].node.get_our_node_id(), &htlc_updates.commitment_msg);
+		check_added_monitors!(nodes[1], 1);
+		let (bs_first_raa, bs_first_cs) = get_revoke_commit_msgs!(nodes[1], nodes[0].node.get_our_node_id());
+
+		nodes[0].node.handle_revoke_and_ack(&nodes[1].node.get_our_node_id(), &bs_first_raa);
+		check_added_monitors!(nodes[0], 1);
+		let second_htlc_updates = SendEvent::from_node(&nodes[0]);
+
+		nodes[0].node.handle_commitment_signed(&nodes[1].node.get_our_node_id(), &bs_first_cs);
+		check_added_monitors!(nodes[0], 1);
+		let as_first_raa = get_event_msg!(nodes[0], MessageSendEvent::SendRevokeAndACK, nodes[1].node.get_our_node_id());
+
+		nodes[1].node.handle_update_add_htlc(&nodes[0].node.get_our_node_id(), &second_htlc_updates.msgs[0]);
+		nodes[1].node.handle_commitment_signed(&nodes[0].node.get_our_node_id(), &second_htlc_updates.commitment_msg);
+		check_added_monitors!(nodes[1], 1);
+		let bs_second_raa = get_event_msg!(nodes[1], MessageSendEvent::SendRevokeAndACK, nodes[0].node.get_our_node_id());
+
+		nodes[1].node.handle_revoke_and_ack(&nodes[0].node.get_our_node_id(), &as_first_raa);
+		check_added_monitors!(nodes[1], 1);
+		let bs_fail_update = get_htlc_update_msgs!(nodes[1], nodes[0].node.get_our_node_id());
+
+		nodes[0].node.handle_revoke_and_ack(&nodes[1].node.get_our_node_id(), &bs_second_raa);
+		check_added_monitors!(nodes[0], 1);
+
+		nodes[0].node.handle_update_fail_htlc(&nodes[1].node.get_our_node_id(), &bs_fail_update.update_fail_htlcs[0]);
+		nodes[0].node.handle_commitment_signed(&nodes[1].node.get_our_node_id(), &bs_fail_update.commitment_signed);
+		check_added_monitors!(nodes[0], 1);
+		let (as_second_raa, as_third_cs) = get_revoke_commit_msgs!(nodes[0], nodes[1].node.get_our_node_id());
+
+		nodes[1].node.handle_revoke_and_ack(&nodes[0].node.get_our_node_id(), &as_second_raa);
+		check_added_monitors!(nodes[1], 1);
+		let bs_second_fail_update = get_htlc_update_msgs!(nodes[1], nodes[0].node.get_our_node_id());
+
+		nodes[1].node.handle_commitment_signed(&nodes[0].node.get_our_node_id(), &as_third_cs);
+		check_added_monitors!(nodes[1], 1);
+		let bs_third_raa = get_event_msg!(nodes[1], MessageSendEvent::SendRevokeAndACK, nodes[0].node.get_our_node_id());
+
+		nodes[0].node.handle_update_fail_htlc(&nodes[1].node.get_our_node_id(), &bs_second_fail_update.update_fail_htlcs[0]);
+		nodes[0].node.handle_commitment_signed(&nodes[1].node.get_our_node_id(), &bs_second_fail_update.commitment_signed);
+		check_added_monitors!(nodes[0], 1);
+
+		nodes[0].node.handle_revoke_and_ack(&nodes[1].node.get_our_node_id(), &bs_third_raa);
+		check_added_monitors!(nodes[0], 1);
+		let (as_third_raa, as_fourth_cs) = get_revoke_commit_msgs!(nodes[0], nodes[1].node.get_our_node_id());
+
+		nodes[1].node.handle_revoke_and_ack(&nodes[0].node.get_our_node_id(), &as_third_raa);
+		check_added_monitors!(nodes[1], 1);
+		nodes[1].node.handle_commitment_signed(&nodes[0].node.get_our_node_id(), &as_fourth_cs);
+		check_added_monitors!(nodes[1], 1);
+		let bs_fourth_raa = get_event_msg!(nodes[1], MessageSendEvent::SendRevokeAndACK, nodes[0].node.get_our_node_id());
+
+		nodes[0].node.handle_revoke_and_ack(&nodes[1].node.get_our_node_id(), &bs_fourth_raa);
+		check_added_monitors!(nodes[0], 1);


Is there any way to shorten this up with existing macros like commitment_signed_dance? Or is this too much of a special case?

Not with the existing macros, sadly, no. I've thought about a macro that is just "exchange messages until everything is settled", but we don't have anything like that currently.

lightning/src/ln/channelmanager.rs

jkczyz · 2021-12-11T04:10:50Z

lightning/src/ln/channelmanager.rs

+				return Err(()),
+			PendingOutboundPayment::Retryable { session_privs, payment_hash, .. } |
+			PendingOutboundPayment::RetriesExceeded { session_privs, payment_hash, .. } => {
+				our_payment_hash = *payment_hash;


Consider setting outside using self.payment_hash() like in mark_fulfilled to avoid a side-effect in the match.

I didn't bother cause it implies unwrapping the payment_hash when we don't need to. If you feel strongly I can change it.

Ah, didn't notice that difference. Don't feel too strongly. Just felt a little inconsistent when I saw it.

lightning/src/ln/channelmanager.rs

codecov-commenter · 2021-12-12T00:13:12Z

Codecov Report

Merging #1202 (05d7a33) into main (3ec529d) will decrease coverage by 0.04%.
The diff coverage is 85.47%.

@@            Coverage Diff             @@
##             main    #1202      +/-   ##
==========================================
- Coverage   90.58%   90.53%   -0.05%     
==========================================
  Files          69       69              
  Lines       37485    37724     +239     
==========================================
+ Hits        33954    34154     +200     
- Misses       3531     3570      +39

Impacted Files	Coverage Δ
lightning/src/ln/monitor_tests.rs	`100.00% <ø> (ø)`
lightning/src/ln/reorg_tests.rs	`100.00% <ø> (ø)`
lightning/src/ln/shutdown_tests.rs	`96.11% <ø> (ø)`
lightning/src/util/events.rs	`32.37% <14.28%> (-1.34%)`	⬇️
lightning/src/ln/functional_tests.rs	`97.28% <76.19%> (-0.16%)`	⬇️
lightning/src/ln/channelmanager.rs	`83.81% <80.76%> (-0.07%)`	⬇️
lightning-invoice/src/payment.rs	`92.69% <95.20%> (+0.30%)`	⬆️
lightning/src/ln/functional_test_utils.rs	`95.28% <96.96%> (+0.05%)`	⬆️
lightning-invoice/src/utils.rs	`83.92% <100.00%> (+0.44%)`	⬆️
lightning/src/ln/payment_tests.rs	`99.10% <100.00%> (ø)`
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ec529d...05d7a33. Read the comment docs.

lightning/src/ln/channelmanager.rs

jkczyz · 2021-12-13T15:45:18Z

lightning/src/ln/channelmanager.rs

+				return Err(()),
+			PendingOutboundPayment::Retryable { session_privs, payment_hash, .. } |
+			PendingOutboundPayment::RetriesExceeded { session_privs, payment_hash, .. } => {
+				our_payment_hash = *payment_hash;


Ah, didn't notice that difference. Don't feel too strongly. Just felt a little inconsistent when I saw it.

lightning/src/ln/channelmanager.rs

valentinewallace

This is looking good. I think I'm ACK after the abandon/discontinue/abort/wtv rename

lightning/src/ln/channelmanager.rs

lightning-invoice/src/payment.rs

lightning/src/ln/channelmanager.rs

lightning/src/ln/functional_test_utils.rs

... with a more extensible expectation-checking framework for them.

When a payer gives up trying to retry a payment, they don't know for sure what the current state of the event queue is. Specifically, they cannot be sure that there are not multiple additional `PaymentPathFailed` or even `PaymentSuccess` events pending which they will see later. Thus, they have a very hard time identifying whether a payment has truly failed (and informing the UI of that fact) or if it is still pending. See [1] for more information. In order to avoid this mess, we will resolve it here by having the payer give `ChannelManager` a bit more information - when they have given up on a payment - and using that to generate a `PaymentFailed` event when all paths have failed. This commit adds the neccessary storage and changes for the new state inside `ChannelManager` and a public method to mark a payment as failed, the next few commits will add the new `Event` and use the new features in our `PaymentRetrier`. [1] lightningdevkit#1164

When a payment fails, a payer needs to know when they can consider a payment as fully-failed, and when only some of the HTLCs in the payment have failed. This isn't possible with the current event scheme, as discovered recently and as described in the previous commit. This adds a new event which describes when a payment is fully and irrevocably failed, generating it only after the payment has expired or been marked as expired with `ChannelManager::mark_retries_exceeded` *and* all HTLCs for it have failed. With this, a payer can more simply deduce when a payment has failed and use that to remove payment state or finalize a payment failure.

This finally fixes the bug described in the previous commits where we retry a payment after its retry count has expired due to early removal of the payment from the retry count tracking map. A test is also added which demonstrates the bug in previous versions and which passes now. Fixes lightningdevkit#1164.

This is symmetric with the new failure once a payment is abandoned.

TheBlueMatt · 2021-12-15T03:59:35Z

Squashed without diff. Diff since Jeff's ACK was only docs and a trivial test change. Will land after CI.

$ git diff-tree -U1 403af0d 05d7a33a
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index 9b9dd536..bd3b3683 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -2358,6 +2358,8 @@ impl<Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref> ChannelMana
 	/// [`send_payment`] documentation for more details on errors. This method will also error if the
-	/// retry amount puts the payment more than 10% over the payment's total amount, or if the payment
-	/// for the given `payment_id` cannot be found (likely due to timeout or success).
+	/// retry amount puts the payment more than 10% over the payment's total amount, if the payment
+	/// for the given `payment_id` cannot be found (likely due to timeout or success), or if
+	/// further retries have been disabled with [`abandon_payment`].
 	///
 	/// [`send_payment`]: [`ChannelManager::send_payment`]
+	/// [`abandon_payment`]: [`ChannelManager::abandon_payment`]
 	pub fn retry_payment(&self, route: &Route, payment_id: PaymentId) -> Result<(), PaymentSendFailure> {
diff --git a/lightning/src/ln/functional_test_utils.rs b/lightning/src/ln/functional_test_utils.rs
index f83c676b..8ff793ed 100644
--- a/lightning/src/ln/functional_test_utils.rs
+++ b/lightning/src/ln/functional_test_utils.rs
@@ -1621,4 +1621,3 @@ pub fn fail_payment_along_route<'a, 'b, 'c>(origin_node: &Node<'a, 'b, 'c>, expe
 			assert_eq!(events.len(), 1);
-			let our_payment_id;
-			match events[0] {
+			let expected_payment_id = match events[0] {
 				Event::PaymentPathFailed { payment_hash, rejected_by_dest, all_paths_failed, ref path, ref payment_id, .. } => {
@@ -1630,8 +1629,8 @@ pub fn fail_payment_along_route<'a, 'b, 'c>(origin_node: &Node<'a, 'b, 'c>, expe
 					}
-					our_payment_id = payment_id.unwrap();
+					payment_id.unwrap()
 				},
 				_ => panic!("Unexpected event"),
-			}
+			};
 			if i == expected_paths.len() - 1 {
-				origin_node.node.abandon_payment(our_payment_id);
+				origin_node.node.abandon_payment(expected_payment_id);
 				let events = origin_node.node.get_and_clear_pending_events();
@@ -1641,3 +1640,3 @@ pub fn fail_payment_along_route<'a, 'b, 'c>(origin_node: &Node<'a, 'b, 'c>, expe
 						assert_eq!(*payment_hash, our_payment_hash, "unexpected second payment_hash");
-						assert_eq!(*payment_id, our_payment_id);
+						assert_eq!(*payment_id, expected_payment_id);
 					}
$ git diff-tree -U1 1b1c6d4a 05d7a33a
$

TheBlueMatt added the Seeking Code Review label Dec 5, 2021

TheBlueMatt added this to the 0.0.104 milestone Dec 5, 2021

TheBlueMatt force-pushed the 2021-12-fix-retries-races branch 2 times, most recently from 101e5b2 to c9f1f33 Compare December 5, 2021 03:11

ConorOkus assigned TheBlueMatt Dec 7, 2021

jkczyz self-requested a review December 7, 2021 15:28

TheBlueMatt removed their assignment Dec 7, 2021

TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from c9f1f33 to bfb3134 Compare December 7, 2021 17:32

valentinewallace reviewed Dec 7, 2021

View reviewed changes

TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from bfb3134 to bd41bb5 Compare December 7, 2021 22:09

jkczyz reviewed Dec 7, 2021

View reviewed changes

TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from bd41bb5 to 2edfb41 Compare December 9, 2021 19:23

TheBlueMatt removed the Seeking Code Review label Dec 9, 2021

TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from 2edfb41 to 2a5119a Compare December 10, 2021 00:42

TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from 2a5119a to 034efd5 Compare December 10, 2021 00:44

TheBlueMatt assigned jkczyz and valentinewallace Dec 10, 2021

jkczyz reviewed Dec 11, 2021

View reviewed changes

TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from 034efd5 to 5ef81f6 Compare December 12, 2021 00:13

jkczyz reviewed Dec 13, 2021

View reviewed changes

valentinewallace reviewed Dec 13, 2021

View reviewed changes

lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved

TheBlueMatt force-pushed the 2021-12-fix-retries-races branch 2 times, most recently from 6a70a07 to d3985f5 Compare December 14, 2021 00:32

jkczyz reviewed Dec 14, 2021

View reviewed changes

lightning-invoice/src/payment.rs Outdated Show resolved Hide resolved

TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from d3985f5 to 403af0d Compare December 14, 2021 17:29

jkczyz approved these changes Dec 14, 2021

View reviewed changes

lightning/src/ln/channelmanager.rs Show resolved Hide resolved

lightning/src/ln/functional_test_utils.rs Outdated Show resolved Hide resolved

TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from 403af0d to 1b1c6d4 Compare December 14, 2021 20:30

valentinewallace approved these changes Dec 15, 2021

View reviewed changes

TheBlueMatt added 5 commits December 15, 2021 03:55

DRY up payment failure macros in functional_test_utils

8c9615e

... with a more extensible expectation-checking framework for them.

Make attempting to retry a succeeded payment an APIError, not Route

05d7a33

This is symmetric with the new failure once a payment is abandoned.

TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from 1b1c6d4 to 05d7a33 Compare December 15, 2021 03:57

TheBlueMatt merged commit 54114c9 into lightningdevkit:main Dec 15, 2021

Fix payment retry races and inform users when a payment fails #1202

Fix payment retry races and inform users when a payment fails #1202

Uh oh!

Conversation

TheBlueMatt commented Dec 5, 2021

Uh oh!

valentinewallace left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jkczyz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

TheBlueMatt commented Dec 10, 2021

Uh oh!

jkczyz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov-commenter commented Dec 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Choose a reason for hiding this comment

codecov-commenter commented Dec 12, 2021 •

edited

Loading