Skip to content

Conversation

@TheBlueMatt
Copy link
Collaborator

As described in #1164, our payment retry logic is race-y and may retry payments after we should have exhausted our retry count. Worse, when using our InvoicePayer, there is no clear way for a user to detect at all when a payment is fully and irrevocably failed. This PR addresses both with a new event - PaymentFailed which is generated only after all HTLCs for a payment are failed and the payment has timed out or the user marks it as no longer retryable.

@TheBlueMatt TheBlueMatt added this to the 0.0.104 milestone Dec 5, 2021
@TheBlueMatt TheBlueMatt force-pushed the 2021-12-fix-retries-races branch 2 times, most recently from 101e5b2 to c9f1f33 Compare December 5, 2021 03:11
@jkczyz jkczyz self-requested a review December 7, 2021 15:28
@TheBlueMatt TheBlueMatt removed their assignment Dec 7, 2021
@TheBlueMatt TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from c9f1f33 to bfb3134 Compare December 7, 2021 17:32
Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went through a first pass, mostly just nits!

/// larger MPP payment were still in flight when this event was generated.
///
/// Note that if you are retrying individual MPP parts using this value to determine if a
/// payment has fully failed is race-y. Because multiple failure can happen prior to events
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadness... Is it off to say that this field seems broken? Not sure if would be better to remove it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I mean basically - I guess we can deprecate it? If you are doing retries across the full payment and not on a per-path basis then the field is "correct" and works like you'd kinda expect. There isn't really a direct replacement absent this field, though - either you do path-level retries and call no_further_payment_retries or you do payment-level retries and then you need this bool. Obviously we'd ideally prefer users eventually move towards path-level retries, but...?

@TheBlueMatt TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from bfb3134 to bd41bb5 Compare December 7, 2021 22:09
Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Largely looks good. Just need to dig into the new test but will do so after comments are addressed.

Comment on lines 3331 to 3341
full_failure_ev = Some(events::Event::PaymentFailed {
payment_id,
payment_hash: payment.get().payment_hash().expect("RetriesExceeded PendintOutboundPayments always have a payment hash set"),
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can avoid creating this here by conditioning on all_paths_failed when needing to add the event later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also have to test that retries_exceeded is set, so I don't think we can after here where we drop the payment object itself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was thinking you could still create the event without the entry since the payment hash is available below. But I guess you would need another variable regardless, so probably not worth it.

Comment on lines 480 to 483
self.payer.no_further_payment_retries(payment_id.unwrap());
} else if retry.is_none() {
log_trace!(self.logger, "Payment {} missing retry params; not retrying", log_bytes!(payment_hash.0));
self.payer.no_further_payment_retries(payment_id.unwrap());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these two cases be handled directly inside ChannelManager such that no_further_payment_retries only needs to be called when retry_payment returns an error?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, retry.is_none is just a backwards compatibility thing - we'll remove it sooner or later - but I guess in this context no_further_payment_retries is actually a noop, just feels nice to have it for consistency? As for rejected_by_dest, I don't think we want to do that - in the case of someone implementing payment probes users may want to retry other failing paths even if some paths fail with rejected_by_dest. We don't really support that here, I guess, but its totally reasonable for a user to want to have their own InvoicePayer that uses it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on the probing use case. I guess the change here would break anyone wanting to use it with InvoicePayer's retry logic?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't use InvoicePayer with per-path retries and probing anyway (well, mostly cause InvoicePayer's per-path retries are broken entirely), so I don't think anything new broke here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% sure I follow. Prior to this change, using the retry logic for the probe use case would let you retry other paths after a successful path probe (i.e., when rejected_by_dest is set). That is because the payment would be in state Retryable when calling ChannelManager::retry_payment. But now it won't work since the payment will be in state RetriesExceeded after ChannelManager::no_further_payment_retries is called for the rejected_by_dest case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, my point was more that, prior to this PR, retries for individual paths were broken in general (ie cause they get retried more after retries are up). For a non-probe payment, it feels "obvious" that a rejected_by_dest failure should cause us to stop the payment - we won't be able to complete the payment anyway, so we should stop. For a probe, indeed, indeed, we want different behavior, but I'm not sure how to expose that here...the new ChannelManager API suggested is focused on informing us either when we get a single success or when all payments have failed after we tell it we're done. What we need for probes the ability to learn when all paths have failed even though we don't want to mark the payment as done until all paths have failed with some condition.

Trying to accomplish this without explicit ChannelManager support for it seems to keep bumping up against #1164 again - we can't know just by looking at the event in front of us if its the last one for a given payment or not. If its tagged rejected_by_dest we can skip retrying and expose it to users, but then we've just pushed the "is the payment complete" detection to our users, where there isn't a clear way to do it. We could try some fuzzy heuristic on the amount of funds which have "successfully failed", but that fails to account for fees in some cases (plus doesn't work across restarts), so I'm kinda hesitant to do it. Maybe if we don't care about the fees issue (since we're just collecting paths) and consider it acceptable that it fails across restarts (cause its just a probe), then we can add some new API in the InvoicePayer to explicitly probe with some path-collection and then generate a new event there that says "collected enough paths, here is a route"? We'll need to be really careful about re-using liquidity twice in that case, though, which may make things even more complicated.

@TheBlueMatt
Copy link
Collaborator Author

Reworked the expect_payment_failed macros to dry them up a lot.

Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Largely looks good outside some naming ideas.

Comment on lines 480 to 483
self.payer.no_further_payment_retries(payment_id.unwrap());
} else if retry.is_none() {
log_trace!(self.logger, "Payment {} missing retry params; not retrying", log_bytes!(payment_hash.0));
self.payer.no_further_payment_retries(payment_id.unwrap());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% sure I follow. Prior to this change, using the retry logic for the probe use case would let you retry other paths after a successful path probe (i.e., when rejected_by_dest is set). That is because the payment would be in state Retryable when calling ChannelManager::retry_payment. But now it won't work since the payment will be in state RetriesExceeded after ChannelManager::no_further_payment_retries is called for the rejected_by_dest case.

Comment on lines +1250 to +1259
$(
conditions = conditions.expected_htlc_error_data($expected_error_code, &$expected_error_data);
)*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the repetition only expected to be zero or one here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it always is. IIRC we still support versions of rustc that don't support the ? macro repetition option.

Comment on lines +1659 to +1716
let htlc_updates = SendEvent::from_node(&nodes[0]);
check_added_monitors!(nodes[0], 1);
assert_eq!(htlc_updates.msgs.len(), 1);

nodes[1].node.handle_update_add_htlc(&nodes[0].node.get_our_node_id(), &htlc_updates.msgs[0]);
nodes[1].node.handle_commitment_signed(&nodes[0].node.get_our_node_id(), &htlc_updates.commitment_msg);
check_added_monitors!(nodes[1], 1);
let (bs_first_raa, bs_first_cs) = get_revoke_commit_msgs!(nodes[1], nodes[0].node.get_our_node_id());

nodes[0].node.handle_revoke_and_ack(&nodes[1].node.get_our_node_id(), &bs_first_raa);
check_added_monitors!(nodes[0], 1);
let second_htlc_updates = SendEvent::from_node(&nodes[0]);

nodes[0].node.handle_commitment_signed(&nodes[1].node.get_our_node_id(), &bs_first_cs);
check_added_monitors!(nodes[0], 1);
let as_first_raa = get_event_msg!(nodes[0], MessageSendEvent::SendRevokeAndACK, nodes[1].node.get_our_node_id());

nodes[1].node.handle_update_add_htlc(&nodes[0].node.get_our_node_id(), &second_htlc_updates.msgs[0]);
nodes[1].node.handle_commitment_signed(&nodes[0].node.get_our_node_id(), &second_htlc_updates.commitment_msg);
check_added_monitors!(nodes[1], 1);
let bs_second_raa = get_event_msg!(nodes[1], MessageSendEvent::SendRevokeAndACK, nodes[0].node.get_our_node_id());

nodes[1].node.handle_revoke_and_ack(&nodes[0].node.get_our_node_id(), &as_first_raa);
check_added_monitors!(nodes[1], 1);
let bs_fail_update = get_htlc_update_msgs!(nodes[1], nodes[0].node.get_our_node_id());

nodes[0].node.handle_revoke_and_ack(&nodes[1].node.get_our_node_id(), &bs_second_raa);
check_added_monitors!(nodes[0], 1);

nodes[0].node.handle_update_fail_htlc(&nodes[1].node.get_our_node_id(), &bs_fail_update.update_fail_htlcs[0]);
nodes[0].node.handle_commitment_signed(&nodes[1].node.get_our_node_id(), &bs_fail_update.commitment_signed);
check_added_monitors!(nodes[0], 1);
let (as_second_raa, as_third_cs) = get_revoke_commit_msgs!(nodes[0], nodes[1].node.get_our_node_id());

nodes[1].node.handle_revoke_and_ack(&nodes[0].node.get_our_node_id(), &as_second_raa);
check_added_monitors!(nodes[1], 1);
let bs_second_fail_update = get_htlc_update_msgs!(nodes[1], nodes[0].node.get_our_node_id());

nodes[1].node.handle_commitment_signed(&nodes[0].node.get_our_node_id(), &as_third_cs);
check_added_monitors!(nodes[1], 1);
let bs_third_raa = get_event_msg!(nodes[1], MessageSendEvent::SendRevokeAndACK, nodes[0].node.get_our_node_id());

nodes[0].node.handle_update_fail_htlc(&nodes[1].node.get_our_node_id(), &bs_second_fail_update.update_fail_htlcs[0]);
nodes[0].node.handle_commitment_signed(&nodes[1].node.get_our_node_id(), &bs_second_fail_update.commitment_signed);
check_added_monitors!(nodes[0], 1);

nodes[0].node.handle_revoke_and_ack(&nodes[1].node.get_our_node_id(), &bs_third_raa);
check_added_monitors!(nodes[0], 1);
let (as_third_raa, as_fourth_cs) = get_revoke_commit_msgs!(nodes[0], nodes[1].node.get_our_node_id());

nodes[1].node.handle_revoke_and_ack(&nodes[0].node.get_our_node_id(), &as_third_raa);
check_added_monitors!(nodes[1], 1);
nodes[1].node.handle_commitment_signed(&nodes[0].node.get_our_node_id(), &as_fourth_cs);
check_added_monitors!(nodes[1], 1);
let bs_fourth_raa = get_event_msg!(nodes[1], MessageSendEvent::SendRevokeAndACK, nodes[0].node.get_our_node_id());

nodes[0].node.handle_revoke_and_ack(&nodes[1].node.get_our_node_id(), &bs_fourth_raa);
check_added_monitors!(nodes[0], 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to shorten this up with existing macros like commitment_signed_dance? Or is this too much of a special case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not with the existing macros, sadly, no. I've thought about a macro that is just "exchange messages until everything is settled", but we don't have anything like that currently.

return Err(()),
PendingOutboundPayment::Retryable { session_privs, payment_hash, .. } |
PendingOutboundPayment::RetriesExceeded { session_privs, payment_hash, .. } => {
our_payment_hash = *payment_hash;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider setting outside using self.payment_hash() like in mark_fulfilled to avoid a side-effect in the match.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't bother cause it implies unwrapping the payment_hash when we don't need to. If you feel strongly I can change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't notice that difference. Don't feel too strongly. Just felt a little inconsistent when I saw it.

@TheBlueMatt TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from 034efd5 to 5ef81f6 Compare December 12, 2021 00:13
@codecov-commenter
Copy link

codecov-commenter commented Dec 12, 2021

Codecov Report

Merging #1202 (05d7a33) into main (3ec529d) will decrease coverage by 0.04%.
The diff coverage is 85.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1202      +/-   ##
==========================================
- Coverage   90.58%   90.53%   -0.05%     
==========================================
  Files          69       69              
  Lines       37485    37724     +239     
==========================================
+ Hits        33954    34154     +200     
- Misses       3531     3570      +39     
Impacted Files Coverage Δ
lightning/src/ln/monitor_tests.rs 100.00% <ø> (ø)
lightning/src/ln/reorg_tests.rs 100.00% <ø> (ø)
lightning/src/ln/shutdown_tests.rs 96.11% <ø> (ø)
lightning/src/util/events.rs 32.37% <14.28%> (-1.34%) ⬇️
lightning/src/ln/functional_tests.rs 97.28% <76.19%> (-0.16%) ⬇️
lightning/src/ln/channelmanager.rs 83.81% <80.76%> (-0.07%) ⬇️
lightning-invoice/src/payment.rs 92.69% <95.20%> (+0.30%) ⬆️
lightning/src/ln/functional_test_utils.rs 95.28% <96.96%> (+0.05%) ⬆️
lightning-invoice/src/utils.rs 83.92% <100.00%> (+0.44%) ⬆️
lightning/src/ln/payment_tests.rs 99.10% <100.00%> (ø)
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ec529d...05d7a33. Read the comment docs.

return Err(()),
PendingOutboundPayment::Retryable { session_privs, payment_hash, .. } |
PendingOutboundPayment::RetriesExceeded { session_privs, payment_hash, .. } => {
our_payment_hash = *payment_hash;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't notice that difference. Don't feel too strongly. Just felt a little inconsistent when I saw it.

Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good. I think I'm ACK after the abandon/discontinue/abort/wtv rename

@TheBlueMatt TheBlueMatt force-pushed the 2021-12-fix-retries-races branch 2 times, most recently from 6a70a07 to d3985f5 Compare December 14, 2021 00:32
@TheBlueMatt TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from d3985f5 to 403af0d Compare December 14, 2021 17:29
@TheBlueMatt TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from 403af0d to 1b1c6d4 Compare December 14, 2021 20:30
... with a more extensible expectation-checking framework for them.
When a payer gives up trying to retry a payment, they don't know
for sure what the current state of the event queue is.
Specifically, they cannot be sure that there are not multiple
additional `PaymentPathFailed` or even `PaymentSuccess` events
pending which they will see later. Thus, they have a very hard
time identifying whether a payment has truly failed (and informing
the UI of that fact) or if it is still pending. See [1] for more
information.

In order to avoid this mess, we will resolve it here by having the
payer give `ChannelManager` a bit more information - when they
have given up on a payment - and using that to generate a
`PaymentFailed` event when all paths have failed.

This commit adds the neccessary storage and changes for the new
state inside `ChannelManager` and a public method to mark a payment
as failed, the next few commits will add the new `Event` and use
the new features in our `PaymentRetrier`.

[1] lightningdevkit#1164
When a payment fails, a payer needs to know when they can consider
a payment as fully-failed, and when only some of the HTLCs in the
payment have failed. This isn't possible with the current event
scheme, as discovered recently and as described in the previous
commit.

This adds a new event which describes when a payment is fully and
irrevocably failed, generating it only after the payment has
expired or been marked as expired with
`ChannelManager::mark_retries_exceeded` *and* all HTLCs for it
have failed. With this, a payer can more simply deduce when a
payment has failed and use that to remove payment state or
finalize a payment failure.
This finally fixes the bug described in the previous commits where
we retry a payment after its retry count has expired due to early
removal of the payment from the retry count tracking map. A test is
also added which demonstrates the bug in previous versions and
which passes now.

Fixes lightningdevkit#1164.
This is symmetric with the new failure once a payment is abandoned.
@TheBlueMatt TheBlueMatt force-pushed the 2021-12-fix-retries-races branch from 1b1c6d4 to 05d7a33 Compare December 15, 2021 03:57
@TheBlueMatt
Copy link
Collaborator Author

Squashed without diff. Diff since Jeff's ACK was only docs and a trivial test change. Will land after CI.

$ git diff-tree -U1 403af0d 05d7a33a
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index 9b9dd536..bd3b3683 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -2358,6 +2358,8 @@ impl<Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref> ChannelMana
 	/// [`send_payment`] documentation for more details on errors. This method will also error if the
-	/// retry amount puts the payment more than 10% over the payment's total amount, or if the payment
-	/// for the given `payment_id` cannot be found (likely due to timeout or success).
+	/// retry amount puts the payment more than 10% over the payment's total amount, if the payment
+	/// for the given `payment_id` cannot be found (likely due to timeout or success), or if
+	/// further retries have been disabled with [`abandon_payment`].
 	///
 	/// [`send_payment`]: [`ChannelManager::send_payment`]
+	/// [`abandon_payment`]: [`ChannelManager::abandon_payment`]
 	pub fn retry_payment(&self, route: &Route, payment_id: PaymentId) -> Result<(), PaymentSendFailure> {
diff --git a/lightning/src/ln/functional_test_utils.rs b/lightning/src/ln/functional_test_utils.rs
index f83c676b..8ff793ed 100644
--- a/lightning/src/ln/functional_test_utils.rs
+++ b/lightning/src/ln/functional_test_utils.rs
@@ -1621,4 +1621,3 @@ pub fn fail_payment_along_route<'a, 'b, 'c>(origin_node: &Node<'a, 'b, 'c>, expe
 			assert_eq!(events.len(), 1);
-			let our_payment_id;
-			match events[0] {
+			let expected_payment_id = match events[0] {
 				Event::PaymentPathFailed { payment_hash, rejected_by_dest, all_paths_failed, ref path, ref payment_id, .. } => {
@@ -1630,8 +1629,8 @@ pub fn fail_payment_along_route<'a, 'b, 'c>(origin_node: &Node<'a, 'b, 'c>, expe
 					}
-					our_payment_id = payment_id.unwrap();
+					payment_id.unwrap()
 				},
 				_ => panic!("Unexpected event"),
-			}
+			};
 			if i == expected_paths.len() - 1 {
-				origin_node.node.abandon_payment(our_payment_id);
+				origin_node.node.abandon_payment(expected_payment_id);
 				let events = origin_node.node.get_and_clear_pending_events();
@@ -1641,3 +1640,3 @@ pub fn fail_payment_along_route<'a, 'b, 'c>(origin_node: &Node<'a, 'b, 'c>, expe
 						assert_eq!(*payment_hash, our_payment_hash, "unexpected second payment_hash");
-						assert_eq!(*payment_id, our_payment_id);
+						assert_eq!(*payment_id, expected_payment_id);
 					}
$ git diff-tree -U1 1b1c6d4a 05d7a33a
$

@TheBlueMatt TheBlueMatt merged commit 54114c9 into lightningdevkit:main Dec 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants