Fix payment retry races and inform users when a payment fails #1202
Conversation
Force-pushed 101e5b2 to c9f1f33.
Force-pushed c9f1f33 to bfb3134.
valentinewallace left a comment:
Went through a first pass, mostly just nits!
lightning/src/util/events.rs (outdated)
/// larger MPP payment were still in flight when this event was generated.
///
/// Note that if you are retrying individual MPP parts, using this value to determine if a
/// payment has fully failed is race-y. Because multiple failures can happen prior to events
Sadness... Is it off to say that this field seems broken? Not sure if it would be better to remove it.
Yea, I mean basically - I guess we can deprecate it? If you are doing retries across the full payment and not on a per-path basis, then the field is "correct" and works like you'd kinda expect. There isn't really a direct replacement absent this field, though - either you do path-level retries and call no_further_payment_retries, or you do payment-level retries and then you need this bool. Obviously we'd ideally prefer users eventually move towards path-level retries, but...?
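(For readers following along: a minimal sketch of the two strategies being contrasted here, using hypothetical stand-in types and helpers rather than the real LDK API.)

#[derive(Clone, Copy)]
struct PaymentId([u8; 32]);
struct RouteHop;

fn retry_full_payment(_id: PaymentId) { /* re-dispatch the whole payment */ }
fn retry_single_path(_id: PaymentId, _path: &[RouteHop]) { /* re-route one part */ }
fn no_further_payment_retries(_id: PaymentId) { /* tell ChannelManager we're done retrying */ }

// Payment-level strategy: rely on all_paths_failed and retry the whole
// payment. Only sound if individual parts are never retried, because earlier
// failures may still be queued behind the event being processed.
fn on_path_failed_payment_level(id: PaymentId, all_paths_failed: bool) {
	if all_paths_failed {
		retry_full_payment(id);
	} // else: more failure events are coming; wait for them.
}

// Path-level strategy: retry each failing part, and when giving up tell
// ChannelManager so it can surface one final failure once all HTLCs resolve.
fn on_path_failed_path_level(id: PaymentId, path: &[RouteHop], retries_left: &mut u32) {
	if *retries_left > 0 {
		*retries_left -= 1;
		retry_single_path(id, path);
	} else {
		no_further_payment_retries(id);
	}
}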
Force-pushed bfb3134 to bd41bb5.
jkczyz left a comment:
Largely looks good. Just need to dig into the new test but will do so after comments are addressed.
full_failure_ev = Some(events::Event::PaymentFailed {
	payment_id,
	payment_hash: payment.get().payment_hash().expect("RetriesExceeded PendingOutboundPayments always have a payment hash set"),
});
I think you can avoid creating this here by conditioning on all_paths_failed when needing to add the event later.
We also have to test that retries_exceeded is set, so I don't think we can do that after this point, where we drop the payment object itself.
Ah, I was thinking you could still create the event without the entry since the payment hash is available below. But I guess you would need another variable regardless, so probably not worth it.
lightning-invoice/src/payment.rs (outdated)
	self.payer.no_further_payment_retries(payment_id.unwrap());
} else if retry.is_none() {
	log_trace!(self.logger, "Payment {} missing retry params; not retrying", log_bytes!(payment_hash.0));
	self.payer.no_further_payment_retries(payment_id.unwrap());
Could these two cases be handled directly inside ChannelManager such that no_further_payment_retries only needs to be called when retry_payment returns an error?
Hmm, retry.is_none is just a backwards-compatibility thing - we'll remove it sooner or later - but I guess in this context no_further_payment_retries is actually a no-op; it just feels nice to have it for consistency? As for rejected_by_dest, I don't think we want to do that - in the case of someone implementing payment probes, users may want to retry other failing paths even if some paths fail with rejected_by_dest. We don't really support that here, I guess, but it's totally reasonable for a user to want to have their own InvoicePayer that uses it.
Good point on the probing use case. I guess the change here would break anyone wanting to use it with InvoicePayer's retry logic?
You can't use InvoicePayer with per-path retries and probing anyway (well, mostly cause InvoicePayer's per-path retries are broken entirely), so I don't think anything new broke here?
Not 100% sure I follow. Prior to this change, using the retry logic for the probe use case would let you retry other paths after a successful path probe (i.e., when rejected_by_dest is set). That is because the payment would be in state Retryable when calling ChannelManager::retry_payment. But now it won't work since the payment will be in state RetriesExceeded after ChannelManager::no_further_payment_retries is called for the rejected_by_dest case.
Right, my point was more that, prior to this PR, retries for individual paths were broken in general (i.e., paths get retried even after retries are up). For a non-probe payment, it feels "obvious" that a rejected_by_dest failure should cause us to stop the payment - we won't be able to complete the payment anyway, so we should stop. For a probe, indeed, we want different behavior, but I'm not sure how to expose that here... the new ChannelManager API suggested is focused on informing us either when we get a single success or when all paths have failed after we tell it we're done. What we need for probes is the ability to learn when all paths have failed even though we don't want to mark the payment as done until all paths have failed with some condition.
Trying to accomplish this without explicit ChannelManager support for it seems to keep bumping up against #1164 again - we can't know just by looking at the event in front of us whether it's the last one for a given payment or not. If it's tagged rejected_by_dest we can skip retrying and expose it to users, but then we've just pushed the "is the payment complete" detection onto our users, where there isn't a clear way to do it. We could try some fuzzy heuristic on the amount of funds which have "successfully failed", but that fails to account for fees in some cases (plus doesn't work across restarts), so I'm kinda hesitant to do it. Maybe if we don't care about the fees issue (since we're just collecting paths) and consider it acceptable that it fails across restarts (because it's just a probe), then we can add some new API in the InvoicePayer to explicitly probe with some path-collection and then generate a new event there that says "collected enough paths, here is a route"? We'll need to be really careful about re-using liquidity twice in that case, though, which may make things even more complicated.
Force-pushed bd41bb5 to 2edfb41.
Force-pushed 2edfb41 to 2a5119a.
Reworked the …
Force-pushed 2a5119a to 034efd5.
jkczyz left a comment:
Largely looks good outside some naming ideas.
$(
	conditions = conditions.expected_htlc_error_data($expected_error_code, &$expected_error_data);
)*
Is the repetition only expected to be zero or one here?
Yes, it always is. IIRC we still support versions of rustc that don't support the ? macro repetition option.
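(Illustration, not the test macro itself: $(...)* accepts zero or more repetitions, so it can stand in for the zero-or-one $(...)? operator, which only stabilized in rustc 1.32.)

macro_rules! expect_failure {
	($desc:expr $(, $code:expr, $data:expr)*) => {{
		let mut conditions = vec![$desc.to_string()];
		$(
			// In practice this arm is expanded at most once.
			conditions.push(format!("code={} data={:?}", $code, $data));
		)*
		conditions
	}};
}

fn main() {
	assert_eq!(expect_failure!("timeout").len(), 1);
	assert_eq!(expect_failure!("fail", 0x4000, &[0u8; 2]).len(), 2);
}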
let htlc_updates = SendEvent::from_node(&nodes[0]);
check_added_monitors!(nodes[0], 1);
assert_eq!(htlc_updates.msgs.len(), 1);

nodes[1].node.handle_update_add_htlc(&nodes[0].node.get_our_node_id(), &htlc_updates.msgs[0]);
nodes[1].node.handle_commitment_signed(&nodes[0].node.get_our_node_id(), &htlc_updates.commitment_msg);
check_added_monitors!(nodes[1], 1);
let (bs_first_raa, bs_first_cs) = get_revoke_commit_msgs!(nodes[1], nodes[0].node.get_our_node_id());

nodes[0].node.handle_revoke_and_ack(&nodes[1].node.get_our_node_id(), &bs_first_raa);
check_added_monitors!(nodes[0], 1);
let second_htlc_updates = SendEvent::from_node(&nodes[0]);

nodes[0].node.handle_commitment_signed(&nodes[1].node.get_our_node_id(), &bs_first_cs);
check_added_monitors!(nodes[0], 1);
let as_first_raa = get_event_msg!(nodes[0], MessageSendEvent::SendRevokeAndACK, nodes[1].node.get_our_node_id());

nodes[1].node.handle_update_add_htlc(&nodes[0].node.get_our_node_id(), &second_htlc_updates.msgs[0]);
nodes[1].node.handle_commitment_signed(&nodes[0].node.get_our_node_id(), &second_htlc_updates.commitment_msg);
check_added_monitors!(nodes[1], 1);
let bs_second_raa = get_event_msg!(nodes[1], MessageSendEvent::SendRevokeAndACK, nodes[0].node.get_our_node_id());

nodes[1].node.handle_revoke_and_ack(&nodes[0].node.get_our_node_id(), &as_first_raa);
check_added_monitors!(nodes[1], 1);
let bs_fail_update = get_htlc_update_msgs!(nodes[1], nodes[0].node.get_our_node_id());

nodes[0].node.handle_revoke_and_ack(&nodes[1].node.get_our_node_id(), &bs_second_raa);
check_added_monitors!(nodes[0], 1);

nodes[0].node.handle_update_fail_htlc(&nodes[1].node.get_our_node_id(), &bs_fail_update.update_fail_htlcs[0]);
nodes[0].node.handle_commitment_signed(&nodes[1].node.get_our_node_id(), &bs_fail_update.commitment_signed);
check_added_monitors!(nodes[0], 1);
let (as_second_raa, as_third_cs) = get_revoke_commit_msgs!(nodes[0], nodes[1].node.get_our_node_id());

nodes[1].node.handle_revoke_and_ack(&nodes[0].node.get_our_node_id(), &as_second_raa);
check_added_monitors!(nodes[1], 1);
let bs_second_fail_update = get_htlc_update_msgs!(nodes[1], nodes[0].node.get_our_node_id());

nodes[1].node.handle_commitment_signed(&nodes[0].node.get_our_node_id(), &as_third_cs);
check_added_monitors!(nodes[1], 1);
let bs_third_raa = get_event_msg!(nodes[1], MessageSendEvent::SendRevokeAndACK, nodes[0].node.get_our_node_id());

nodes[0].node.handle_update_fail_htlc(&nodes[1].node.get_our_node_id(), &bs_second_fail_update.update_fail_htlcs[0]);
nodes[0].node.handle_commitment_signed(&nodes[1].node.get_our_node_id(), &bs_second_fail_update.commitment_signed);
check_added_monitors!(nodes[0], 1);

nodes[0].node.handle_revoke_and_ack(&nodes[1].node.get_our_node_id(), &bs_third_raa);
check_added_monitors!(nodes[0], 1);
let (as_third_raa, as_fourth_cs) = get_revoke_commit_msgs!(nodes[0], nodes[1].node.get_our_node_id());

nodes[1].node.handle_revoke_and_ack(&nodes[0].node.get_our_node_id(), &as_third_raa);
check_added_monitors!(nodes[1], 1);
nodes[1].node.handle_commitment_signed(&nodes[0].node.get_our_node_id(), &as_fourth_cs);
check_added_monitors!(nodes[1], 1);
let bs_fourth_raa = get_event_msg!(nodes[1], MessageSendEvent::SendRevokeAndACK, nodes[0].node.get_our_node_id());

nodes[0].node.handle_revoke_and_ack(&nodes[1].node.get_our_node_id(), &bs_fourth_raa);
check_added_monitors!(nodes[0], 1);
Is there any way to shorten this up with existing macros like commitment_signed_dance? Or is this too much of a special case?
Not with the existing macros, sadly, no. I've thought about a macro that is just "exchange messages until everything is settled", but we don't have anything like that currently.
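(Sketch of what such a helper might look like - nothing like this exists in the test utils today; pending_msgs and handle are hypothetical stand-ins for the functional-test plumbing.)

struct Msg;
struct Node;
impl Node {
	fn pending_msgs(&self) -> Vec<Msg> { Vec::new() }
	fn handle(&self, _msg: Msg) {}
}

// Keep delivering messages in both directions until neither node has
// anything left to send, at which point the channel state is settled.
fn settle(a: &Node, b: &Node) {
	loop {
		let (a_msgs, b_msgs) = (a.pending_msgs(), b.pending_msgs());
		if a_msgs.is_empty() && b_msgs.is_empty() { break; }
		for m in a_msgs { b.handle(m); }
		for m in b_msgs { a.handle(m); }
	}
}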
	return Err(()),
PendingOutboundPayment::Retryable { session_privs, payment_hash, .. } |
PendingOutboundPayment::RetriesExceeded { session_privs, payment_hash, .. } => {
	our_payment_hash = *payment_hash;
Consider setting outside using self.payment_hash() like in mark_fulfilled to avoid a side-effect in the match.
I didn't bother cause it implies unwrapping the payment_hash when we don't need to. If you feel strongly I can change it.
Ah, didn't notice that difference. Don't feel too strongly. Just felt a little inconsistent when I saw it.
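(For context, a sketch of the two styles being discussed, with hypothetical stand-ins for the real enum: an accessor returns Option and forces an unwrap, while binding inside the match gets the hash from the same variants that decide the result.)

#[derive(Clone, Copy)]
struct PaymentHash([u8; 32]);

enum PendingOutboundPayment {
	Legacy,
	Retryable { payment_hash: PaymentHash },
	RetriesExceeded { payment_hash: PaymentHash },
}

impl PendingOutboundPayment {
	// Accessor style (as in mark_fulfilled): Option because legacy payments
	// may lack a hash, hence the unwrap the author wants to avoid.
	fn payment_hash(&self) -> Option<PaymentHash> {
		match self {
			Self::Legacy => None,
			Self::Retryable { payment_hash } | Self::RetriesExceeded { payment_hash } =>
				Some(*payment_hash),
		}
	}
}

// Bind-in-match style from the diff above: no unwrap needed, since the
// variants that carry a hash are exactly the ones we accept.
fn mark_failed(payment: &PendingOutboundPayment) -> Result<PaymentHash, ()> {
	match payment {
		PendingOutboundPayment::Legacy => Err(()),
		PendingOutboundPayment::Retryable { payment_hash }
		| PendingOutboundPayment::RetriesExceeded { payment_hash } => Ok(*payment_hash),
	}
}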
Force-pushed 034efd5 to 5ef81f6.
Codecov Report
@@            Coverage Diff             @@
##             main    #1202      +/-   ##
==========================================
-  Coverage   90.58%   90.53%   -0.05%
==========================================
   Files          69       69
   Lines       37485    37724     +239
==========================================
+  Hits        33954    34154     +200
-  Misses       3531     3570      +39
Continue to review full report at Codecov.
valentinewallace left a comment:
This is looking good. I think I'm ACK after the abandon/discontinue/abort/wtv rename
Force-pushed 6a70a07 to d3985f5.
Force-pushed d3985f5 to 403af0d.
Force-pushed 403af0d to 1b1c6d4.
... with a more extensible expectation-checking framework for them.
When a payer gives up trying to retry a payment, they don't know for sure what the current state of the event queue is. Specifically, they cannot be sure that there are not multiple additional `PaymentPathFailed` or even `PaymentSuccess` events pending which they will see later. Thus, they have a very hard time identifying whether a payment has truly failed (and informing the UI of that fact) or if it is still pending. See [1] for more information. In order to avoid this mess, we resolve it here by having the payer give `ChannelManager` a bit more information - when they have given up on a payment - and using that to generate a `PaymentFailed` event when all paths have failed. This commit adds the necessary storage and changes for the new state inside `ChannelManager`, plus a public method to mark a payment as failed; the next few commits will add the new `Event` and use the new features in our `PaymentRetrier`. [1] lightningdevkit#1164
When a payment fails, a payer needs to know when they can consider a payment as fully-failed, and when only some of the HTLCs in the payment have failed. This isn't possible with the current event scheme, as discovered recently and as described in the previous commit. This adds a new event which describes when a payment is fully and irrevocably failed, generating it only after the payment has expired or been marked as expired with `ChannelManager::mark_retries_exceeded` *and* all HTLCs for it have failed. With this, a payer can more simply deduce when a payment has failed and use that to remove payment state or finalize a payment failure.
This finally fixes the bug described in the previous commits where we retry a payment after its retry count has expired due to early removal of the payment from the retry count tracking map. A test is also added which demonstrates the bug in previous versions and which passes now. Fixes lightningdevkit#1164.
This is symmetric with the new failure once a payment is abandoned.
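(Rough shape of the lifecycle these commits describe; the variant names follow the code quoted in this review, and the real enum carries session keys, the payment hash, and more.)

enum PendingOutboundPayment {
	// HTLCs may still be retried by the payer via retry_payment.
	Retryable,
	// The payer has given up (no_further_payment_retries / abandon_payment
	// after the rename); retry_payment now errors. Once every remaining HTLC
	// has failed, ChannelManager generates Event::PaymentFailed and the
	// payment state can be dropped.
	RetriesExceeded,
}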
Force-pushed 1b1c6d4 to 05d7a33.
Squashed without diff. Diff since Jeff's ACK was only docs and a trivial test change. Will land after CI.
$ git diff-tree -U1 403af0d 05d7a33a
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index 9b9dd536..bd3b3683 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -2358,6 +2358,8 @@ impl<Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref> ChannelMana
/// [`send_payment`] documentation for more details on errors. This method will also error if the
- /// retry amount puts the payment more than 10% over the payment's total amount, or if the payment
- /// for the given `payment_id` cannot be found (likely due to timeout or success).
+ /// retry amount puts the payment more than 10% over the payment's total amount, if the payment
+ /// for the given `payment_id` cannot be found (likely due to timeout or success), or if
+ /// further retries have been disabled with [`abandon_payment`].
///
/// [`send_payment`]: [`ChannelManager::send_payment`]
+ /// [`abandon_payment`]: [`ChannelManager::abandon_payment`]
pub fn retry_payment(&self, route: &Route, payment_id: PaymentId) -> Result<(), PaymentSendFailure> {
diff --git a/lightning/src/ln/functional_test_utils.rs b/lightning/src/ln/functional_test_utils.rs
index f83c676b..8ff793ed 100644
--- a/lightning/src/ln/functional_test_utils.rs
+++ b/lightning/src/ln/functional_test_utils.rs
@@ -1621,4 +1621,3 @@ pub fn fail_payment_along_route<'a, 'b, 'c>(origin_node: &Node<'a, 'b, 'c>, expe
assert_eq!(events.len(), 1);
- let our_payment_id;
- match events[0] {
+ let expected_payment_id = match events[0] {
Event::PaymentPathFailed { payment_hash, rejected_by_dest, all_paths_failed, ref path, ref payment_id, .. } => {
@@ -1630,8 +1629,8 @@ pub fn fail_payment_along_route<'a, 'b, 'c>(origin_node: &Node<'a, 'b, 'c>, expe
}
- our_payment_id = payment_id.unwrap();
+ payment_id.unwrap()
},
_ => panic!("Unexpected event"),
- }
+ };
if i == expected_paths.len() - 1 {
- origin_node.node.abandon_payment(our_payment_id);
+ origin_node.node.abandon_payment(expected_payment_id);
let events = origin_node.node.get_and_clear_pending_events();
@@ -1641,3 +1640,3 @@ pub fn fail_payment_along_route<'a, 'b, 'c>(origin_node: &Node<'a, 'b, 'c>, expe
assert_eq!(*payment_hash, our_payment_hash, "unexpected second payment_hash");
- assert_eq!(*payment_id, our_payment_id);
+ assert_eq!(*payment_id, expected_payment_id);
}
$ git diff-tree -U1 1b1c6d4a 05d7a33a
$
As described in #1164, our payment retry logic is race-y and may retry payments after we should have exhausted our retry count. Worse, when using our InvoicePayer, there is no clear way for a user to detect at all when a payment has fully and irrevocably failed. This PR addresses both with a new event - PaymentFailed - which is generated only after all HTLCs for a payment have failed and the payment has timed out or the user has marked it as no longer retryable.
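(A minimal sketch of the resulting payer flow; the types below are hypothetical stand-ins mirroring the API shapes in this PR, and retries_remaining / remove_payment_state are made-up helpers.)

#[derive(Clone, Copy)]
struct PaymentId([u8; 32]);
#[derive(Clone, Copy)]
struct PaymentHash([u8; 32]);

enum Event {
	PaymentPathFailed { payment_id: Option<PaymentId>, payment_hash: PaymentHash },
	PaymentFailed { payment_id: PaymentId, payment_hash: PaymentHash },
}

struct Manager;
impl Manager {
	// Errors if retries were disabled via abandon_payment, mirroring the
	// doc change in the squash diff above.
	fn retry_payment(&self, _id: PaymentId) -> Result<(), ()> { Ok(()) }
	fn abandon_payment(&self, _id: PaymentId) {}
}

fn retries_remaining(_id: PaymentId) -> u32 { 0 }
fn remove_payment_state(_id: PaymentId, _hash: PaymentHash) {}

fn handle_event(manager: &Manager, event: Event) {
	match event {
		Event::PaymentPathFailed { payment_id, .. } => {
			let id = payment_id.expect("payments we sent carry an id");
			if retries_remaining(id) > 0 {
				let _ = manager.retry_payment(id);
			} else {
				// Give up; PaymentFailed arrives only after every in-flight
				// HTLC for this payment has irrevocably failed.
				manager.abandon_payment(id);
			}
		},
		Event::PaymentFailed { payment_id, payment_hash } => {
			// The single, final failure signal: safe to clear state and
			// surface the failure to the user.
			remove_payment_state(payment_id, payment_hash);
		},
	}
}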