
Conversation

@enxebre
Member

@enxebre enxebre commented Feb 28, 2022

What this PR does / why we need it:
There's a UX gap in signaling permanent machine failures in MachineSets and MachineDeployments.
This PR solves that by bubbling up permanent Machine failures (i.e. FailureMessage/FailureReason) as a MachineSet and MachineDeployment condition: MachinesSucceededCondition.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5635

Needs #6025 for the conditions to be set in updateStatus
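For illustration, a minimal sketch of how such a condition could surface in a MachineSet status once an owned Machine reports a permanent failure (field values, reason, and the machine name below are hypothetical, not taken from this PR's implementation):

```
status:
  conditions:
  - type: MachinesSucceeded            # proposed condition introduced by this PR
    status: "False"
    severity: Error                    # assuming the existing Error severity is reused
    reason: MachineHasFailure          # hypothetical reason string
    message: >-
      Machine md-0-abc12-xyz45 reports failureReason "InvalidConfiguration",
      failureMessage "instance type m99.huge does not exist"
    lastTransitionTime: "2022-02-28T00:00:00Z"
```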

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 28, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from enxebre after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enxebre
Member Author

enxebre commented Feb 28, 2022

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 28, 2022
@enxebre
Member Author

enxebre commented Feb 28, 2022

PTAL @vincepri @fabriziopandini @sbueringer @CecileRobertMichon @Arvinderpal
If we are happy with the approach I'll try to get the other PR merged, then polish this one.

There's a UX gap in signaling permanent machine failures in MachineSets and MachineDeployments.
This PR solves that by bubbling up permanent Machine failures (i.e. FailureMessage/FailureReason) as a MachineSet and MachineDeployment condition: MachinesSucceededCondition.
@enxebre enxebre force-pushed the bubbleup-machine-failures branch from 6571428 to ad8ea38 on February 28, 2022 21:55
@fabriziopandini
Member

fabriziopandini commented Mar 2, 2022

@enxebre first of all thanks for giving this problem a go; users can really benefit from it if we can get to an agreement.

I'm personally in favor of simplifying the user experience as much as possible, and thus:

  • Use conditions to represent the current state and deprecate/remove failureMessage and failureReason
  • Merge the evidence from failureMessage and failureReason into existing conditions, under the assumption that the failure happens during a part of the process that is already surfaced by an existing condition

e.g. if in CAPA there is a permanent failure while provisioning the VPC, it should surface in the VPCReady condition, with status false and severity error.

I think that having a separate condition could be really confusing for the user, because in the example above we would have MachinesSucceeded reporting a fatal error provisioning the VPC, while VPCReady reports something else.

However, unfortunately there is no agreement on this topic yet (see #3692), if I got it right mostly because of how we should treat these conditions as permanent.

A possible way out might be to introduce a new severity level called "fatal", and make sure that util/conditions treats it as immutable/highest priority...
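To make that concrete, a hypothetical sketch of a permanent CAPA failure folded into the existing VPCReady condition, using the suggested "fatal" severity (not an existing API value today; only Error/Warning/Info exist; the reason and message are made up):

```
conditions:
- type: VPCReady
  status: "False"
  severity: Fatal                  # proposed new, immutable/highest-priority severity
  reason: VpcProvisioningFailed    # hypothetical reason
  message: "unrecoverable error while provisioning the VPC: VPC limit exceeded"
```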

@enxebre
Member Author

enxebre commented Mar 2, 2022

Use conditions to represent the current state and deprecate/remove failureMessage and failureReason
Merge the evidence from failureMessage and failureReason into existing conditions, under the assumption that the failure happens during a part of the process that is already surfaced by an existing condition
e.g. if in CAPA there is a permanent failure while provisioning the VPC, it should surface in the VPCReady condition, with status false and severity error.
I think that having a separate condition could be really confusing for the user, because in the example above we would have MachinesSucceeded reporting a fatal error provisioning the VPC, while VPCReady reports something else.

I'm not seeing how it'd be feasible to have a different Machine condition for every single permanent failure in every single provider, nor how to signal that in MS/MD without a single, provider-agnostic condition for permanent failures.
The common factor is that the failure is permanent, so having a common semantic to express that, i.e. failureMessage/Reason, seems reasonable to me.
Now, to bubble permanent failures up to MachineSet/MachineDeployment we need a single known source of truth (failureMessage/Reason) and a single provider-agnostic condition in MS/MD to signal it, hence the introduction of MachinesSucceededCondition (I didn't choose MachinesFailureCondition, to align with the positive nuance of all the other conditions). Agree?

So my proposal would be:

  • Keep failureMessage/Reason in infraMachines (see the example below).
  • Introduce a new condition for MS/MD to bubble up and signal permanent failures, i.e. MachinesSucceededCondition.
  • Now, if we want to revisit moving failureMessage/Reason to conditions in infraMachines, we can do that transparently at any time, as long as we satisfy the contract with MS/MD and the failures are bubbled up through MachinesSucceededCondition.

Thoughts?
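To illustrate the first bullet, a minimal sketch of the source of truth on an infraMachine as it exists today (an AWSMachine is used purely as an example; the reason and message values are made up):

```
status:
  ready: false
  failureReason: UnsupportedChange
  failureMessage: "instance i-0abc123 was terminated out of band and cannot be recovered"
```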

@enxebre
Member Author

enxebre commented Mar 2, 2022

We discussed in the community meeting on Wed 2 Mar 2022 that using particular ConditionSeverities would be preferred over perpetuating failureMessage in infraMachines.

The workflow could look as follows:

  • The Machine controller bubbles up conditions with severity x from infraMachines into Machines.
  • The upper-level controller (MS) can then look up these conditions by severity and set a condition "c" with SetAggregate logic over all owned Machines (see the sketch below).
  • The upper-level controller (MD) can bubble up "c" from MachineSets.
  • failureMessage in infraMachines is deprecated, while providers are given time to satisfy the severity contract.
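A rough sketch of how that chain could look, assuming a provider surfaces a permanent failure through an existing condition with severity Error (the reasons, machine/instance names, and the aggregate condition name are illustrative only):

```
# On the Machine, bubbled up from the infraMachine by the Machine controller:
conditions:
- type: InfrastructureReady
  status: "False"
  severity: Error                  # the severity "x" the MS controller keys on
  reason: InstanceTerminated       # hypothetical provider reason
  message: "instance i-0abc123 was terminated out of band"

# On the owning MachineSet, the aggregated condition "c" over all owned Machines:
conditions:
- type: MachinesSucceeded          # hypothetical aggregate condition name
  status: "False"
  severity: Error
  reason: InstanceTerminated
  message: "Machine md-0-abc12-xyz45: instance i-0abc123 was terminated out of band"
```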

@fabriziopandini @yastij does this align with your thoughts or were you thinking something different?

@sbueringer
Member

A question for my understanding

Machine controller bubbles up conditions with severity x from infraMachines into Machines.

Does this mean we take the InfraMachine condition and use it 1:1 in the Machine, i.e. we would e.g. get an AWS InstanceReady condition into the conditions array of the Machine and then bubble the same condition up 1:1 to the MachineSet etc.?

Or alternatively, we look at the conditions from the InfraMachine and aggregate them into a single condition in the Machine and bubble that directly up / or aggregate it further when bubbling up.

Can you maybe give an example of how a specific failure condition of a specific provider would be bubbled up / reflected in the Machine/MachineSet/MachineDeployment?

Apologies if that was already discussed and I just missed it. I'm just having a hard time figuring out how exactly this would work.

@fabriziopandini
Member

@enxebre @sbueringer I have written some notes about how terminal failures should work at the end of https://docs.google.com/document/d/1hBQnWWa5d16FOslNhDwYVOhcMjLIul4tMeUgh4maI3w/edit#

Happy to chat about it and to rally to make it an amendment to the condition proposal.
WRT the current PR/issue, I think the ask to have failureReason/failureMessage bubble up Machine failures is fair (while those fields exist, they should behave properly). However, I would decouple this fix from the work described in the doc, which is more long-term.

@jdef
Contributor

jdef commented May 6, 2022

this looks stalled. how can we move this forward?

@enxebre
Member Author

enxebre commented May 9, 2022

this looks stalled. how can we move this forward?
Let's try to find consensus on the doc and move forward with #6025 first

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 7, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 6, 2022
@jdef
Contributor

jdef commented Sep 6, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 6, 2022
enxebre added a commit to enxebre/hypershift that referenced this pull request Nov 30, 2022
At the moment, if the infra for a Machine or the software for a Node fails to operate, it's completely invisible to NodePool API consumers.
We need to solve aggregation upstream:
kubernetes-sigs/cluster-api#6218
kubernetes-sigs/cluster-api#6025

This PR is a stopgap to mitigate the above and facilitate reactions and decisions for NodePool consumers.

Examples
In progress:
```
- lastTransitionTime: "2022-11-30T09:03:08Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-n5lpf: InstanceProvisionStarted
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: InstanceProvisionStarted
  observedGeneration: 6
  reason: InstanceProvisionStarted
  status: "False"
  type: AllMachinesReady
- lastTransitionTime: "2022-11-30T09:03:08Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-n5lpf: WaitingForNodeRef
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: WaitingForNodeRef
  observedGeneration: 6
  reason: WaitingForNodeRef
  status: "False"
  type: AllNodesHealthy
```

```
- lastTransitionTime: "2022-11-30T09:03:27Z"
  message: All is well
  observedGeneration: 6
  reason: AsExpected
  status: "True"
  type: AllMachinesReady
- lastTransitionTime: "2022-11-30T09:03:08Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-n5lpf: NodeProvisioning
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: NodeProvisioning
  observedGeneration: 6
  reason: NodeProvisioning
  status: "False"
  type: AllNodesHealthy
```

Failure - Instance terminated out of band
```
- lastTransitionTime: "2022-11-30T09:10:59Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: InstanceTerminated
  observedGeneration: 6
  reason: InstanceTerminated
  status: "False"
  type: AllMachinesReady
- lastTransitionTime: "2022-11-30T09:10:59Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: NodeConditionsFailed
  observedGeneration: 6
  reason: NodeConditionsFailed
  status: "False"
  type: AllNodesHealthy
```
enxebre added a commit to enxebre/hypershift that referenced this pull request Dec 1, 2022
enxebre added a commit to enxebre/hypershift that referenced this pull request Dec 1, 2022
enxebre added a commit to enxebre/hypershift that referenced this pull request Dec 2, 2022
enxebre added a commit to enxebre/hypershift that referenced this pull request Dec 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 5, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 4, 2023
Member

@vincepri vincepri left a comment

Closing this due to inactivity

/close

@k8s-ci-robot
Contributor

@vincepri: Closed this PR.

In response to this:

Closing this due to inactivity

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Machine failures not propagated up to MachineDeployment.Status

7 participants