
Conversation

@enxebre
Member

@enxebre enxebre commented Feb 28, 2022

What this PR does / why we need it:
There's a UX gap in signaling permanent machine failures in MachineSets and MachineDeployments.
This PR solves that by bubbling up permanent Machine failures (i.e. FailureMessage/FailureReason) as a MachineSet and MachineDeployment condition: MachinesSucceededCondition.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5635

Needs #6025 for the conditions to be set in updateStatus
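For illustration, a minimal sketch of how such a condition could surface in a MachineSet status once an owned Machine reports a permanent failure (field values, reason, and the machine name below are hypothetical, not taken from this PR's implementation):

```
status:
  conditions:
  - type: MachinesSucceeded            # proposed condition introduced by this PR
    status: "False"
    severity: Error                    # assuming the existing Error severity is reused
    reason: MachineHasFailure          # hypothetical reason string
    message: >-
      Machine md-0-abc12-xyz45 reports failureReason "InvalidConfiguration",
      failureMessage "instance type m99.huge does not exist"
    lastTransitionTime: "2022-02-28T00:00:00Z"
```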

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 28, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from enxebre after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enxebre
Member Author

enxebre commented Feb 28, 2022

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 28, 2022
@enxebre
Member Author

enxebre commented Feb 28, 2022

PTAL @vincepri @fabriziopandini @sbueringer @CecileRobertMichon @Arvinderpal
If we are happy with the approach I'll try to get the other PR merged, then polish this one.

There's a UX gap in signaling permanent machine failures in MachineSets and MachineDeployments.
This PR solves that by bubbling up permanent Machine failures (i.e. FailureMessage/FailureReason) as a MachineSet and MachineDeployment condition: MachinesSucceededCondition.
@enxebre enxebre force-pushed the bubbleup-machine-failures branch from 6571428 to ad8ea38 on February 28, 2022 21:55
@fabriziopandini
Member

fabriziopandini commented Mar 2, 2022

@enxebre first of all thanks for giving this problem a go; users can really benefit from it if we can get to an agreement.

I'm personally in favor of simplifying the user experience as much as possible, and thus:

  • Use conditions to represent the current state and deprecate/remove failureMessage and failureReason
  • Merge the evidence from failureMessage and failureReason into existing conditions, under the assumption that the failure happens during a part of the process that is already surfaced by an existing condition

e.g. if in CAPA there is a permanent failure while provisioning the VPC, it should surface in the VPCReady condition, with status false and severity error.

I think that having a separate condition could be really confusing for the user, because in the example above we would have MachinesSucceeded reporting a fatal error provisioning the VPC, while VPCReady reports something else.

However, unfortunately there is no agreement on this topic yet (see #3692), if I got it right mostly because of how we should treat these conditions as permanent.

A possible way out might be to introduce a new severity level called "fatal", and make sure that util/conditions treats it as immutable/highest priority...
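To make that concrete, a hypothetical sketch of a permanent CAPA failure folded into the existing VPCReady condition, using the suggested "fatal" severity (not an existing API value today; only Error/Warning/Info exist; the reason and message are made up):

```
conditions:
- type: VPCReady
  status: "False"
  severity: Fatal                  # proposed new, immutable/highest-priority severity
  reason: VpcProvisioningFailed    # hypothetical reason
  message: "unrecoverable error while provisioning the VPC: VPC limit exceeded"
```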

@enxebre
Member Author

enxebre commented Mar 2, 2022

Use conditions to represent the current state and deprecate/remove failureMessage and failureReason
Merge the evidence from failureMessage and failureReason into existing conditions, under the assumption that the failure happens during a part of the process that is already surfaced by an existing condition
e.g. if in CAPA there is a permanent failure while provisioning the VPC, it should surface in the VPCReady condition, with status false and severity error.
I think that having a separate condition could be really confusing for the user, because in the example above we would have MachinesSucceeded reporting a fatal error provisioning the VPC, while VPCReady reports something else.

I'm not seeing how it'd be feasible to have a different Machine condition for every single permanent failure in every single provider, nor how to signal that in MS/MD without a single, provider-agnostic condition for permanent failures.
The common factor is that the failure is permanent, so having a common semantic to express that, i.e. failureMessage/Reason, seems reasonable to me.
Now, to bubble permanent failures up to MachineSet/MachineDeployment we need a single known source of truth (failureMessage/Reason) and a single provider-agnostic condition in MS/MD to signal it, hence the introduction of MachinesSucceededCondition (I didn't choose MachinesFailureCondition, to align with the positive nuance of all the other conditions). Agree?

So my proposal would be:

  • Keep failureMessage/Reason in infraMachines (see the example below).
  • Introduce a new condition for MS/MD to bubble up and signal permanent failures, i.e. MachinesSucceededCondition.
  • Now, if we want to revisit moving failureMessage/Reason to conditions in infraMachines, we can do that transparently at any time, as long as we satisfy the contract with MS/MD and the failures are bubbled up through MachinesSucceededCondition.

Thoughts?
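To illustrate the first bullet, a minimal sketch of the source of truth on an infraMachine as it exists today (an AWSMachine is used purely as an example; the reason and message values are made up):

```
status:
  ready: false
  failureReason: UnsupportedChange
  failureMessage: "instance i-0abc123 was terminated out of band and cannot be recovered"
```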

@enxebre
Member Author

enxebre commented Mar 2, 2022

We discussed in the community meeting on Wed 2 Mar 2022 that using particular ConditionSeverities would be preferred over perpetuating failureMessage in infraMachines.

The workflow could look as follows:

  • The Machine controller bubbles up conditions with severity x from infraMachines into Machines.
  • The upper-level controller (MS) can then look up these conditions by severity and set a condition "c" with SetAggregate logic over all owned Machines (see the sketch below).
  • The upper-level controller (MD) can bubble up "c" from MachineSets.
  • failureMessage in infraMachines is deprecated, while providers are given time to satisfy the severity contract.
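A rough sketch of how that chain could look, assuming a provider surfaces a permanent failure through an existing condition with severity Error (the reasons, machine/instance names, and the aggregate condition name are illustrative only):

```
# On the Machine, bubbled up from the infraMachine by the Machine controller:
conditions:
- type: InfrastructureReady
  status: "False"
  severity: Error                  # the severity "x" the MS controller keys on
  reason: InstanceTerminated       # hypothetical provider reason
  message: "instance i-0abc123 was terminated out of band"

# On the owning MachineSet, the aggregated condition "c" over all owned Machines:
conditions:
- type: MachinesSucceeded          # hypothetical aggregate condition name
  status: "False"
  severity: Error
  reason: InstanceTerminated
  message: "Machine md-0-abc12-xyz45: instance i-0abc123 was terminated out of band"
```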

@fabriziopandini @yastij does this align with your thoughts or were you thinking something different?

@sbueringer
Member

A question for my understanding

Machine controller bubbles up conditions with severity x from infraMachines into Machines.

Does this mean we take the InfraMachine condition and use it 1:1 in the Machine, i.e. we would e.g. get an AWS InstanceReady condition into the conditions array of the Machine and then bubble the same condition up 1:1 to the MachineSet etc.?

Or alternatively, we look at the conditions from the InfraMachine and aggregate them into a single condition in the Machine and bubble that directly up / or aggregate it further when bubbling up.

Can you maybe give an example of how a specific failure condition of a specific provider would be bubbled up / reflected in the Machine/MachineSet/MachineDeployment?

Apologies if that was already discussed and I just missed it. I'm just having a hard time figuring out how exactly this would work.

@fabriziopandini
Member

@enxebre @sbueringer I have written some notes about how terminal failures should work at the end of https://docs.google.com/document/d/1hBQnWWa5d16FOslNhDwYVOhcMjLIul4tMeUgh4maI3w/edit#

Happy to chat about it and to rally to make it an amendment to the condition proposal.
WRT the current PR/issue, I think the ask to have failureReason/failureMessage bubble up Machine failures is fair (while those fields exist, they should behave properly). However, I would decouple this fix from the work described in the doc, which is more long-term.

@jdef
Contributor

jdef commented May 6, 2022

this looks stalled. how can we move this forward?

@enxebre
Member Author

enxebre commented May 9, 2022

this looks stalled. how can we move this forward?
Let's try to find consensus on the doc and move forward with #6025 first

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 7, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 6, 2022
@jdef
Contributor

jdef commented Sep 6, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 6, 2022
enxebre added a commit to enxebre/hypershift that referenced this pull request Nov 30, 2022
At the moment, if the infra for a Machine or the software for a Node fails to operate, it's completely invisible to NodePool API consumers.
We need to solve aggregation upstream:
kubernetes-sigs/cluster-api#6218
kubernetes-sigs/cluster-api#6025

This PR is a stopgap to mitigate the above and facilitate reactions and decisions for NodePool consumers.

Examples
In progress:
```
- lastTransitionTime: "2022-11-30T09:03:08Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-n5lpf: InstanceProvisionStarted
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: InstanceProvisionStarted
  observedGeneration: 6
  reason: InstanceProvisionStarted
  status: "False"
  type: AllMachinesReady
- lastTransitionTime: "2022-11-30T09:03:08Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-n5lpf: WaitingForNodeRef
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: WaitingForNodeRef
  observedGeneration: 6
  reason: WaitingForNodeRef
  status: "False"
  type: AllNodesHealthy
```

```
- lastTransitionTime: "2022-11-30T09:03:27Z"
  message: All is well
  observedGeneration: 6
  reason: AsExpected
  status: "True"
  type: AllMachinesReady
- lastTransitionTime: "2022-11-30T09:03:08Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-n5lpf: NodeProvisioning
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: NodeProvisioning
  observedGeneration: 6
  reason: NodeProvisioning
  status: "False"
  type: AllNodesHealthy
```

Failure - Instance terminated out of band
```
- lastTransitionTime: "2022-11-30T09:10:59Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: InstanceTerminated
  observedGeneration: 6
  reason: InstanceTerminated
  status: "False"
  type: AllMachinesReady
- lastTransitionTime: "2022-11-30T09:10:59Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: NodeConditionsFailed
  observedGeneration: 6
  reason: NodeConditionsFailed
  status: "False"
  type: AllNodesHealthy
```
enxebre added a commit to enxebre/hypershift that referenced this pull request Dec 1, 2022
enxebre added a commit to enxebre/hypershift that referenced this pull request Dec 1, 2022
enxebre added a commit to enxebre/hypershift that referenced this pull request Dec 2, 2022
enxebre added a commit to enxebre/hypershift that referenced this pull request Dec 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 5, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 4, 2023
Member

@vincepri vincepri left a comment

Closing this due to inactivity

/close

@k8s-ci-robot
Contributor

@vincepri: Closed this PR.

In response to this:

Closing this due to inactivity

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Machine failures not propagated up to MachineDeployment.Status

7 participants