🌱 Bubble up machines permanent failure as MS/MD conditions #6218
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment. The full list of commands accepted by this bot can be found here.
/hold
PTAL @vincepri @fabriziopandini @sbueringer @CecileRobertMichon @Arvinderpal
There's a UX gap: permanent machine failures are not signaled in MachineSets and MachineDeployments. This PR solves that by bubbling up permanent Machine failures, a.k.a. FailureMessage/FailureReason, as a MachineSet and MachineDeployment condition: MachinesSucceededCondition.
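For illustration only, here is a minimal sketch of what such a roll-up could look like in the MachineSet controller, assuming a condition type named MachinesSucceeded and reading the existing Machine status.failureReason/failureMessage fields. The helper and reason names are hypothetical, not the code proposed in this PR:

```go
package example

import (
	"fmt"
	"strings"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// Illustrative names only; the exact condition type and reason are defined by the PR.
const (
	machinesSucceededCondition clusterv1.ConditionType = "MachinesSucceeded"
	machineFailedReason                                = "MachineFailed"
)

// setMachinesSucceeded bubbles up permanent Machine failures
// (status.failureReason/failureMessage) onto the owning MachineSet as a single
// aggregate condition; a MachineDeployment controller could mirror the same
// condition from its MachineSets.
func setMachinesSucceeded(ms *clusterv1.MachineSet, machines []*clusterv1.Machine) {
	var failures []string
	for _, m := range machines {
		if m.Status.FailureReason != nil || m.Status.FailureMessage != nil {
			failures = append(failures, fmt.Sprintf("Machine %s: %s %s",
				m.Name, deref(m.Status.FailureReason), deref(m.Status.FailureMessage)))
		}
	}
	if len(failures) == 0 {
		conditions.MarkTrue(ms, machinesSucceededCondition)
		return
	}
	conditions.MarkFalse(ms, machinesSucceededCondition, machineFailedReason,
		clusterv1.ConditionSeverityError, "%s", strings.Join(failures, "\n"))
}

// deref prints a possibly-nil string-like pointer (FailureReason is a string type).
func deref[T ~string](p *T) string {
	if p == nil {
		return ""
	}
	return string(*p)
}
```

Whether such a permanent failure should keep the condition from ever flipping back to true is exactly the open question discussed below.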
Force-pushed from 6571428 to ad8ea38
@enxebre first of all thanks for giving this problem a go; users can really benefit from it if we can get to an agreement. I'm personally in favor of simplifying the user experience as much as possible, and thus:
E.g. if in CAPA there is a permanent failure while provisioning the VPC, it should surface into the VPCReady condition, status false, severity error. I think that having a separate condition could be really confusing for the user, because in the example above we would have MachineSuccessful reporting a fatal error provisioning the VPC, but VPCReady would report something else. Unfortunately, however, there is no agreement on this topic, as per #3692; if I got it right, mostly due to how we should treat these conditions as permanent. A possible way out might be to introduce a new severity level called "fatal", and make sure that util/conditions treats it as immutable/highest priority...
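To make the "fatal" idea concrete, here is a small, purely hypothetical sketch: a ConditionSeverityFatal value does not exist in Cluster API today, and the priority function only illustrates how merge/aggregation logic could be taught to treat it as immutable and highest priority:

```go
package example

import clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"

// ConditionSeverityFatal is hypothetical; it is not part of the Cluster API types.
const ConditionSeverityFatal clusterv1.ConditionSeverity = "Fatal"

// severityPriority sketches the ordering a merge strategy might use: the
// highest-priority severity wins when several conditions are rolled into one,
// and a Fatal condition would never be overwritten by later reconciles.
func severityPriority(s clusterv1.ConditionSeverity) int {
	switch s {
	case ConditionSeverityFatal: // hypothetical: permanent failures
		return 3
	case clusterv1.ConditionSeverityError:
		return 2
	case clusterv1.ConditionSeverityWarning:
		return 1
	default: // Info / None
		return 0
	}
}
```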
I'm not seeing how it would be feasible to have different Machine conditions for every permanent failure of every provider, nor how to signal that in MS/MD without a common, provider-agnostic condition for permanent failures. So my proposal would be:
Thoughts?
Discussed in the community meeting on Wed 2 Mar 2022 that using particular ConditionSeverities would be preferred over perpetuating failureMessage in infraMachines. The workflow could look as follows (a rough sketch follows below):
@fabriziopandini @yastij does this align with your thoughts, or were you thinking of something different?
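Under that stated preference (condition severities instead of failureMessage), the workflow might look like the sketch below: the infra provider marks its own readiness condition False with Severity=Error when it detects a permanent failure, and the Machine/MachineSet/MachineDeployment layers aggregate from there. The condition type and reason names are placeholders, not any specific provider's API:

```go
package example

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// Placeholder names standing in for whatever the infra provider already defines.
const (
	instanceReadyCondition   clusterv1.ConditionType = "InstanceReady"
	instanceTerminatedReason                         = "InstanceTerminated"
)

// markPermanentFailure records a terminal infra failure as a condition with
// Severity=Error instead of writing status.failureReason/failureMessage, so the
// failure can be surfaced and aggregated like any other condition.
func markPermanentFailure(infraMachine conditions.Setter, instanceID string) {
	conditions.MarkFalse(infraMachine, instanceReadyCondition, instanceTerminatedReason,
		clusterv1.ConditionSeverityError,
		"instance %s was terminated out of band", instanceID)
}
```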
A question for my understanding:
Does this mean we take the InfraMachine condition and use it 1:1 in the Machine, i.e. we would e.g. get an AWS-specific condition directly on the Machine? Or alternatively, do we look at the conditions from the InfraMachine, aggregate them into a single condition in the Machine, and bubble that up directly / aggregate it further when bubbling up? Can you maybe give an example of how a specific failure condition of a specific provider would be bubbled up / reflected in the Machine/MachineSet/MachineDeployment? Apologies if that was already discussed and I just missed it. I'm just having a hard time figuring out how this would exactly work.
@enxebre @sbueringer I have written some notes about how terminal failures should work at the end of https://docs.google.com/document/d/1hBQnWWa5d16FOslNhDwYVOhcMjLIul4tMeUgh4maI3w/edit# Happy to chat about it and to rally to make it an amendment to the condition proposal.
This looks stalled. How can we move this forward?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
At the moment, if the infra for a Machine or the software for a Node fails to operate, it is completely invisible to NodePool API consumers. We need to solve aggregation upstream: kubernetes-sigs/cluster-api#6218 kubernetes-sigs/cluster-api#6025. This PR is a stopgap to mitigate the above and facilitate reaction and decisions for NodePool consumers.

Examples

In progress:
```yaml
- lastTransitionTime: "2022-11-30T09:03:08Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-n5lpf: InstanceProvisionStarted
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: InstanceProvisionStarted
  observedGeneration: 6
  reason: InstanceProvisionStarted
  status: "False"
  type: AllMachinesReady
- lastTransitionTime: "2022-11-30T09:03:08Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-n5lpf: WaitingForNodeRef
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: WaitingForNodeRef
  observedGeneration: 6
  reason: WaitingForNodeRef
  status: "False"
  type: AllNodesHealthy
```

```yaml
- lastTransitionTime: "2022-11-30T09:03:27Z"
  message: All is well
  observedGeneration: 6
  reason: AsExpected
  status: "True"
  type: AllMachinesReady
- lastTransitionTime: "2022-11-30T09:03:08Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-n5lpf: NodeProvisioning
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: NodeProvisioning
  observedGeneration: 6
  reason: NodeProvisioning
  status: "False"
  type: AllNodesHealthy
```

Failure - Instance terminated out of band:
```yaml
- lastTransitionTime: "2022-11-30T09:10:59Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: InstanceTerminated
  observedGeneration: 6
  reason: InstanceTerminated
  status: "False"
  type: AllMachinesReady
- lastTransitionTime: "2022-11-30T09:10:59Z"
  message: |
    Machine agl-nested-us-east-1a-6d9fcd6565-k872l: NodeConditionsFailed
  observedGeneration: 6
  reason: NodeConditionsFailed
  status: "False"
  type: AllNodesHealthy
```
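For context, here is a sketch of how the AllMachinesReady-style messages above might be computed from the per-Machine Ready condition: one "Machine <name>: <reason>" line per machine that is not ready, with a single reason surfaced when they all agree. This is illustrative, assuming the condition names shown in the examples, and is not the actual HyperShift implementation:

```go
package example

import (
	"fmt"
	"sort"
	"strings"

	corev1 "k8s.io/api/core/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// aggregateMachineReady rolls the Ready condition of each Machine up into a
// single status/reason/message tuple like the AllMachinesReady examples above.
func aggregateMachineReady(machines []*clusterv1.Machine) (corev1.ConditionStatus, string, string) {
	var lines []string
	reasons := map[string]struct{}{}
	for _, m := range machines {
		ready := conditions.Get(m, clusterv1.ReadyCondition)
		if ready != nil && ready.Status == corev1.ConditionTrue {
			continue
		}
		reason := "Unknown"
		if ready != nil && ready.Reason != "" {
			reason = ready.Reason
		}
		lines = append(lines, fmt.Sprintf("Machine %s: %s", m.Name, reason))
		reasons[reason] = struct{}{}
	}
	if len(lines) == 0 {
		return corev1.ConditionTrue, "AsExpected", "All is well"
	}
	// Surface the shared reason when all not-ready machines agree, else a generic one.
	aggregateReason := "MultipleReasons"
	if len(reasons) == 1 {
		for r := range reasons {
			aggregateReason = r
		}
	}
	sort.Strings(lines)
	return corev1.ConditionFalse, aggregateReason, strings.Join(lines, "\n")
}
```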
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Closing this due to inactivity
/close
@vincepri: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
There's a UX gap: permanent machine failures are not signaled in MachineSets and MachineDeployments.
This PR solves that by bubbling up permanent Machine failures, a.k.a. FailureMessage/FailureReason, as a MachineSet and MachineDeployment condition: MachinesSucceededCondition.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #5635
Needs #6025 for the conditions to be set in updateStatus