- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.4k
🌱 Add Node related condition to Machine conditions #3670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🌱 Add Node related condition to Machine conditions #3670
Conversation
| Skipping CI for Draft Pull Request. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first pass
6a2d6b7    to
    eb2fff6      
    Compare
  
    | Unit test fails because CacheTracker panics in the tests that doesn't use testEnv. Probably will need to use testEnv for the tests using CacheTracker. | 
eb2fff6    to
    e2c6e86      
    Compare
  
    | /milestone v0.4.0 | 
995c571    to
    663a3eb      
    Compare
  
    | /test pull-cluster-api-test | 
42b6249    to
    52b6450      
    Compare
  
    | This was ready to go, just solved conflicts. | 
| /lgtm | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to take a bit more time to review but it seems we're potentially changing some behaviors that we should first chat about
| } | ||
| conditions.MarkFalse(m, clusterv1.MachineNodeHealthyCondition, clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "") | ||
| if err := patchMachine(ctx, patchHelper, m); err != nil { | ||
| conditions.MarkFalse(m, clusterv1.MachineNodeHealthyCondition, clusterv1.DeletionFailedReason, clusterv1.ConditionSeverityInfo, "") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect this to fail as well if the patch doesn't go through above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Machines are patched in Reconcile after returning from reconcileDelete, so setting this in case it won't fail there.
| node, err := r.getNode(remoteClient, providerID) | ||
| if err != nil { | ||
| if err == ErrNodeNotFound { | ||
| // While a NodeRef is set in the status, failing to get that node means the node is deleted. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if it's a transient error? The node might not be deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are setting the condition and reconciling, if it is transient it will be fixed right away. Otherwise how to detect the transient error?
| if err != nil { | ||
| if err == ErrNodeNotFound { | ||
| // While a NodeRef is set in the status, failing to get that node means the node is deleted. | ||
| // If Status.NodeRef is not set before, node still can be in the provisioning state. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is a little confusing, don't we set the node reference a few lines down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this PR we were returning directly if the NodeRef is set.
| if machine.Status.NodeRef != nil { | 
Removed that check to be able to check the health of the node.
And here, if node cannot be found when the NodeRef is already assigned, that means the node is deleted. But if NodeRef is not yet set, that means there was never a node before as NodeRef is nil, and hence it is not a NodeNotFoundReason issue.
| Kind: node.Kind, | ||
| APIVersion: node.APIVersion, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these properly populated? Usually they're not, and they need to be filled in manually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is same with what was used before:
cluster-api/controllers/machine_controller_noderef.go
Lines 103 to 107 in fe28e9d
| return &apicorev1.ObjectReference{ | |
| Kind: node.Kind, | |
| APIVersion: node.APIVersion, | |
| Name: node.Name, | |
| UID: node.UID, | 
| I had renamed  cc @vincepri | 
| @sedefsavas: The following tests failed, say  
 Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. | 
| /hold cancel | 
| @sedefsavas if you squash commits, then this one can be merged | 
| MachineNodeHealthyCondition ConditionType = "NodeHealthy" | ||
|  | ||
| // WaitingForNodeRefReason (Severity=Info) documents a machine.spec.providerId is not assigned yet. | ||
| WaitingForNodeRefReason = "WaitingForNodeRef" | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: what about
| WaitingForNodeRefReason = "WaitingForNodeRef" | |
| WaitingForNodeProvisioningReason = "WaitingForNodeProvisioning" | 
NodeRef is a bit too technical for a user that does not know CAPI internals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a NodeProvisioningReason, I think it will be confusing.
WaitingForNodeRefReason  is waiting for machine.Spec.ProviderID and
NodeProvisioningReason is waiting for machine.Status.NodeRef when there is no node found.
394b4bb    to
    9506423      
    Compare
  
    | /approve | 
| [APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: fabriziopandini The full list of commands accepted by this bot can be found here. The pull request process is described here 
Needs approval from an approver in each of these files:
 
 Approvers can indicate their approval by writing  | 
| /lgtm | 
What this PR does / why we need it:
This PR adds a condition for Node health to Machine conditions that will be managed by Machine controller.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)format, will close the issue(s) when PR gets merged):Fixes # #3138