@jaypoulz jaypoulz commented Oct 21, 2025

Introduces the tnf.etcd.openshift.io/v1alpha1 API group with a PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. It is not gated because it's only used by CEO once the two-node cluster has transitioned.

Works in conjunction with openshift/cluster-etcd-operator#1487

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 21, 2025
openshift-ci-robot commented Oct 21, 2025

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Gated by DualReplica feature and managed by two-node-fencing component.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

Hello @jaypoulz! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 21, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign joelspeed for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Oct 21, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Oct 21, 2025
@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 4 times, most recently from 2ba442d to 29b9fec Compare October 21, 2025 23:56
@saschagrunert
Member

@jaypoulz thank you for the PR, do you mind making the CI happy?

@jaypoulz
Author

Hi @saschagrunert :) Working on it! :D
New to this repo so working through beginner challenges 😸

@jaypoulz
Author

A few open questions I have:

  1. This is a config object of a sort. It's created by cluster-etcd-operator only when you have a two-node cluster and only for the purposes of gathering information about the health of Pacemaker (our HA tool) from the nodes. I put it in etcd/tnf (two node fencing) because it seemed sensible. But I'm not sure if it needs to be in config.

That said, it doesn't work like a normal config - there's no spec and it shouldn't be created during bootstrap. The CRD just needs to be present when the CEO runs a cronjob to post an update to it.

  2. bash hack/update-protobuf.sh failed for me because it wanted the path to be installed in my GOPATH. That said, Cursor happily runs it and copies over the files without issue. I'm just skeptical of the zz_generated files, but I assume those are verified by CI?

  3. For the non-boolean enum fields: should I be creating static string definitions that can be exported to CEO? How do I generate those?

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 2 times, most recently from b0ff230 to 1b57b09 Compare October 22, 2025 16:59
@openshift-ci openshift-ci bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 22, 2025
@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 4 times, most recently from b9b727f to fdd53e9 Compare October 22, 2025 20:37
@saschagrunert
Member

saschagrunert commented Oct 23, 2025

Yeah, I'll ignore the CI failures for now; running ./hack/update-codegen.sh locally also gives me a diff in openapi/generated_openapi/zz_generated.openapi.go. 🙃

> A few open questions I have:
>
> 1. This is a config object of a sort. It's created by cluster-etcd-operator only when you have a two-node cluster and only for the purposes of gathering information about the health of pacemaker (our ha tool) from the nodes. I put it in etcd/tnf (two node fencing) because it seemed sensible. But I'm not sure if it needs to be in config.

I'm new to API review, but my gut feeling tells me that a dedicated etcd API group sounds fine for that purpose.

> That said, it doesn't work like a normal config - there's no spec and it shouldn't be created during bootstrap. The CRD just needs to be present when the CEO runs a cronjob to post an update to it.
>
> 2. bash hack/update-protobuf.sh failed for me because it wanted the path to be installed in my go path. That said, cursor happily runs it and copies over the files without issue. I'm just skeptical of the zz_generated files, but I assume those are verified by CI?

You can also try running it in a container with make verify-with-container.

> 3. For the non-boolean enum fields. Should I be creating static string definitions that can be exported to CEO? How do I generate those?

Do you mind elaborating on that? Do you mean generating the code for the unions?

API docs ref: https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#writing-a-union-in-go


@jaypoulz is there an OpenShift enhancement available for this change?

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 3 times, most recently from c620199 to 6b36b92 Compare October 23, 2025 17:37
@jaypoulz
Author

What I was asking about was: https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#do-not-use-boolean-fields

Do we usually provide constants for the non-boolean fields for easy reference?

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 5 times, most recently from 3bfc09e to b505119 Compare October 23, 2025 22:13
@jaypoulz
Author

> 3. For the non-boolean enum fields. Should I be creating static string definitions that can be exported to CEO? How do I generate those?
>
> Do you mind elaborating on that? Do you mean generating the code for the unions?
>
> API docs ref: https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#writing-a-union-in-go

I saw the kinds of constants I was thinking about in the control plane topology type, so I decided to proceed in that direction. It should be more obvious what I meant now. :)
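
For readers following along, the constants pattern being discussed can be sketched roughly as below. This is only an illustration: the type name and its two values are taken from the `+kubebuilder:validation:Enum=Quorate;NoQuorum` marker quoted later in this review, while the constant names and the `hasQuorum` helper are hypothetical and not taken from the PR.

```go
package main

import "fmt"

// QuorumStatusType represents the quorum status of a Pacemaker cluster.
// Valid values are Quorate and NoQuorum.
// +kubebuilder:validation:Enum=Quorate;NoQuorum
type QuorumStatusType string

const (
	// QuorumStatusQuorate (hypothetical name) means the cluster has quorum.
	QuorumStatusQuorate QuorumStatusType = "Quorate"
	// QuorumStatusNoQuorum (hypothetical name) means the cluster does not have quorum.
	QuorumStatusNoQuorum QuorumStatusType = "NoQuorum"
)

// hasQuorum is a hypothetical consumer-side helper. A consumer such as CEO
// could compare against the exported constants instead of string literals.
func hasQuorum(s QuorumStatusType) bool {
	return s == QuorumStatusQuorate
}

func main() {
	for _, s := range []QuorumStatusType{QuorumStatusQuorate, QuorumStatusNoQuorum, ""} {
		fmt.Printf("%q quorate=%v\n", s, hasQuorum(s))
	}
}
```

Exporting the constants next to the type keeps the API package as the single source of the allowed strings, so a consumer does not have to repeat them.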

@jaypoulz
Author

@saschagrunert CI is happy 🥹

@jaypoulz
Author

Or at least it was when I wrote that comment - I decided to update the READMEs. 🤞 I didn't break anything

@saschagrunert
Member

/test okd-scos-e2e-aws-ovn

Member

@saschagrunert saschagrunert left a comment


Thank you for the links! My review is mostly about docs and naming conventions. Let's improve on those. 👍

@@ -0,0 +1,3 @@
swaggerdocs:
  commentPolicy: Warn

Member

nit: remove empty line

Suggested change

// +kubebuilder:validation:Optional
// +groupName=etcd.openshift.io
package v1alpha1

Member

nit: remove that empty EOL; tools like gofmt will complain about it.

Suggested change

)

// QuorumStatusType represents the quorum status of a Pacemaker cluster
// +kubebuilder:validation:Enum=Quorate;NoQuorum
Member

I think we need to document the valid values, like:

Suggested change
- // +kubebuilder:validation:Enum=Quorate;NoQuorum
+ // Valid values are Quorate (cluster has quorum) and NoQuorum (cluster does not have quorum).
+ // +kubebuilder:validation:Enum=Quorate;NoQuorum

The same applies to NodeOnlineStatusType, NodeModeType, ResourceActiveStatusType below.

Comment on lines 85 to 87
// status contains the actual pacemaker cluster status information collected from the cluster.
// +optional
Status *PacemakerStatusStatus `json:"status,omitempty"`
Member

Please document what happens when it's not present.

Suggested change
- // status contains the actual pacemaker cluster status information collected from the cluster.
- // +optional
- Status *PacemakerStatusStatus `json:"status,omitempty"`
+ // status contains the actual pacemaker cluster status information collected from the cluster.
+ // When not present, …
+ // +optional
+ Status *PacemakerStatusStatus `json:"status,omitempty"`

Comment on lines 96 to 98
// lastUpdated is the timestamp when this status was last updated
// +optional
LastUpdated metav1.Time `json:"lastUpdated,omitempty"`
Member

Same here: in which cases can this be absent? Please document it.

Author

this one convinced me that last updated should be required :)

Author

Question - this is the only required field, but the linter is unhappy if I don't include omitempty on it. Am I doing something wrong?

Comment on lines 148 to 152
// pacemakerdState indicates if pacemaker is running
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=16
// +optional
PacemakerdState string `json:"pacemakerdState,omitempty"`
Member

Consider an enum type here instead of just a string.

The same suggestion would apply to ResourceStatus.Role, NodeHistoryEntry.Operation, FencingEvent.Action and FencingEvent.Status.

Author

NodeHistoryEntry.Operation is not a great fit for this, because resource agents can define custom operations. I don't want to validate our way out of potentially helpful information. The others I think I can nail down.
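
The trade-off described here (a closed enum for fields with a known value set, a length-bounded free-form string for operations that resource agents can extend) can be sketched as follows. This is an illustration under stated assumptions: the enum values, the `maxOperationLength` bound, and both helper functions are hypothetical, and the checks only mimic in plain Go what `Enum` and `MinLength`/`MaxLength` markers would ask the API server to enforce.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// PacemakerDaemonStateType is a hypothetical enum type. The values below are
// illustrative placeholders, not the ones chosen in the PR.
type PacemakerDaemonStateType string

const (
	PacemakerDaemonStateRunning PacemakerDaemonStateType = "Running"
	PacemakerDaemonStateUnknown PacemakerDaemonStateType = "Unknown"
)

// validDaemonState mimics an Enum marker: only a closed set is accepted.
func validDaemonState(s PacemakerDaemonStateType) bool {
	switch s {
	case PacemakerDaemonStateRunning, PacemakerDaemonStateUnknown:
		return true
	}
	return false
}

// maxOperationLength is an assumed bound chosen for this sketch.
const maxOperationLength = 32

// validOperation mimics MinLength/MaxLength markers: any non-empty string up
// to the bound is accepted, so custom resource-agent operations still pass.
func validOperation(op string) bool {
	n := utf8.RuneCountInString(op)
	return n >= 1 && n <= maxOperationLength
}

func main() {
	fmt.Println(validDaemonState("Running"), validDaemonState("Stopped"))
	fmt.Println(validOperation("promote"), validOperation("my-custom-op"), validOperation(""))
}
```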

}

// PacemakerStatusStatus contains the actual pacemaker cluster status information
type PacemakerStatusStatus struct {
Member

We could rename PacemakerStatus to Pacemaker and PacemakerStatusStatus to PacemakerStatus to avoid the doubled status status.

Author

I decided to go with PacemakerCluster for the top-level object since Pacemaker on its own just felt wrong.
I prefixed the other potentially conflicting names with Pacemaker.

}

// NodeStatus represents the status of a single node in the Pacemaker cluster
type NodeStatus struct {
Member

Consider naming that PacemakerNodeStatus to avoid (future) conflicts.

}

// ResourceStatus represents the status of a single resource in the Pacemaker cluster
type ResourceStatus struct {
Member

Consider naming that PacemakerResourceStatus to avoid (future) conflicts.


// PacemakerSummary provides a high-level summary of cluster state
type PacemakerSummary struct {
// pacemakerdState indicates if pacemaker is running
Member

Does Pacemakerd refer to a daemon? If so, we should probably name it PacemakerDaemonState to make the naming clearer.

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 3 times, most recently from f6f91ba to 4fb527a Compare October 24, 2025 16:21
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 24, 2025
@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 5 times, most recently from 3f45017 to 2fb0282 Compare October 24, 2025 21:15
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 24, 2025
@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 2 times, most recently from 4d4fea5 to 6c69f2a Compare October 24, 2025 22:00
Introduces etcd.openshift.io/v1alpha1 API group with a PacemakerCluster
custom resource. This provides visibility into Pacemaker cluster health for
Two Node Fencing etcd deployments. The status-only resource is populated by a
privileged controller and consumed by the cluster-etcd-operator healthcheck
controller. This API is not gated because it's only created by CEO
once the transition to an ExternalEtcd has occurred.
@openshift-ci
Contributor

openshift-ci bot commented Oct 25, 2025

@jaypoulz: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/verify | cbe66c9 | link | true | /test verify |
| ci/prow/okd-scos-e2e-aws-ovn | cbe66c9 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/lint | cbe66c9 | link | true | /test lint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

5 participants