Update document about error handling #5462

gnufied · 2025-07-29T16:08:59Z

This PR documents error handling in external-resizer and how each type of error is handled.

gnufied · 2025-07-29T16:09:12Z

k8s-ci-robot · 2025-07-29T16:24:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gnufied
Once this PR has been reviewed and has the lgtm label, please ask for approval from jsafrane. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

keps/sig-storage/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

msau42 · 2025-07-29T17:24:04Z

@sunnylovestiramisu @AndrewSirenko

keps/sig-storage/3751-volume-attributes-class/README.md

AndrewSirenko · 2025-07-29T18:39:43Z

keps/sig-storage/3751-volume-attributes-class/README.md

+
+In general Kubernetes sidecars classify all CSI errors in three different classes. Namely:
+
+- Non-final errors (such as `DeadlineExceeded`), which indicate a transient error, which may be because of timeout or some other temporary failure.


I would add to the non-final errors description that the CO is unsure whether the volume was modified.

A CSI operation CAN time out even though the volume modification did occur in the storage provider, for example with storage providers where modifications may take a while.

I added some extra wording, though not exactly what you wrote above. Please check.
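
For readers following the thread, the three-way classification quoted above can be sketched roughly as below. This is only an illustration, not the external-resizer implementation: the `errorClass` type and `classify` function are hypothetical names, status codes are represented as plain strings to keep the sketch dependency-free, and treating `InvalidArgument` as infeasible is an assumption (the thread itself only names `DeadlineExceeded` as non-final and `InternalErr` as a final error).

```go
package main

import "fmt"

// errorClass is a hypothetical label for the three classes described in the KEP text.
type errorClass int

const (
	nonFinal   errorClass = iota // transient; the CO cannot tell whether the volume was modified
	final                        // the call definitely failed
	infeasible                   // final, and retrying the same request can never succeed
)

// String returns a readable name for the class.
func (c errorClass) String() string {
	switch c {
	case nonFinal:
		return "non-final"
	case infeasible:
		return "infeasible"
	default:
		return "final"
	}
}

// classify maps a gRPC status code name to an errorClass.
// The mapping is illustrative: DeadlineExceeded comes from the quoted text,
// while treating InvalidArgument as infeasible is an assumption of this sketch.
// Every other code is treated as final but not infeasible.
func classify(code string) errorClass {
	switch code {
	case "DeadlineExceeded":
		return nonFinal
	case "InvalidArgument":
		return infeasible
	default:
		return final
	}
}

func main() {
	for _, code := range []string{"DeadlineExceeded", "Internal", "InvalidArgument"} {
		fmt.Printf("%s -> %s\n", code, classify(code))
	}
}
```

The point raised in this thread corresponds to the `nonFinal` case: a timeout says nothing about whether the storage provider applied the change.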

AndrewSirenko · 2025-07-29T20:59:06Z

keps/sig-storage/3751-volume-attributes-class/README.md

+
+#### Handling of infeasible errors
+
+If volume modification to a VAC is failing with a final and infeasible error, then users can either set VAC to previously specified value in `status.currentVolumeAttributesClass` or set to `nil` if no VAC was specified. In both the cases, external-resizer will stop trying to reconcile the volume modification. 


> If volume modification to a VAC is failing with a final and infeasible error

I thought we ONLY cancel modification on an infeasible error.

This is to prevent partial modification on final errors like InternalErr, which could lead to half-modified volumes for drivers.

I was also thinking we ONLY cancel modification (rollback) on an infeasible error; kubernetes-csi/external-resizer#487 is based on this assumption.

I think he's just stating that infeasible errors are a subset of final errors. All infeasible errors are final, but not all final errors are infeasible. This should probably say:

"failing with an infeasible error (but not other final errors),"

Yes, I meant "and" to do some heavy lifting here. Since infeasible errors are already final, both conditions must be true. I will update the wording.

I fixed it. PTAL.
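
As a rough illustration of the rule this thread converged on, the sketch below stops reconciling only when the last error was infeasible and the user has reset the requested VAC to the one recorded in status (or to nil when none was ever recorded). The `pvc` struct, `sameVAC`, and `shouldStopReconciling` are hypothetical names for this sketch and are not taken from external-resizer; the API validation that rejects setting nil when a current VAC exists is not modeled here.

```go
package main

import "fmt"

// pvc is a hypothetical, stripped-down stand-in for the PVC fields discussed here.
type pvc struct {
	specVAC    *string // VAC requested in the spec; nil means unset
	currentVAC *string // status.currentVolumeAttributesClass, as named in the quoted text
}

// sameVAC reports whether two optional VAC names are equal:
// either both are unset, or both are set to the same name.
func sameVAC(a, b *string) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

// shouldStopReconciling sketches the rule: reconciliation is abandoned only
// when the last error was infeasible AND the requested VAC matches the VAC
// the volume currently has. Other final errors never cancel the modification.
func shouldStopReconciling(p pvc, lastErrInfeasible bool) bool {
	return lastErrInfeasible && sameVAC(p.specVAC, p.currentVAC)
}

func main() {
	gold, silver := "gold", "silver"

	// User reset the VAC to the current one after an infeasible error.
	fmt.Println(shouldStopReconciling(pvc{specVAC: &silver, currentVAC: &silver}, true))
	// User still requests a different VAC.
	fmt.Println(shouldStopReconciling(pvc{specVAC: &gold, currentVAC: &silver}, true))
	// User reset the VAC, but the last error was final and not infeasible.
	fmt.Println(shouldStopReconciling(pvc{specVAC: &silver, currentVAC: &silver}, false))
	// No VAC was ever applied and the user set it back to nil.
	fmt.Println(shouldStopReconciling(pvc{}, true))
}
```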

bswartz · 2025-07-29T23:10:51Z

keps/sig-storage/3751-volume-attributes-class/README.md

+
+#### Handling of infeasible errors
+
+If volume modification to a VAC is failing with a final and infeasible error, then users can either set VAC to previously specified value in `status.currentVolumeAttributesClass` or set to `nil` if no VAC was specified. In both the cases, external-resizer will stop trying to reconcile the volume modification. 


I think he's just stating that infeasible errors are a subset of final errors. All infeasible errors are final, but not all final errors are infeasible. This should probably say:

"failing with an infeasible error (but not other final errors),"

AndrewSirenko

/lgtm

huww98 · 2025-07-31T14:25:59Z

keps/sig-storage/3751-volume-attributes-class/README.md

+
+Please note if PVC already had a `currentVolumeAttributesClass` in its status, then setting VAC to `nil` is not allowed.
+
+It is possible that if there were one or more partial volume modifications that happened before on the volume, they will not be undone when this happens because for infeasible errors no `ControllerModifyVolume` will be called when user resets the VAC. This mechanism exists only to prevent perpetual call to `ControllerModifyVolume` for volume modifications which are never going to succeed. Storage providers and users are recommended to roll forward to different VAC, if desired behaviour is resetting the VAC to some pre-specified value for all `mutable_parameters`.


As a developer of a CSI driver and a cluster admin of our infrastructure, I still cannot accept this.

> they will not be undone

This means that when I specify that my volume should have 2000 IOPS and PVC.status tells me the reconcile has finished, my volume may actually have only 1000 IOPS. I can never observe the anomaly from the Kubernetes API until something more serious goes wrong:

- If the performance is higher than expected, it will incur extra cost.
- If the performance is lower than expected, it can result in unexpected latency for the workload, or even catastrophic system failure.
- If we add topology integration to VAC, the PV nodeAffinity can also be out of sync, which will cause Pods to stay pending or get stuck because they were scheduled to the wrong node.
- It is also subject to potential quota abuse.

> This mechanism exists only to prevent perpetual call

This does not make sense. After the VAC is rolled back, if the volume is already at the desired state, the SP should just return OK and do nothing. There is no reason the call would be perpetual. If the volume is actually partially modified and cannot be rolled back by the SP, it is better to let the user notice this than to hide it.

We never end the reconcile process with a failed gRPC call. For example:

- We only delete the VolumeAttachment if ControllerUnpublishVolume returns OK.
- We only clear PVC.Status.AllocatedResourceStatuses if ControllerExpandVolume returns OK.

So we should do the same: only clear PVC.Status.ModifyVolumeStatus if ControllerModifyVolume returns OK, and never cancel modification.

> Storage providers and users are recommended to roll forward to different VAC, if desired behaviour is resetting the VAC to some pre-specified value for all mutable_parameters.

In Kubernetes, the spec specifies the desired state, not an action. Whichever state the user specifies, we should try to bring the underlying system to that state. It would be ridiculous if two VACs specified the same state but only one of them worked.
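
The alternative argued for in this comment can be sketched as follows, under stated assumptions. `fakeDriver`, `status`, `reconcile`, and the IOPS limit are all hypothetical and exist only for this illustration. The driver is assumed to be idempotent (it returns OK when the volume is already at the requested state) and to leave the volume partially modified when a request fails.

```go
package main

import (
	"errors"
	"fmt"
)

// maxIOPS is an illustrative backend limit, not a real provider value.
const maxIOPS = 3000

var errInfeasible = errors.New("infeasible: requested IOPS exceeds backend limit")

// fakeDriver is a hypothetical stand-in for a storage provider.
type fakeDriver struct {
	iops int // actual state of the volume on the backend
}

// modifyVolume mimics an idempotent ControllerModifyVolume call.
// It returns nil without doing work if the volume is already at the target,
// and simulates a partial modification when the target is infeasible.
func (d *fakeDriver) modifyVolume(targetIOPS int) error {
	if d.iops == targetIOPS {
		return nil
	}
	if targetIOPS > maxIOPS {
		d.iops = maxIOPS // partial change applied before the failure
		return errInfeasible
	}
	d.iops = targetIOPS
	return nil
}

// status is a hypothetical stand-in for PVC.Status.ModifyVolumeStatus.
type status struct {
	modifyInProgress bool
}

// reconcile clears the in-progress status only after the driver returns OK
// for the currently requested target, including when that target is a rollback.
func reconcile(d *fakeDriver, targetIOPS int, st *status) error {
	st.modifyInProgress = true
	if err := d.modifyVolume(targetIOPS); err != nil {
		return err // status stays set, so the failure remains visible to the user
	}
	st.modifyInProgress = false
	return nil
}

func main() {
	d := &fakeDriver{iops: 1000}
	st := &status{}

	err := reconcile(d, 5000, st)
	fmt.Println(err, st.modifyInProgress, d.iops)

	// The user rolls back to the previous value; the driver is still called,
	// so the partial change is undone before the status is cleared.
	err = reconcile(d, 1000, st)
	fmt.Println(err, st.modifyInProgress, d.iops)
}
```

In this sketch the rollback path goes through the same driver call as any other target, so the status is never cleared while the backend disagrees with the spec.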

k8s-ci-robot · 2025-07-31T16:04:06Z

New changes are detected. LGTM label has been removed.

k8s-ci-robot added the cncf-cla: yes and kind/kep labels Jul 29, 2025

k8s-ci-robot requested review from saad-ali and xing-yang July 29, 2025 16:09

k8s-ci-robot added the sig/storage label Jul 29, 2025

k8s-ci-robot added the size/M label Jul 29, 2025

k8s-ci-robot assigned jsafrane, msau42 and xing-yang Jul 29, 2025

gnufied mentioned this pull request Jul 29, 2025

Collapse the INVALID_ARGUMENTS error rows and clarify container-storage-interface/spec#597

Open

gnufied force-pushed the update-vac-kep branch from 3ce4800 to a15211b Compare July 29, 2025 16:24

sunnylovestiramisu reviewed Jul 29, 2025

sunnylovestiramisu reviewed Jul 29, 2025

sunnylovestiramisu reviewed Jul 29, 2025

AndrewSirenko reviewed Jul 29, 2025

gnufied force-pushed the update-vac-kep branch from a15211b to 1495478 Compare July 29, 2025 20:14

AndrewSirenko reviewed Jul 29, 2025

bswartz suggested changes Jul 29, 2025

AndrewSirenko reviewed Jul 31, 2025

k8s-ci-robot assigned AndrewSirenko Jul 31, 2025

k8s-ci-robot added the lgtm label Jul 31, 2025

huww98 reviewed Jul 31, 2025

Update document about error handling

f533296

gnufied force-pushed the update-vac-kep branch from 1495478 to f533296 Compare July 31, 2025 16:04

k8s-ci-robot removed the lgtm label Jul 31, 2025
