-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Description
Today when the MasterService shuts down, it fails waiting tasks but does not necessarily fail the ongoing batch of tasks. For instance, we just drop the batch on the floor here:
elasticsearch/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java
Lines 209 to 213 in 356e109
| if (lifecycle.started() == false) { | |
| logger.debug("processing [{}]: ignoring, master service not started", summary); | |
| listener.onResponse(null); | |
| return; | |
| } |
and we swallow rejections here:
elasticsearch/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java
Lines 398 to 405 in 356e109
| assert publicationMayFail() || (exception instanceof EsRejectedExecutionException esre && esre.isExecutorShutdown()) | |
| : exception; | |
| clusterStateUpdateStatsTracker.onPublicationFailure( | |
| threadPool.rawRelativeTimeInMillis(), | |
| clusterStatePublicationEvent, | |
| 0L | |
| ); | |
| handleException(summary, publicationStartTime, newClusterState, exception); |
This behaviour has existed for a long time (i.e. it was not introduced by recent changes in the area such as #92021 and #94325) but I still think we should improve it. Note however that it does not work simply to fail the ongoing tasks on rejection: today with acked tasks we call (at most) one of onAllNodesAcked(), onAckFailure(), onAckTimeout(), or ClusterStateTaskListener#onFailure(), and implementations rely on this fact, but we may experience a rejection exception after acking has completed. I think that means we have to delay the acking until the end of the publication, because the alternative would be to suppress onFailure() calls for acked tasks which seems like a confusing API choice that will lead to bugs.