[Transform] Transforms can get stopped during a rolling upgrade for many (>8) running transforms #81796

@hendrikmuhs

Affected version: 7.12-
Fixed with: 8.1

Transform supports rolling upgrades: transforms can remain running during the upgrade process, and the built-in failover mechanism relocates the transform persistent task to another node when the node running the transform gets upgraded.

If you have more than 8 running transforms, a problem might occur due to a limited thread pool. During the upgrade, more than 8 transforms might be resumed at once. Because an internal thread pool has a fixed size of 4 and a waiting queue of 4, the thread pool might be exhausted when the 9th transform tries to start:

persistent task ... failed
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.persistent.NodePersistentTasksExecutor$1@2d5a8eeb on EsThreadPoolExecutor[name = instance-xyz/transform_indexing, queue capacity = 4, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1b757dd4[Running, pool size = 4, active threads = 4, queued tasks = 4, completed tasks = 0]]
	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:37) ~[elasticsearch-7.16.1.jar:7.16.1]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:833) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1365) ~[?:?]

If this failure occurs, the transform goes into the stopped state without any message.
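
The rejection can be reproduced outside Elasticsearch with a plain JDK executor. The following is a minimal sketch, assuming that a `java.util.concurrent.ThreadPoolExecutor` with 4 threads, a bounded queue of 4, and an abort policy is a fair stand-in for the `transform_indexing` pool; the class and method names are made up for illustration and do not come from the Elasticsearch code base.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class FixedPoolExhaustion {

    /**
     * Submits {@code tasks} blocking tasks to a fixed pool with a bounded queue.
     * Returns the 1-based index of the first rejected task, or -1 if none was rejected.
     */
    static int firstRejected(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueCapacity),
            new ThreadPoolExecutor.AbortPolicy()); // reject instead of blocking the caller
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 1; i <= tasks; i++) {
                try {
                    // Each task blocks until released, simulating a slow start phase.
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    return i; // all threads busy and the queue is full
                }
            }
            return -1;
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 4 running + 4 queued = 8 accepted tasks, so the 9th is rejected.
        System.out.println(firstRejected(4, 4, 9)); // prints 9
    }
}
```

With 8 or fewer tasks the method returns -1, which matches the observation that the problem only shows up with more than 8 running transforms.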

Sub-Issue: It seems that the persistent task assignment fails without retrying. The thread pool is only temporarily exhausted, so the system can recover if it retries. We should check why it doesn't retry.

Sub-Issue: The transform should not go into the stopped state without any message. A transform should never stop on its own; if it gets stopped by the system, it has failed. That means that once retries are exhausted, the transform should go into the failed state with a proper message.

Transform switches to a different thread pool (generic) after initialization, so the limit only applies during a very short time window.
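
The behavior asked for in the two sub-issues (retry while the pool is temporarily full, then fail with a message) could look roughly like the sketch below. It is not the actual fix: `executeWithRetry`, its parameters, and the doubling backoff are hypothetical choices made for illustration, and it uses only the plain JDK `Executor` interface.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;

public class RetryOnRejection {

    /**
     * Tries to hand {@code task} to {@code executor}, retrying on rejection.
     * Returns the attempt number that succeeded; throws with a descriptive
     * message once {@code maxAttempts} is exhausted.
     */
    static int executeWithRetry(Executor executor, Runnable task, int maxAttempts, long initialBackoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                executor.execute(task);
                return attempt;
            } catch (RejectedExecutionException e) {
                if (attempt >= maxAttempts) {
                    // Surface a failure with a message instead of stopping silently.
                    throw new IllegalStateException("task rejected after " + attempt + " attempts", e);
                }
                try {
                    Thread.sleep(backoff); // give the pool time to drain
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
                backoff *= 2; // hypothetical policy: double the wait each time
            }
        }
    }
}
```

Because the exhaustion only lasts for the short start phase, even a small number of attempts with a short backoff would likely be enough in this scenario.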

Mitigation

  • Start the transform again. As explained above, the thread pool limitation only applies to the start phase.
  • Increase the thread pool size for transform_indexing, as explained here.
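
For the first mitigation, a stopped transform can be restarted through the start transform API (`POST _transform/<transform_id>/_start`). The sketch below only builds that request with the JDK HTTP client; the base URL and transform ID are placeholders, and authentication and TLS setup are left out on the assumption that they depend on the cluster.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class StartTransformRequest {

    /** Builds a POST request against the start transform endpoint. */
    static HttpRequest build(String baseUrl, String transformId) {
        String id = URLEncoder.encode(transformId, StandardCharsets.UTF_8);
        URI uri = URI.create(baseUrl + "/_transform/" + id + "/_start");
        return HttpRequest.newBuilder(uri)
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) {
        // Placeholder values; replace with the real cluster address and transform ID.
        HttpRequest request = build("http://localhost:9200", "my-transform");
        System.out.println(request.method() + " " + request.uri());
        // To actually send it:
        // java.net.http.HttpClient.newHttpClient()
        //     .send(request, java.net.http.HttpResponse.BodyHandlers.ofString());
    }
}
```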

Origin

The issue dates back to the very beginning: the code was inherited from rollup. Rollup removed the extra thread pool with #65958.
