Description
Affected version: 7.12-
Fixed with: 8.1
Transform supports rolling upgrades: transforms can keep running during the upgrade process, and the built-in failover mechanism relocates the transform persistent task to another node when the node running the transform gets upgraded.
If you have more than 8 running transforms, a problem might occur because of a limited thread pool. During the upgrade, more than 8 transforms might be resumed at once. Transforms start on an internal thread pool (transform_indexing) that has a fixed size of 4 and a waiting queue of 4, so the thread pool might be exhausted when the 9th transform tries to start:
persistent task ... failed
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.persistent.NodePersistentTasksExecutor$1@2d5a8eeb on EsThreadPoolExecutor[name = instance-xyz/transform_indexing, queue capacity = 4, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1b757dd4[Running, pool size = 4, active threads = 4, queued tasks = 4, completed tasks = 0]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:37) ~[elasticsearch-7.16.1.jar:7.16.1]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:833) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1365) ~[?:?]
If this failure occurs, the transform goes into the stopped state without any message.
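The arithmetic behind the rejection can be reproduced with a plain JDK executor. The following is a minimal sketch, not Elasticsearch code: it assumes a pool of 4 threads, a bounded queue of 4 and an abort policy, matching the numbers in the log above. The class and method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Elasticsearch.
class PoolExhaustionDemo {

    /**
     * Submits `tasks` blocking tasks to a fixed pool and returns the 1-based
     * index of the first rejected task, or -1 if every task was accepted.
     */
    static int firstRejected(int threads, int queueCapacity, int tasks) {
        // Tasks wait on this latch, so no slot frees up while we submit.
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueCapacity),
            new ThreadPoolExecutor.AbortPolicy());
        try {
            for (int i = 1; i <= tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    return i; // all threads busy and the queue is full
                }
            }
            return -1;
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 4 running + 4 queued = 8 accepted, so the 9th is rejected.
        System.out.println(firstRejected(4, 4, 9));
    }
}
```

With 4 threads and 4 queue slots, 8 concurrent starts are accepted and the next one fails, which matches "active threads = 4, queued tasks = 4" in the exception message.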
Sub-Issue: It seems that the persistent task assignment fails without retrying. The thread pool is only temporarily exhausted, so the system can recover if it retries. We should check why it doesn't retry.
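To illustrate why a retry would be enough for a temporarily full pool, here is a minimal sketch of retrying a rejected submission with a doubling delay. It is only an assumption about what a retry could look like, not how the persistent task framework works or how the issue was fixed. All names and parameters are hypothetical.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical helper, not part of Elasticsearch.
class RetryOnRejection {

    /**
     * Tries to hand `task` to `executor`, retrying on rejection with a
     * doubling delay. Returns the attempt number that succeeded and rethrows
     * the last rejection once `maxAttempts` is reached.
     */
    static int executeWithRetry(Executor executor, Runnable task,
                                int maxAttempts, long initialDelayMillis) {
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                executor.execute(task);
                return attempt;
            } catch (RejectedExecutionException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up; the caller should report a failure
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e; // stop retrying when interrupted
                }
                delay *= 2;
            }
        }
    }
}
```

When the retries are exhausted, the rethrown exception gives the caller a place to mark the task as failed with a message instead of stopping silently.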
Sub-Issue: The transform should not go into the stopped state without any message. A transform should never stop on its own; if it gets stopped by the system, it has failed. That means that after the retries are exhausted, the transform should go into the failed state with a proper message.
Transform switches to the generic thread pool after initialization, so the limit only applies during a very short time window.
Mitigation
- start the transform again; as explained, the thread pool limitation only applies to the start phase
- increase the thread pool size for transform_indexing as explained here
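For the first mitigation, a stopped transform can be started again through the start transform API (`POST _transform/<transform_id>/_start`). The sketch below only builds that request with the JDK HTTP client. The base URL and the transform ID are placeholder assumptions, and authentication and TLS are left out.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of any Elasticsearch client library.
class StartTransformRequest {

    /** Builds a POST request against the start transform endpoint. */
    static HttpRequest build(String baseUrl, String transformId) {
        URI uri = URI.create(baseUrl + "/_transform/" + transformId + "/_start");
        return HttpRequest.newBuilder(uri)
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) {
        // "http://localhost:9200" and "my-transform" are placeholders.
        HttpRequest request = build("http://localhost:9200", "my-transform");
        System.out.println(request.method() + " " + request.uri());
        // To send it: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    }
}
```

If many transforms stopped at once, starting them one after another rather than all together avoids hitting the same limit of 8 concurrent starts again.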
Origin
The origin of the issue dates back to the very beginning: the code was inherited from rollup. Rollup removed the extra thread pool with #65958.