[Transform] Transforms can get stopped during a rolling upgrade for many(>8) running transforms

**Affected version**: 7.12-
**Fixed with:** 8.1

Transform supports rolling upgrades: transforms can remain running during the upgrade process, and the built-in failover mechanism relocates the transform persistent task to another node when the node running the transform gets upgraded.

If you have more than 8 running transforms, a problem might occur due to a limited thread pool. During the upgrade, more than 8 transforms might be resumed at once. Transforms start on an internal thread pool (`transform_indexing`) that has a fixed size of `4` and a waiting queue of `4`, so the thread pool might be exhausted when the 9th transform tries to start:

```
persistent task ... failed
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.persistent.NodePersistentTasksExecutor$1@2d5a8eeb on EsThreadPoolExecutor[name = instance-xyz/transform_indexing, queue capacity = 4, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1b757dd4[Running, pool size = 4, active threads = 4, queued tasks = 4, completed tasks = 0]]
	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:37) ~[elasticsearch-7.16.1.jar:7.16.1]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:833) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1365) ~[?:?]
```

If this failure occurs, the transform goes into the `stopped` state without any message.
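The rejection can be reproduced outside Elasticsearch with a plain JDK executor. The following is a minimal sketch, assuming a `java.util.concurrent.ThreadPoolExecutor` with 4 threads, a bounded queue of 4 and an abort policy behaves like the `transform_indexing` pool for this purpose. It is not Elasticsearch code, and the class and method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Plain-JDK analogue of a fixed pool with 4 threads and a waiting queue of 4.
// Hypothetical illustration only, not the Elasticsearch implementation.
final class PoolExhaustionSketch {

    // Submits `tasks` long-running tasks at once and returns the 1-based
    // index of the first rejected task, or -1 if all were accepted.
    static int firstRejected(int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(4),
            new ThreadPoolExecutor.AbortPolicy());
        // Keeps every accepted task busy until the experiment is over.
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 1; i <= tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    return i; // 4 running + 4 queued, no room left
                }
            }
            return -1;
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("first rejected task: " + firstRejected(9));
        // prints "first rejected task: 9"
    }
}
```

With 8 tasks nothing is rejected; the 9th submission fails because all 4 threads are busy and the queue already holds 4 tasks.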

**Sub-Issue:** It seems that the persistent task assignment fails without retrying. The thread pool is only temporarily exhausted, so the system can recover if it retries. We should check why it doesn't retry.

**Sub-Issue:** The transform should not go into the `stopped` state without any message. A transform should never stop on its own; if it gets stopped by the system, its state should be `failed`. That means that after the retries are exhausted, the transform should go into `failed` with a proper message.
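The behavior the two sub-issues ask for, retrying a temporarily rejected start and surfacing a failure once retries are exhausted, could look roughly like the sketch below. This is an illustration under stated assumptions, not the persistent task code: `executeWithRetry`, `maxAttempts` and `backoffMillis` are hypothetical names, and a real implementation would likely schedule the retry instead of sleeping on the calling thread.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical retry wrapper around a bounded executor; not Elasticsearch code.
final class RetryOnRejectionSketch {

    // Tries to hand `task` to `executor`, waiting a little longer after each
    // rejection. Returns the attempt number that succeeded. If the last
    // attempt is rejected as well, the rejection is rethrown so the caller
    // can report a failure with a message instead of stopping silently.
    static int executeWithRetry(Executor executor, Runnable task, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; ; attempt++) {
            try {
                executor.execute(task);
                return attempt;
            } catch (RejectedExecutionException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
                try {
                    Thread.sleep(backoffMillis * attempt); // linear backoff
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated pool that is exhausted for the first two submissions.
        Executor temporarilyFull = r -> {
            if (calls.incrementAndGet() < 3) {
                throw new RejectedExecutionException("pool exhausted");
            }
            r.run();
        };
        int attempt = executeWithRetry(temporarilyFull, () -> System.out.println("transform started"), 5, 10L);
        System.out.println("succeeded on attempt " + attempt);
    }
}
```

Because the exhaustion only lasts for the short start phase, a small number of attempts with a short backoff should be enough in this model.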

Transform switches to a different thread pool (`generic`) after initialization, so the limit only applies to a very short time window.

**Mitigation**

 - Start the transform again. As explained above, the thread pool limitation only applies to the start phase.
 - Increase the thread pool size for `transform_indexing` as explained [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html).
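For the first mitigation, a stopped transform can be restarted through the start transform API (`POST _transform/<transform_id>/_start`). A minimal sketch of building that request with the JDK HTTP client (Java 11+) follows; the base URL `http://localhost:9200` and the transform id `my-transform` are assumptions for illustration, and authentication is omitted.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper that builds the request for the start transform API.
final class StartTransformSketch {

    // baseUrl and transformId are supplied by the caller; the values used
    // in main() are placeholders, not taken from the issue.
    static HttpRequest startRequest(String baseUrl, String transformId) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_transform/" + transformId + "/_start"))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = startRequest("http://localhost:9200", "my-transform");
        System.out.println(request.method() + " " + request.uri());
        // To actually send it against a running cluster:
        // java.net.http.HttpClient.newHttpClient()
        //     .send(request, java.net.http.HttpResponse.BodyHandlers.ofString());
    }
}
```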

**Origin**

The origin of the issue dates back to the very beginning: the code was inherited from rollup. Rollup removed the extra thread pool with #65958.

[Transform] Transforms can get stopped during a rolling upgrade for many(>8) running transforms #81796
