Operator/ingest #89735
Conversation
Move the logic out of TransportMasterNodeAction to support reserved state handlers beyond those that affect cluster state.
Extend reserved cluster state service to allow for handlers to perform async data load operations.
    }

    process(namespace, stateChunk, errorListener);
    // After we have parsed the json content, we give each handler an opportunity to augment the data
This refactoring was required in ReservedClusterStateService because the Ingest transform operations require NodeInfos, which is an async action. For this purpose, I introduced a preTransform step that runs with an async listener, allowing any handler to do additional fetching of data after we've parsed the JSON.
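A minimal sketch of what such a pre-transform hook could look like on a reserved state handler. The interface and method names here are illustrative stand-ins, not the actual handler API in this PR.

    import org.elasticsearch.action.ActionListener;

    // Illustrative handler interface; names are hypothetical, not the PR's actual types.
    interface ReservedStateHandlerSketch<T> {
        String name();

        // Runs after the JSON content is parsed. Handlers that need extra data
        // (e.g. node infos for ingest validation) can fetch it asynchronously and
        // pass the augmented content back through the listener.
        default void preTransform(T parsedContent, ActionListener<T> listener) {
            // Most handlers need nothing extra, so the default is a pass-through.
            listener.onResponse(parsedContent);
        }

        // Applies the (possibly augmented) content to the reserved state.
        void transform(T content) throws Exception;
    }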
Pinging @elastic/es-core-infra (Team:Core/Infra)

Hi @grcevski, I've created a changelog YAML for you.
    try {
        ingestService.validatePipelineRequest(pipeline, pipelines.nodesInfos);
    } catch (Exception e) {
        exceptions.add(e.getMessage());
Here we collect the validation errors and collapse them into a single exception, but I notice the SLM operator PR is piecemeal: the first exception halts execution. Should we add collection of multiple failures to that PR as well?
Ah yeah, sorry, I forgot about that. Collecting all errors is the better user experience; I'll fix that for SLM.
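For reference, the collect-then-fail pattern being discussed, sketched with hypothetical names: every pipeline is validated, the messages are accumulated, and a single exception reports them all at the end.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of collecting all validation failures instead of stopping at the first.
    // Class, record, and method names are hypothetical.
    final class PipelineValidationSketch {

        record Pipeline(String id) {}

        interface Validator {
            void validate(Pipeline pipeline) throws Exception;
        }

        static void validateAll(List<Pipeline> pipelines, Validator validator) {
            List<String> errors = new ArrayList<>();
            for (Pipeline pipeline : pipelines) {
                try {
                    validator.validate(pipeline);
                } catch (Exception e) {
                    // Record the message and keep going, so the user sees every problem at once.
                    errors.add(e.getMessage());
                }
            }
            if (errors.isEmpty() == false) {
                // Collapse everything into one exception, as in the snippet above.
                throw new IllegalStateException("Error processing ingest pipelines: " + String.join(", ", errors));
            }
        }
    }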
dakrone left a comment
I left a few comments. I'm unsure about the preTransform piece: I understand we need the ingest infos, but I wonder if there is a better way to track them so that validation wouldn't require them. What do you think?
        validatePipeline(ingestInfos, request.getId(), config);
    }

    public boolean isNoOpPipelineUpdate(ClusterState state, PutPipelineRequest request, ActionListener<AcknowledgedResponse> listener) {
I don't think the listener is actually used in this method any more? It's awkward that the method both takes a listener and returns a boolean. I'd prefer to remove the listener and make the method static if possible.
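A sketch of what the listener-free, static variant suggested here could look like, assuming it lives inside IngestService where these types are already available. The metadata accessors and parsing helper are assumptions based on typical IngestMetadata usage, not lines taken from this PR.

    // Hypothetical listener-free, static version of the no-op check. The accessors
    // used here (IngestMetadata, PipelineConfiguration, XContentHelper) are assumed
    // from typical usage, not copied from this PR.
    static boolean isNoOpPipelineUpdate(ClusterState state, PutPipelineRequest request) {
        IngestMetadata ingestMetadata = state.metadata().custom(IngestMetadata.TYPE);
        if (ingestMetadata == null) {
            return false;
        }
        PipelineConfiguration existing = ingestMetadata.getPipelines().get(request.getId());
        if (existing == null) {
            return false;
        }
        // The update is a no-op when the submitted source parses to the same
        // configuration map as the pipeline already stored in cluster state.
        var newConfig = XContentHelper.convertToMap(request.getSource(), false, request.getXContentType()).v2();
        return newConfig.equals(existing.getConfigAsMap());
    }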
    if (stateChunk.state().isEmpty()) {
        errorListener.accept(null);
        return;
    }
(Excuse my ignorance of this code.)
How does stateChunk.state() being empty imply that there was a failure and the state couldn't be applied because of an incompatible version? Is there no other situation where the state could be empty?
    for (var handlerEntry : stateChunk.state().entrySet()) {
        handlers.get(handlerEntry.getKey()).preTransform(handlerEntry.getValue(), preTransformListener);
    }
Can you update the javadoc for the parent method to indicate that it runs the pretransforms as well?
    NodesInfoRequest nodesInfoRequest = new NodesInfoRequest();
    nodesInfoRequest.clear();
    nodesInfoRequest.addMetric(NodesInfoRequest.Metric.INGEST.metricName());
    nodeClient.admin().cluster().nodesInfo(nodesInfoRequest, new ActionListener<>() {
This makes me uneasy: what happens if the cluster is overloaded and this request times out? I think we then cancel all file-based changes, right? (Because the error will be thrown from the pre-processing, if I understand correctly.)
Should this timeout be changed or made configurable?
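If the call stays, one mitigation in the direction of this comment would be an explicit timeout on the request; timeout(TimeValue) is inherited from BaseNodesRequest, while wiring the value to a cluster setting is left as an assumption here.

    // Sketch: give the nodes-info call an explicit timeout so an overloaded cluster
    // fails the pre-transform quickly instead of hanging. The concrete value (or a
    // backing cluster setting) is hypothetical; timeout(TimeValue) comes from BaseNodesRequest.
    NodesInfoRequest nodesInfoRequest = new NodesInfoRequest();
    nodesInfoRequest.clear();
    nodesInfoRequest.addMetric(NodesInfoRequest.Metric.INGEST.metricName());
    nodesInfoRequest.timeout(TimeValue.timeValueSeconds(30));
    nodeClient.admin().cluster().nodesInfo(nodesInfoRequest, listener);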
Yeah, good point, let me go back and look at ways we can do without it; it complicates things quite a bit for such a small thing. I'll address the rest of the feedback based on what I manage to do here.
I pushed an update to see if all tests pass. This seems like a simpler and more efficient change, since we don't expect node configuration to change much alongside file-based settings: we keep the node infos cached in the file settings service and only refresh them when nodes have changed. I need to write extra tests for that logic to make sure we set the flag correctly when nodes are added and removed. I initially thought I only needed to track the adds, but removes are needed too, in case the last ingest-capable node is removed.
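A rough sketch of the caching approach described above, with hypothetical class, field, and method names (and an import layout assumed for 8.x): a cluster state listener marks the cache stale whenever nodes join or leave, and the cached response is reused otherwise.

    import org.elasticsearch.action.ActionListener;
    import org.elasticsearch.action.admin.cluster.node.info.NodesInfoRequest;
    import org.elasticsearch.action.admin.cluster.node.info.NodesInfoResponse;
    import org.elasticsearch.client.internal.node.NodeClient;
    import org.elasticsearch.cluster.ClusterChangedEvent;
    import org.elasticsearch.cluster.ClusterStateListener;

    // Rough sketch of caching node infos and refreshing only on membership changes.
    // Names and structure are hypothetical, not the PR's actual FileSettingsService code.
    class CachedNodeInfosSketch implements ClusterStateListener {
        private final NodeClient client;
        private volatile NodesInfoResponse cached;
        private volatile boolean refreshRequired = true;

        CachedNodeInfosSketch(NodeClient client) {
            this.client = client;
        }

        @Override
        public void clusterChanged(ClusterChangedEvent event) {
            // Both additions and removals matter: a removal could take away the
            // last ingest-capable node, leaving the cached infos stale.
            if (event.nodesChanged()) {
                refreshRequired = true;
            }
        }

        void getNodeInfos(ActionListener<NodesInfoResponse> listener) {
            if (refreshRequired == false && cached != null) {
                listener.onResponse(cached);
                return;
            }
            // Reset the flag before the request so a concurrent node change marks it again.
            refreshRequired = false;
            NodesInfoRequest request = new NodesInfoRequest();
            request.clear();
            request.addMetric(NodesInfoRequest.Metric.INGEST.metricName());
            client.admin().cluster().nodesInfo(request, listener.map(response -> {
                cached = response;
                return response;
            }));
        }
    }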
I reworked how the node infos are fetched: it's now done on demand and only if nodes have joined or left the cluster. I also just pushed an update with new tests that ensure the nodeInfos are fetched correctly and at the right time.
dakrone left a comment
LGTM. I left one question, but as long as we have some sort of protection I don't think it's a blocker or anything.
    if (nodeInfosRefreshRequired || nodesInfoResponse == null) {
        var nodesInfoRequest = NodesInfoRequest.requestWithMetrics(NodesInfoRequest.Metric.INGEST);

        clusterAdminClient().nodesInfo(nodesInfoRequest, new ActionListener<>() {
Is processFileSettings protected by a lock or an AtomicBoolean or anything like that? What happens if the file on disk is replaced hundreds of times every second? Will we end up issuing many nodes-info requests, since we don't have any lock to ensure we only retrieve it once?
(I'm not super close to the internals of the file settings code, so it may be that this isn't an issue because we only invoke it once every X seconds; let me know if that's the case.)
This is a really good question. We have only one thread that processes the file change events, and the return value of processFileSettings is a latch, the CountDownLatch waitForCompletion. The caller of processFileSettings waits on this latch for all async actions to complete before it decides to pick up another event and check whether the file has changed since last time. This effectively makes the processing one at a time: until we have successfully written file state or hit an error, we don't try to process again.
The relevant caller code is here: https://github.com/elastic/elasticsearch/pull/89735/files#diff-5a2c0417a5b4bdbcc6873fc853db7a8c531b47536bd41b84e20bb6c7b21ad6d9R334
We also don't trust the events we get from the JDK watch service that the file has changed; we only use them as an indicator that we should check whether the file is different. If neither the OS file key nor the modified timestamp has changed, we skip the event. The reason we can't rely on the JDK watcher is that the series of events it sends differs vastly between OSs.
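A stripped-down, plain-Java sketch of the one-at-a-time behaviour described above; the loop, latch, and method names are simplified stand-ins for the linked FileSettingsService code, not copies of it.

    import java.util.concurrent.CountDownLatch;
    import java.util.function.Consumer;

    // Simplified stand-in for the single watcher thread: each "file may have changed"
    // event is verified, processed asynchronously, and the thread blocks on a latch
    // until that processing completes before it looks at the next event.
    class FileChangeProcessingSketch {

        // Called by the one watcher thread for every event from the JDK watch service.
        void onPossibleFileChange() throws InterruptedException {
            if (fileActuallyChanged() == false) {
                // Watch service events are only a hint; skip if the OS file key and
                // modification timestamp are unchanged.
                return;
            }
            CountDownLatch completion = processFileSettings();
            // Wait for all async actions (parsing, pre-transform, cluster state update)
            // to finish, so at most one settings file is ever in flight.
            completion.await();
        }

        private CountDownLatch processFileSettings() {
            CountDownLatch completion = new CountDownLatch(1);
            // Whatever path the async processing takes, success or error, it must
            // eventually count the latch down.
            startAsyncProcessing(errorOrNull -> completion.countDown());
            return completion;
        }

        private boolean fileActuallyChanged() {
            return true; // placeholder: compare file key and modified timestamp
        }

        private void startAsyncProcessing(Consumer<Exception> onCompletion) {
            onCompletion.accept(null); // placeholder for the real async pipeline
        }
    }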
@elasticsearchmachine run elasticsearch-ci/part-2

Thanks Lee!
This PR adds support for /_ingest/pipeline for file-based settings.
Relates to #89183