@Tim-Brooks
Contributor

This is related to #42612. Currently the reindexing transport action
creates a task on the local coordinator node. Unfortunately this is not
resilient to coordinator node failures. This commit adds a new action
that creates a reindexing job as a persistent task.

@Tim-Brooks Tim-Brooks added >enhancement v8.0.0 :Distributed Indexing/Reindex Issues relating to reindex that are not caused by issues further down labels Jun 19, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@Tim-Brooks Tim-Brooks changed the title Persistent reindex Make reindexing managed by a persistent task Jun 19, 2019
@Tim-Brooks Tim-Brooks changed the base branch from master to reindex_v2 June 19, 2019 16:38
@Tim-Brooks
Contributor Author

This PR still needs a few cleanups. I will assign reviewers and comment when it is ready.

@Tim-Brooks
Contributor Author

@henningandersen @ywelsch I have adjusted this PR to reflect the change in direction: the task ID is now returned from the allocated persistent task. I have not yet adjusted it to work with rethrottling, but the other tests appear to be passing.

Contributor

@henningandersen henningandersen left a comment

LGTM.

This approach looks good. Apart from minor comments, I think it is ready to merge to the feature branch. Feel free to defer some of them to follow-ups if that is easier.

* task can't totally validate until it starts but this is better than
* nothing.
*/
ActionRequestValidationException validationException = internal.getReindexRequest().validate();
Contributor

This can go outside the if now.

}
}

public static class Response extends AcknowledgedResponse {
Contributor

It seems we either return acknowledged=true or an exception. Could this just be an ActionResponse then?


import java.util.List;

public class TransportGetReindexJobTaskAction extends TransportTasksAction<ReindexTask, GetReindexJobTaskAction.Request,
Contributor

This looks unused now?

rethrottle(logger, clusterService.localNode().getId(), client, bulkByScrollTask, request.getRequestsPerSecond(), listener);
} else if (task instanceof ReindexTask) {
BulkByScrollTask childTask = null;
for (Task task1 : taskManager.getTasks().values()) {
Contributor

task1 rename to childCandidate?
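
A minimal sketch of how the loop might read with the suggested name, assuming a simplified stand-in for the task types. `SimpleTask`, `findChild`, and their fields are hypothetical and are not the Elasticsearch classes; the `bulkByScroll` flag stands in for an `instanceof BulkByScrollTask` check.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class ChildTaskLookup {

    /** Hypothetical stand-in for a task; not the Elasticsearch Task class. */
    static final class SimpleTask {
        final long id;
        final Long parentId;        // null when the task has no parent
        final boolean bulkByScroll; // stands in for an instanceof BulkByScrollTask check

        SimpleTask(long id, Long parentId, boolean bulkByScroll) {
            this.id = id;
            this.parentId = parentId;
            this.bulkByScroll = bulkByScroll;
        }
    }

    /** Returns the first bulk-by-scroll task whose parent is parentId, if any. */
    static Optional<SimpleTask> findChild(Map<Long, SimpleTask> tasks, long parentId) {
        for (SimpleTask childCandidate : tasks.values()) {
            boolean isChild = childCandidate.parentId != null && childCandidate.parentId == parentId;
            if (childCandidate.bulkByScroll && isChild) {
                return Optional.of(childCandidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<Long, SimpleTask> tasks = new LinkedHashMap<>();
        tasks.put(1L, new SimpleTask(1L, null, false)); // the parent task
        tasks.put(2L, new SimpleTask(2L, 1L, true));    // its bulk-by-scroll child
        System.out.println(findChild(tasks, 1L).map(t -> t.id).orElse(-1L));
    }
}
```

With the loop variable named `childCandidate`, the body reads as a filter over candidates rather than a second, unrelated `task`.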

return jobException;
}

public TaskId getTaskId() {
Contributor

Maybe rename to getEphemeralTaskId(), it could make call-site code a little easier to read (at least I mixed it up with the persistent task id).

  builder.field("running_time_in_nanos", runningTimeNanos);
  builder.field("cancellable", cancellable);
- if (parentTaskId.isSet()) {
+ if (parentTaskId.isSet() && parentTaskId.getNodeId().equals("cluster") == false) {
Contributor

Let us add a todo here, to ensure we revisit this. I think this means that cluster: parent ids no longer appear in the _tasks output, which could harm diagnostics/troubleshooting.
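
To illustrate the concern, here is a small sketch of the effect of the changed condition, under stated assumptions. `ParentId`, `render`, the `-1` "unset" marker, and the `parent_task_id` field name are hypothetical stand-ins chosen for the example, not the actual Elasticsearch code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ParentTaskIdRendering {

    /** Hypothetical stand-in for a parent task id of the form nodeId:id. */
    static final class ParentId {
        final String nodeId;
        final long id;

        ParentId(String nodeId, long id) {
            this.nodeId = nodeId;
            this.id = id;
        }

        boolean isSet() {
            return id != -1L; // assumption: -1 marks "no parent"
        }

        @Override
        public String toString() {
            return nodeId + ":" + id;
        }
    }

    /** Renders the task fields, omitting parents whose node id is "cluster". */
    static Map<String, Object> render(boolean cancellable, ParentId parent) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("cancellable", cancellable);
        // TODO: revisit; "cluster" parent ids are dropped from the output here,
        // which could make troubleshooting harder.
        if (parent.isSet() && parent.nodeId.equals("cluster") == false) {
            fields.put("parent_task_id", parent.toString());
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(render(true, new ParentId("node1", 5)));
        System.out.println(render(true, new ParentId("cluster", 5)));
    }
}
```

In this sketch the second call produces no parent field at all, which is the loss of diagnostic information the TODO is meant to track.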

@Tim-Brooks
Contributor Author

@ywelsch - This is probably ready for another look. I have addressed Henning's comments. The primary issue currently is that, because a task submits another task, there are races with the child tasks being available for a follow-up rethrottle action. For testing purposes, I have resolved this with a spin loop in the rethrottle action that waits for all the tasks to be available.

I think the fix is to extract the BulkByScrollTask and reindex logic out to a different place that allows the entire setup to be synchronous with the persistent task starting (so a single ephemeral task). However, this work will continue to make this PR larger and extend the scope. I guess I just want to know if we should include that in this PR, or add that as a meta issue (it would probably be the next task I would work on, but it would be in a separate PR).
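
For reference, the kind of wait described above can be sketched as a bounded poll rather than an unbounded spin. This is only an illustration under stated assumptions: `BoundedWait`, `awaitAvailable`, and the lookup supplier are hypothetical names, and this is not the code in the PR.

```java
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class BoundedWait {

    /**
     * Polls lookup until it yields a value or the timeout elapses.
     * Returns Optional.empty() on timeout or if the thread is interrupted.
     */
    static <T> Optional<T> awaitAvailable(Supplier<Optional<T>> lookup, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            Optional<T> found = lookup.get();
            if (found.isPresent()) {
                return found;
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty();
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Hypothetical lookup: the "child task" only becomes visible on the third poll.
        Supplier<Optional<String>> lookup =
            () -> calls.incrementAndGet() >= 3 ? Optional.of("child-task") : Optional.<String>empty();
        System.out.println(awaitAvailable(lookup, 1000, 5));
    }
}
```

A deadline keeps the wait from hanging if the child task never appears, but it remains a workaround; making the setup synchronous with the persistent task starting, as proposed above, would remove the race itself.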

Contributor

@ywelsch ywelsch left a comment

I've left a few more minor comments, and one item (rethrottle) that we probably need to discuss.

@Tim-Brooks Tim-Brooks requested a review from ywelsch July 17, 2019 16:57
@Tim-Brooks
Contributor Author

@ywelsch I made the changes.

Contributor

@ywelsch ywelsch left a comment

LGTM

@Tim-Brooks Tim-Brooks merged commit 480c545 into elastic:reindex_v2 Jul 18, 2019
@Tim-Brooks Tim-Brooks deleted the persistent_reindex branch April 30, 2020 18:24

Labels

:Distributed Indexing/Reindex Issues relating to reindex that are not caused by issues further down >enhancement v8.0.0-alpha1

5 participants