Skip to content

Conversation

@dnhatn
Copy link
Member

@dnhatn dnhatn commented Dec 8, 2020

Like #56620, this change relies on channel disconnect instead of node leave events to remove parent-task ban markers.

Relates #65443
Relates #56620

@dnhatn dnhatn added >enhancement :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. v8.0.0 v7.11.0 labels Dec 8, 2020
@dnhatn dnhatn requested review from DaveCTurner and jimczi December 8, 2020 21:22
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Dec 8, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a first pass and think there's a missing bit of synchronisation; I left a few other comments too.

private static class Ban {
final String reason;
final boolean perChannel; // TODO: Remove this in 8.0
final Set<ChannelPendingTaskTracker> channels;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mismatch between tracking the Transport.Connection on the parent node and the individual TcpChannel on this node is tricksy. It's all fine I think, neither can remain open forever if the other one closes, I'm just noting for posterity that this isn't obvious and took a bit of thought.

@dnhatn
Copy link
Member Author

dnhatn commented Dec 9, 2020

@DaveCTurner Thanks for looking. I have addressed your comments.

@dnhatn dnhatn requested a review from DaveCTurner December 9, 2020 16:11
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one further comment, otherwise this LGTM I think

synchronized (banedParents) {
synchronized (bannedParents) {
lastDiscoveryNodes = event.state().getNodes();
// Remove all bans that were registered by nodes that are no longer in the cluster state
Copy link
Contributor

@DaveCTurner DaveCTurner Dec 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can move to drop this now and rely entirely on the channel-closing trigger. I don't think this logic ever completely worked, we could receive a task from a departing node after it had left the cluster but while it was still connected to us; moreover it might rejoin the cluster without disconnecting from us and then we'd keep on processing the task it thought it had cancelled.

(it's still needed for BWC, it's better than nothing, but should only apply to bans that aren't tracking their own channels)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this applies to bans that are not tracked by channels (see line 512 where we check ban.getValue().perChannel == false).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove this on 8.0 once we backported this change to 7.x.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, you're right again.

@dnhatn dnhatn requested a review from DaveCTurner December 16, 2020 18:08
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

synchronized (banedParents) {
synchronized (bannedParents) {
lastDiscoveryNodes = event.state().getNodes();
// Remove all bans that were registered by nodes that are no longer in the cluster state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, you're right again.

@dnhatn
Copy link
Member Author

dnhatn commented Dec 16, 2020

@DaveCTurner Thanks so much for your thoughtful reviews on the task cancellation issues. I appreciate that :).

@dnhatn dnhatn merged commit 0a57458 into elastic:master Dec 16, 2020
@dnhatn dnhatn deleted the remove-ban-on-disconnect branch December 16, 2020 18:30
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Dec 16, 2020
Like elastic#56620, this change relies on channel disconnect instead of node
leave events to remove parent-task ban markers.

Relates elastic#65443
Relates elastic#56620
dnhatn added a commit that referenced this pull request Dec 17, 2020
Like #56620, this change relies on channel disconnect instead of node
leave events to remove parent-task ban markers.

Relates #65443
Relates #56620
dnhatn added a commit that referenced this pull request Dec 18, 2020
This BWC is no longer needed after backporting #66066 to 7.x.

Closes #66518
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v7.12.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants