Skip to content

Conversation

@davidkyle
Copy link
Member

PersistentTasksClusterService decides if a task should be reassigned by checking there is a node in the cluster with the same Id instead of comparing the ephemeral Id or the actual DiscoveryNode.

https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/persistent/PersistentTasksClusterService.java#L389

During a rolling restart PersistentTasksClusterService may not observe a node exit and re-enter the cluster. When the node returns the assignment appears valid as a node with the correct Id exists in the cluster but its departure and hence the un-assignment of the task was missed. This happens most often during the rolling upgrade tests where there are 3 nodes and minimum master nodes is set to 3. The solution is to un-assign the persistent tasks when the executor node leaves the cluster.

There is a spelling mistake in the use of deassociate... the correct options are disassociate or dissociate. I've followed the existing pattern for this change and will correct the spelling in a follow up PR.

@davidkyle davidkyle added :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. v7.0.0 v6.7.0 labels Jan 21, 2019
@davidkyle davidkyle requested a review from ywelsch January 21, 2019 14:11
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just use buildNewFakeTransportAddress() (comes from ESTestCase)

@davidkyle davidkyle force-pushed the unassign-ptask-on-node-loss branch from f48d79d to 1616eaa Compare January 22, 2019 09:58
@davidkyle davidkyle merged commit 3fad1ee into elastic:master Jan 22, 2019
@davidkyle davidkyle deleted the unassign-ptask-on-node-loss branch January 22, 2019 12:44
davidkyle added a commit that referenced this pull request Jan 22, 2019
PersistentTasksClusterService decides if a task should be reassigned by
checking there is a node in the cluster with the same Id. If a node is
restarted PersistentTasksClusterService may not observe the change and
decide the task still has a valid assignment because the node's
ephemeral Id is not used in that decision. This change un-assigns tasks
as the nodes in the cluster change.
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 22, 2019
* elastic/master: (43 commits)
  Remove remaining occurances of "include_type_name=true" in docs (elastic#37646)
  SQL: Return Intervals in SQL format for CLI (elastic#37602)
  Publish to masters first (elastic#37673)
  Un-assign persistent tasks as nodes exit the cluster (elastic#37656)
  Fail start of non-data node if node has data (elastic#37347)
  Use cancel instead of timeout for aborting publications (elastic#37670)
  Follow stats api should return a 404 when requesting stats for a non existing index (elastic#37220)
  Remove deprecated FieldNamesFieldMapper.Builder#index (elastic#37305)
  Document that date math is locale independent
  Bootstrap a Zen2 cluster once quorum is discovered (elastic#37463)
  Upgrade to lucene-8.0.0-snapshot-83f9835. (elastic#37668)
  Mute failing test
  Fix java time formatters that round up (elastic#37604)
  Removes awaits fix as the fix is in. (elastic#37676)
  Mute failing test
  Ensure that max seq # is equal to the global checkpoint when creating ReadOnlyEngines (elastic#37426)
  Mute failing discovery disruption tests
  Add note about how the body is referenced (elastic#33935)
  Define constants for REST requests endpoints in tests (elastic#37610)
  Make prepare engine step of recovery source non-blocking (elastic#37573)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. v6.7.0 v7.0.0-beta1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants