
Conversation

Contributor

@bleskes bleskes commented Jul 25, 2015

When a node discovers shard content on disk which isn't used, we reach out to all other nodes that are supposed to have the shard active. Only once all of those have confirmed that the shard is active, the shard has no unassigned copies, and no cluster state change has happened in the meantime, do we go ahead and delete the shard folder.
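For illustration, here is a minimal, self-contained Java sketch of the safety conditions described above; all names are made up for this comment and do not correspond to the actual IndicesStore code:

```java
import java.util.Set;

// Hypothetical sketch of the safety checks described above;
// names and types are illustrative, not the actual IndicesStore code.
final class ShardFolderDeletionCheck {

    /**
     * The shard folder may only be deleted when:
     *  1. every node that is supposed to host an active copy confirmed it,
     *  2. the shard has no unassigned copies, and
     *  3. the cluster state did not change while we waited for confirmations.
     */
    static boolean canDeleteShardFolder(Set<String> nodesThatShouldHaveShard,
                                        Set<String> nodesThatConfirmedActive,
                                        int unassignedCopies,
                                        long clusterStateVersionWhenAsked,
                                        long currentClusterStateVersion) {
        if (nodesThatConfirmedActive.containsAll(nodesThatShouldHaveShard) == false) {
            return false; // some copy is not confirmed active yet
        }
        if (unassignedCopies > 0) {
            return false; // an unassigned copy might still need this data
        }
        return currentClusterStateVersion == clusterStateVersionWhenAsked;
    }
}
```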

Currently, after removing a shard, IndicesStore asks the indices service whether it still has any active shard for that index and, if not, it tries to delete the entire index folder (unless on the master node, where we keep the index metadata around). This is wrong: both that check and the protections in IndicesService.deleteIndexStore only make sure that no shard of that index is in use. As a result, we may erroneously delete other unused shard copies on disk without the proper safety guards described above.
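Roughly, the current cleanup path looks like the following sketch (a hedged illustration only; the names are hypothetical and not the real 1.6 code). The only question asked is "does this node still have an active shard of the index?", which says nothing about unused shard copies sitting on disk:

```java
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical sketch of the flawed cleanup decision described above;
// names are illustrative, not the actual IndicesStore/IndicesService code.
abstract class CurrentIndexCleanup {

    void afterShardRemoved(String index, Path indexFolder) throws IOException {
        // Only verifies that this node has no shard of the index *in use*...
        if (hasActiveShards(index) == false && isMasterNode() == false) {
            // ...but wiping the whole index folder also removes unused shard
            // copies on disk, skipping the "ask all other nodes" safety check.
            deleteRecursively(indexFolder);
        }
    }

    abstract boolean hasActiveShards(String index);
    abstract boolean isMasterNode();
    abstract void deleteRecursively(Path folder) throws IOException;
}
```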

Normally this is not a problem, as the missing copy will be recovered from another shard copy on another node (although a shame). However, in extremely rare cases involving multiple node failures/restarts where all shard copies are unavailable (i.e., the shard is red), there are race conditions which can cause all shard copies to be deleted.

Instead, we should base the decision to clean up an index folder on checking that the index directory on disk is empty and contains no shards.
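A hedged sketch of the proposed decision, assuming the usual layout of one sub-folder per shard under the index folder (again, class and method names are hypothetical):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the proposed check: only clean up the index folder
// when it no longer contains any shard sub-folders on disk.
final class ProposedIndexCleanup {

    static boolean canDeleteIndexFolder(Path indexFolder) throws IOException {
        if (Files.exists(indexFolder) == false) {
            return false; // nothing to delete
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(indexFolder)) {
            for (Path child : stream) {
                if (Files.isDirectory(child)) {
                    return false; // a shard folder is still present on disk
                }
            }
        }
        return true; // no shard sub-folders left, safe to clean up
    }
}
```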

Note: this PR is against the 1.6 branch.

bleskes added 2 commits July 25, 2015 21:03
… shard

Contributor


"its"

@martijnvg
Member

@bleskes This bug is sneaky. LGTM. Left a question about deleteIndexStore() itself, but that shouldn't block this PR.

Member


I'd much rather have the logging prior to the deletion, so we can at least see the shard id in the logs if something goes awry during the deletion.

Contributor


There is logging prior to deletion at the trace level inside deleteShardDirectorySafe(....). We can change it to debug, but I think it would be too much noise. Are you suggesting changing this back to trace and changing another one to debug?

Member


Ahh okay, if there's already one inside deleteShardDirectorySafe then leaving it as-is is fine with me :)

@dakrone
Member

dakrone commented Jul 27, 2015

Left a couple of pretty minor comments.

@imotov
Contributor

imotov commented Jul 27, 2015

I am taking over this PR and, because I cannot push into @bleskes's branch, I opened a new PR #12487. I am closing this one. Let's continue the discussion on the new PR. @dakrone could you take a look?
