-
Notifications
You must be signed in to change notification settings - Fork 25.6k
IndicesStore shouldn't try to delete index after deleting a shard #12463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… shard When a node discovers shard content on disk which isn't used, we reach out to all other nodes that supposed to have the shard active. Only once all of those have confirmed the shard active, the shard has no unassigned copies *and* no cluster state change have happened in the mean while, do we go and delete the shard folder. Currently, after removing a shard, the IndicesStores checks the indices services if that has no more shard active for this index and if so, it tries to delete the entire index folder (unless on master node, where we keep the index metadata around). This is wrong as both the check and the protections in IndicesServices.deleteIndexStore make sure that there isn't any shard *in use* from that index. However, it may be the we erroneously delete other unused shard copies on disk, without the proper safety guards described above. Normally, this is not a problem as the missing copy will be recovered from another shard copy on another node (although a shame). However, in extremely rare cases involving multiple node failures/restarts where all shard copies are not available (i.e., shard is red) there are race conditions which can cause all shard copies to be deleted. Instead, we should change the decision to clean up an index folder to based on checking the index directory for being empty and containing no shards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"its"
|
@bleskes This bug is sneaky. LGTM. Left a question about deleteIndexStore() itself, but that shouldn't block this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd much rather have the logging prior to the deletion, so we can at least seen the shard id in logs if something goes awry during the deletion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is logging prior to deletion on the trace level inside deleteShardDirectorySafe(....). We can change it to debug, but I think it would be too much noise. Are you suggesting changing this back to trace and changing another on to debug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh okay, if there's already one inside deleteShardDirectorySafe then leaving it as-is is fine with me :)
|
Left a couple a pretty minor comments. |
When a node discovers shard content on disk which isn't used, we reach out to all other nodes that supposed to have the shard active. Only once all of those have confirmed the shard active, the shard has no unassigned copies and no cluster state change have happened in the mean while, do we go and delete the shard folder.
Currently, after removing a shard, the IndicesStores checks the indices services if that has no more shard active for this index and if so, it tries to delete the entire index folder (unless on master node, where we keep the index metadata around). This is wrong as both the check and the protections in IndicesServices.deleteIndexStore make sure that there isn't any shard in use from that index. However, it may be the we erroneously delete other unused shard copies on disk, without the proper safety guards described above.
Normally, this is not a problem as the missing copy will be recovered from another shard copy on another node (although a shame). However, in extremely rare cases involving multiple node failures/restarts where all shard copies are not available (i.e., shard is red) there are race conditions which can cause all shard copies to be deleted.
Instead, we should change the decision to clean up an index folder to be based on checking the index directory for being empty and containing no shards.
Note: this PR is against the 1.6 branch.