-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Log Warning on Failed Blob Deletes in BlobStoreRepository #40188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Log Warning on Failed Blob Deletes in BlobStoreRepository #40188
Conversation
* We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails
|
Pinging @elastic/es-distributed |
| // since we are using UUIDs for path names, this won't be an issue for | ||
| // snapshotting indices of the same name | ||
| logger.debug(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " + | ||
| logger.warn(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ywelsch This spot is really bad, we should take action here to delete the index folder when we know it must be deleted and not just quietly (or now not so quietly :D) skip it because we made mistake earlier.
I'll raise a PR to deal with this situation once I know if we want to proceed on #40144 (having the threadpool here would be very helpful in creating a reasonable solution that won't further slow this code down and block deletes).
ywelsch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left one comment
| } catch (IOException ioe) { | ||
| // a different IOException occurred while trying to delete - will just log the issue for now | ||
| logger.debug(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " + | ||
| logger.warn(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should continue to log NoSuchFileException with debug (this happens if an earlier clean-up failed), otherwise we might quite a bit of noisy logs. Same applies to other places that were changed in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ywelsch is that actually true though? Looking at the code we're delete shard data files (which is the only place where we would really have an explosion in the log lines) via org.elasticsearch.common.blobstore.BlobContainer#deleteBlobIgnoringIfNotExists so we have that exception filtered already anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This effectively tries to delete folders, which does not make sense for the blob stores (e.g. Azure, GCP, ...) and will throw a NoSuchFileException as the blob key won't be found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, but it seems that that will happen at a low enough rate so that it's fine because we don't have that many folders? (I see it's a hand-waving argument, but given that it just took way too long to find #39852 (comment) when that was totally trivial with this change I'm gonna stand by it :D)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will happen every time a user deletes a snapshot with an index that's not referenced by any other snapshot anymore. Yes, that's not happening at a high rate, but logging warnings for something that happens under normal operations is not good, especially if we can avoid it (which we can I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we use the repository data to determine which indices to delete (that's the whole problem in the first place leading to stale files since we simply stop touching stale index directories) so we won't ever try to delete an index "folder" twice will we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following your argument. Are you saying that because this is only erroneously logged as warning once (under normal operations) that it is therefore harmless? Or are you saying that under normal operations we will never log this as a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or are you saying that under normal operations we will never log this as a warning?
This should never happen under normal operations imo. If we run into this, we failed to delete all the index files in the above which is clearly a bug (+ every time this happens we leak blobs that are never cleaned up).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will happen under normal operations. Run GoogleCloudStorageBlobStoreRepositoryTests#testSnapshotAndRestore and see for yourself:
[2019-03-20T08:19:50,218][WARN ][o.e.r.b.BlobStoreRepository] [node_sm0] [krciqciev] index [[ftvgcjc/28AyZ3zgRMmGbTEfac6I7Q]] no longer part of any snapshots in the repository, but failed to clean up its index folder.
java.nio.file.NoSuchFileException: Blob [GoogleCloudStorageBlobStoreRepositoryTests/indices/28AyZ3zgRMmGbTEfac6I7Q] does not exist
at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.deleteBlob(GoogleCloudStorageBlobStore.java:289) ~[main/:?]
at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.deleteBlob(GoogleCloudStorageBlobContainer.java:78) ~[main/:?]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.deleteSnapshot(BlobStoreRepository.java:475) [main/:?]
at org.elasticsearch.snapshots.SnapshotsService.lambda$deleteSnapshotFromRepository$10(SnapshotsService.java:1314) [main/:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:677) [main/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
[2019-03-20T08:19:50,224][WARN ][o.e.r.b.BlobStoreRepository] [node_sm0] [krciqciev] index [[ddrevic/URmsihElTxi3pfkgeDb4ag]] no longer part of any snapshots in the repository, but failed to clean up its index folder.
java.nio.file.NoSuchFileException: Blob [GoogleCloudStorageBlobStoreRepositoryTests/indices/URmsihElTxi3pfkgeDb4ag] does not exist
at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.deleteBlob(GoogleCloudStorageBlobStore.java:289) ~[main/:?]
at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.deleteBlob(GoogleCloudStorageBlobContainer.java:78) ~[main/:?]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.deleteSnapshot(BlobStoreRepository.java:475) [main/:?]
at org.elasticsearch.snapshots.SnapshotsService.lambda$deleteSnapshotFromRepository$10(SnapshotsService.java:1314) [main/:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:677) [main/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦♂️ sorry about that, obviously my head didn't think about the blob stores and only about the FS implementation ... adjusted the code accordingly in 5f636ad.
I think the other spots adjusted here are safe though because they already use org.elasticsearch.common.blobstore.BlobContainer#deleteBlobIgnoringIfNotExists for "folder" deletes right?
ywelsch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Let's get this into 7.0 and 6.7 as well
|
@elasticmachine run elasticsearch-ci/1 |
Uh oh!
There was an error while loading. Please reload this page.