Log Warning on Failed Blob Deletes in BlobStoreRepository #40188

original-brownbear · 2019-03-19T05:14:10Z

We should not just debug log these spots, they all can lead to leaked files when snapshot deletion fails
Somewhat relates to [CI] DedicatedClusterSnapshotRestoreIT testSnapshotWithStuckNode failed #39852

* We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails

elasticmachine · 2019-03-19T05:14:11Z

Pinging @elastic/es-distributed

original-brownbear · 2019-03-19T05:16:45Z

server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java

                    // since we are using UUIDs for path names, this won't be an issue for
                    // snapshotting indices of the same name
-                    logger.debug(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +
+                    logger.warn(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +


@ywelsch This spot is really bad, we should take action here to delete the index folder when we know it must be deleted and not just quietly (or now not so quietly :D) skip it because we made mistake earlier.
I'll raise a PR to deal with this situation once I know if we want to proceed on #40144 (having the threadpool here would be very helpful in creating a reasonable solution that won't further slow this code down and block deletes).

ywelsch

I've left one comment

ywelsch · 2019-03-19T08:59:17Z

server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java

                } catch (IOException ioe) {
                    // a different IOException occurred while trying to delete - will just log the issue for now
-                    logger.debug(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +
+                    logger.warn(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +


I wonder if we should continue to log NoSuchFileException with debug (this happens if an earlier clean-up failed), otherwise we might quite a bit of noisy logs. Same applies to other places that were changed in this PR

@ywelsch is that actually true though? Looking at the code we're delete shard data files (which is the only place where we would really have an explosion in the log lines) via org.elasticsearch.common.blobstore.BlobContainer#deleteBlobIgnoringIfNotExists so we have that exception filtered already anyway?

This effectively tries to delete folders, which does not make sense for the blob stores (e.g. Azure, GCP, ...) and will throw a NoSuchFileException as the blob key won't be found.

Yea, but it seems that that will happen at a low enough rate so that it's fine because we don't have that many folders? (I see it's a hand-waving argument, but given that it just took way too long to find #39852 (comment) when that was totally trivial with this change I'm gonna stand by it :D)

it will happen every time a user deletes a snapshot with an index that's not referenced by any other snapshot anymore. Yes, that's not happening at a high rate, but logging warnings for something that happens under normal operations is not good, especially if we can avoid it (which we can I think).

But we use the repository data to determine which indices to delete (that's the whole problem in the first place leading to stale files since we simply stop touching stale index directories) so we won't ever try to delete an index "folder" twice will we?

I'm not following your argument. Are you saying that because this is only erroneously logged as warning once (under normal operations) that it is therefore harmless? Or are you saying that under normal operations we will never log this as a warning?

@ywelsch

Or are you saying that under normal operations we will never log this as a warning?

This should never happen under normal operations imo. If we run into this, we failed to delete all the index files in the above which is clearly a bug (+ every time this happens we leak blobs that are never cleaned up).

This will happen under normal operations. Run GoogleCloudStorageBlobStoreRepositoryTests#testSnapshotAndRestore and see for yourself:

[2019-03-20T08:19:50,218][WARN ][o.e.r.b.BlobStoreRepository] [node_sm0] [krciqciev] index [[ftvgcjc/28AyZ3zgRMmGbTEfac6I7Q]] no longer part of any snapshots in the repository, but failed to clean up its index folder. java.nio.file.NoSuchFileException: Blob [GoogleCloudStorageBlobStoreRepositoryTests/indices/28AyZ3zgRMmGbTEfac6I7Q] does not exist at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.deleteBlob(GoogleCloudStorageBlobStore.java:289) ~[main/:?] at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.deleteBlob(GoogleCloudStorageBlobContainer.java:78) ~[main/:?] at org.elasticsearch.repositories.blobstore.BlobStoreRepository.deleteSnapshot(BlobStoreRepository.java:475) [main/:?] at org.elasticsearch.snapshots.SnapshotsService.lambda$deleteSnapshotFromRepository$10(SnapshotsService.java:1314) [main/:?] at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:677) [main/:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:834) [?:?] [2019-03-20T08:19:50,224][WARN ][o.e.r.b.BlobStoreRepository] [node_sm0] [krciqciev] index [[ddrevic/URmsihElTxi3pfkgeDb4ag]] no longer part of any snapshots in the repository, but failed to clean up its index folder. java.nio.file.NoSuchFileException: Blob [GoogleCloudStorageBlobStoreRepositoryTests/indices/URmsihElTxi3pfkgeDb4ag] does not exist at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.deleteBlob(GoogleCloudStorageBlobStore.java:289) ~[main/:?] at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.deleteBlob(GoogleCloudStorageBlobContainer.java:78) ~[main/:?] at org.elasticsearch.repositories.blobstore.BlobStoreRepository.deleteSnapshot(BlobStoreRepository.java:475) [main/:?] at org.elasticsearch.snapshots.SnapshotsService.lambda$deleteSnapshotFromRepository$10(SnapshotsService.java:1314) [main/:?] at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:677) [main/:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:834) [?:?]

🤦‍♂️ sorry about that, obviously my head didn't think about the blob stores and only about the FS implementation ... adjusted the code accordingly in 5f636ad.

I think the other spots adjusted here are safe though because they already use org.elasticsearch.common.blobstore.BlobContainer#deleteBlobIgnoringIfNotExists for "folder" deletes right?

…bstore

ywelsch

LGTM. Let's get this into 7.0 and 6.7 as well

…bstore

original-brownbear · 2019-03-21T11:58:41Z

@elasticmachine run elasticsearch-ci/1

) * Log Warning on Failed Blob Deletes in BlobStoreRepository * We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails

…40342) * Log Warning on Failed Blob Deletes in BlobStoreRepository * We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails

…40341) * Log Warning on Failed Blob Deletes in BlobStoreRepository * We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails

…40340) * Log Warning on Failed Blob Deletes in BlobStoreRepository * We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails

Log Warning on Failed Blob Deletes in BlobStoreRepository

30a5280

* We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails

original-brownbear added >non-issue :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.2.0 labels Mar 19, 2019

original-brownbear requested a review from ywelsch March 19, 2019 05:14

original-brownbear commented Mar 19, 2019

View reviewed changes

ywelsch reviewed Mar 19, 2019

View reviewed changes

original-brownbear requested a review from ywelsch March 19, 2019 17:02

original-brownbear added 3 commits March 20, 2019 12:24

Merge remote-tracking branch 'elastic/master' into louder-logging-blo…

8114d7c

…bstore

don't warn when blob did not exist

5f636ad

Merge remote-tracking branch 'elastic/master' into louder-logging-blo…

a6d0362

…bstore

ywelsch approved these changes Mar 21, 2019

View reviewed changes

original-brownbear added v7.0.0 v6.7.0 labels Mar 21, 2019

Merge remote-tracking branch 'elastic/master' into louder-logging-blo…

aaa39fc

…bstore

danielmitterdorfer added v6.7.1 and removed v6.7.0 labels Mar 21, 2019

original-brownbear merged commit a19087b into elastic:master Mar 21, 2019

original-brownbear deleted the louder-logging-blobstore branch March 21, 2019 13:58

original-brownbear added the backport pending label Mar 21, 2019

original-brownbear removed the backport pending label Mar 22, 2019

michaelbaamonde added the v7.0.0-rc1 label Mar 25, 2019

michaelbaamonde removed the v7.0.0 label Mar 25, 2019

original-brownbear mentioned this pull request Apr 24, 2019

Implement Bulk Deletes for GCS Repository #41368

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

Log Warning on Failed Blob Deletes in BlobStoreRepository #40188

Log Warning on Failed Blob Deletes in BlobStoreRepository #40188

Uh oh!

Conversation

original-brownbear commented Mar 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elasticmachine commented Mar 19, 2019

Uh oh!

original-brownbear Mar 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ywelsch left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

original-brownbear Mar 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ywelsch left a comment

Choose a reason for hiding this comment

Uh oh!

original-brownbear commented Mar 21, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

original-brownbear commented Mar 19, 2019 •

edited

Loading

original-brownbear Mar 19, 2019 •

edited

Loading

original-brownbear Mar 19, 2019 •

edited

Loading