Log Warning on Failed Blob Deletes in BlobStoreRepository #40188
Changes from all commits: 30a5280, 8114d7c, 5f636ad, a6d0362, aaa39fc
@@ -472,17 +472,17 @@ public void deleteSnapshot(SnapshotId snapshotId, long repositoryStateId) {
         final BlobContainer indicesBlobContainer = blobStore().blobContainer(basePath().add("indices"));
         for (final IndexId indexId : indicesToCleanUp) {
             try {
-                indicesBlobContainer.deleteBlob(indexId.getId());
+                indicesBlobContainer.deleteBlobIgnoringIfNotExists(indexId.getId());
             } catch (DirectoryNotEmptyException dnee) {
                 // if the directory isn't empty for some reason, it will fail to clean up;
                 // we'll ignore that and accept that cleanup didn't fully succeed.
                 // since we are using UUIDs for path names, this won't be an issue for
                 // snapshotting indices of the same name
-                logger.debug(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +
+                logger.warn(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +
                     "but failed to clean up its index folder due to the directory not being empty.", metadata.name(), indexId), dnee);
             } catch (IOException ioe) {
                 // a different IOException occurred while trying to delete - will just log the issue for now
-                logger.debug(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +
+                logger.warn(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +
|
Contributor:
I wonder if we should continue to log

Contributor (Author):
@ywelsch is that actually true, though? Looking at the code, we're deleting shard data files (which is the only place where we would really have an explosion in the log lines) via

Contributor:
This effectively tries to delete folders, which does not make sense for the blob stores (e.g. Azure, GCP, ...) and will throw a

Contributor (Author):
Yeah, but it seems that will happen at a low enough rate that it's fine, because we don't have that many folders? (I see it's a hand-waving argument, but given that it took way too long to find #39852 (comment) when it would have been totally trivial with this change, I'm going to stand by it :D)

Contributor:
It will happen every time a user deletes a snapshot with an index that's not referenced by any other snapshot anymore. Yes, that's not happening at a high rate, but logging warnings for something that happens under normal operations is not good, especially if we can avoid it (which I think we can).

Contributor (Author):
But we use the repository data to determine which indices to delete (that's the whole problem in the first place leading to stale files, since we simply stop touching stale index directories), so we won't ever try to delete an index "folder" twice, will we?

Contributor:
I'm not following your argument. Are you saying that because this is only erroneously logged as a warning once (under normal operations), it is therefore harmless? Or are you saying that under normal operations we will never log this as a warning?

Contributor (Author):
This should never happen under normal operations, imo. If we run into this, we failed to delete all the index files above, which is clearly a bug (plus every time this happens we leak blobs that are never cleaned up).

Contributor:
This will happen under normal operations. Run

Contributor (Author):
🤦♂️ Sorry about that, obviously my head didn't think about the blob stores and only about the FS implementation ... adjusted the code accordingly in 5f636ad. I think the other spots adjusted here are safe, though, because they already use
                     "but failed to clean up its index folder.", metadata.name(), indexId), ioe);
             }
         }
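For readers unfamiliar with the blob-store abstraction: the switch from `deleteBlob` to `deleteBlobIgnoringIfNotExists` matters because, as the thread above notes, object stores (Azure, GCP, ...) have no real folders, so the entry being deleted may simply not exist. Below is a minimal sketch of what an "ignore if missing" delete variant amounts to; it is not the actual Elasticsearch `BlobContainer` interface, and the choice of `NoSuchFileException` as the "blob does not exist" signal is an assumption of this sketch.

```java
import java.io.IOException;
import java.nio.file.NoSuchFileException;

// Minimal sketch, assuming NoSuchFileException signals "blob does not exist".
// Not the real org.elasticsearch.common.blobstore.BlobContainer code.
interface SimpleBlobContainer {

    /** Deletes the blob, failing if it does not exist. */
    void deleteBlob(String blobName) throws IOException;

    /** Deletes the blob, treating "already gone" as success so cleanup stays idempotent. */
    default void deleteBlobIgnoringIfNotExists(String blobName) throws IOException {
        try {
            deleteBlob(blobName);
        } catch (NoSuchFileException e) {
            // The blob was never written or was already removed; for cleanup purposes
            // that is the desired end state, so the error is intentionally swallowed.
        }
    }
}
```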
|
|
@@ -832,7 +832,7 @@ private long listBlobsToGetLatestIndexId() throws IOException {
                 } catch (NumberFormatException nfe) {
                     // the index- blob wasn't of the format index-N where N is a number,
                     // no idea what this blob is but it doesn't belong in the repository!
-                    logger.debug("[{}] Unknown blob in the repository: {}", metadata.name(), blobName);
+                    logger.warn("[{}] Unknown blob in the repository: {}", metadata.name(), blobName);
                 }
             }
             return latest;
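For context on this hunk: `listBlobsToGetLatestIndexId` scans repository-root blob names of the form `index-N` and keeps the highest `N`; anything that does not parse now gets a WARN instead of a DEBUG line. The following self-contained sketch shows that scan; the helper class and the -1 "not found" sentinel are assumptions of this sketch, not the repository's actual code.

```java
import java.util.Set;

// Hedged sketch of the index-N scan: given blob names from the repository root,
// pick the highest generation N from names like "index-0", "index-7", ...
// Unknown names are skipped (the PR changes that skip from debug to warn logging).
final class LatestIndexIdSketch {
    private static final String INDEX_FILE_PREFIX = "index-"; // assumed prefix, matching the "index-N" naming in the diff

    static long latestGeneration(Set<String> rootBlobNames) {
        long latest = -1L; // -1 meaning "no index-N blob found" is a convention of this sketch
        for (String blobName : rootBlobNames) {
            if (blobName.startsWith(INDEX_FILE_PREFIX) == false) {
                continue;
            }
            try {
                long generation = Long.parseLong(blobName.substring(INDEX_FILE_PREFIX.length()));
                latest = Math.max(latest, generation);
            } catch (NumberFormatException nfe) {
                // Not of the form index-N; the real code now logs this at WARN.
            }
        }
        return latest;
    }
}

// Example: latestGeneration(Set.of("index-0", "index-3", "incompatible-snapshots")) returns 3.
```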
|
|
@@ -975,7 +975,7 @@ public void delete() {
             try {
                 indexShardSnapshotFormat.delete(blobContainer, snapshotId.getUUID());
             } catch (IOException e) {
-                logger.debug("[{}] [{}] failed to delete shard snapshot file", shardId, snapshotId);
+                logger.warn(new ParameterizedMessage("[{}] [{}] failed to delete shard snapshot file", shardId, snapshotId), e);
             }
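The meaningful part of this hunk is not only the level bump: the caught `IOException` is now passed to the logger, so the failure's stack trace lands in the repository logs. A small sketch of that Log4j 2 pattern follows; the class and method names are placeholders, not code from the PR.

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.message.ParameterizedMessage;

import java.io.IOException;

// Sketch of the logging pattern from the diff: the ParameterizedMessage carries the
// "{}" placeholders, and the exception is passed separately so its stack trace is logged.
class DeleteLoggingSketch {
    private static final Logger logger = LogManager.getLogger(DeleteLoggingSketch.class);

    void logFailedDelete(String shardId, String snapshotId, IOException e) {
        // Without the trailing 'e' argument only the message is printed;
        // with it, the cause and stack trace appear in the repository logs.
        logger.warn(new ParameterizedMessage("[{}] [{}] failed to delete shard snapshot file", shardId, snapshotId), e);
    }
}
```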
|
|
             // Build a list of snapshots that should be preserved
|
|
@@ -1135,7 +1135,7 @@ protected Tuple<BlobStoreIndexShardSnapshots, Integer> buildBlobStoreIndexShardS
                     logger.warn(() -> new ParameterizedMessage("failed to read index file [{}]", file), e);
                 }
             } else if (blobKeys.isEmpty() == false) {
-                logger.debug("Could not find a readable index-N file in a non-empty shard snapshot directory [{}]", blobContainer.path());
+                logger.warn("Could not find a readable index-N file in a non-empty shard snapshot directory [{}]", blobContainer.path());
             }
|
|
             // We couldn't load the index file - falling back to loading individual snapshots
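Around this hunk, the code prefers the highest readable `index-N` blob in the shard directory and otherwise falls back to loading individual snapshot files; the change only makes the "non-empty directory but no readable index-N" case visible at WARN. The sketch below shows that select-then-fall-back shape under simplified assumptions; the reader interface and return type are placeholders, not the real `BlobStoreIndexShardSnapshots` plumbing.

```java
import java.io.IOException;
import java.util.Optional;
import java.util.Set;

// Hedged sketch only: mirrors the "use the newest index-N if it is readable,
// otherwise fall back" flow with simplified types.
final class ShardIndexFileSketch {

    /** Placeholder for reading and parsing one index-N blob. */
    interface IndexFileReader {
        String read(String blobName) throws IOException;
    }

    static Optional<String> loadLatestReadable(Set<String> blobKeys, IndexFileReader reader) {
        long latest = -1L;
        for (String name : blobKeys) {
            if (name.startsWith("index-")) {
                try {
                    latest = Math.max(latest, Long.parseLong(name.substring("index-".length())));
                } catch (NumberFormatException nfe) {
                    // ignore blobs that merely look like index files
                }
            }
        }
        if (latest >= 0) {
            try {
                return Optional.of(reader.read("index-" + latest));
            } catch (IOException e) {
                // real code: logs "failed to read index file [...]" at WARN and falls through
            }
        } else if (blobKeys.isEmpty() == false) {
            // real code: the newly WARN-logged "no readable index-N in a non-empty directory" case
        }
        // Fall back to reconstructing state from individual snapshot blobs (elided here).
        return Optional.empty();
    }
}
```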
|
|
Contributor (Author):
@ywelsch This spot is really bad; we should take action here to delete the index folder when we know it must be deleted, and not just quietly (or, now, not so quietly :D) skip it because we made a mistake earlier.

I'll raise a PR to deal with this situation once I know whether we want to proceed on #40144 (having the threadpool here would be very helpful in creating a reasonable solution that won't further slow this code down and block deletes).