Skip to content

Conversation

@original-brownbear
Copy link
Contributor

@original-brownbear original-brownbear commented Mar 19, 2019

* We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

// since we are using UUIDs for path names, this won't be an issue for
// snapshotting indices of the same name
logger.debug(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +
logger.warn(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +
Copy link
Contributor Author

@original-brownbear original-brownbear Mar 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ywelsch This spot is really bad, we should take action here to delete the index folder when we know it must be deleted and not just quietly (or now not so quietly :D) skip it because we made mistake earlier.
I'll raise a PR to deal with this situation once I know if we want to proceed on #40144 (having the threadpool here would be very helpful in creating a reasonable solution that won't further slow this code down and block deletes).

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left one comment

} catch (IOException ioe) {
// a different IOException occurred while trying to delete - will just log the issue for now
logger.debug(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +
logger.warn(() -> new ParameterizedMessage("[{}] index [{}] no longer part of any snapshots in the repository, " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should continue to log NoSuchFileException with debug (this happens if an earlier clean-up failed), otherwise we might quite a bit of noisy logs. Same applies to other places that were changed in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ywelsch is that actually true though? Looking at the code we're delete shard data files (which is the only place where we would really have an explosion in the log lines) via org.elasticsearch.common.blobstore.BlobContainer#deleteBlobIgnoringIfNotExists so we have that exception filtered already anyway?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This effectively tries to delete folders, which does not make sense for the blob stores (e.g. Azure, GCP, ...) and will throw a NoSuchFileException as the blob key won't be found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, but it seems that that will happen at a low enough rate so that it's fine because we don't have that many folders? (I see it's a hand-waving argument, but given that it just took way too long to find #39852 (comment) when that was totally trivial with this change I'm gonna stand by it :D)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will happen every time a user deletes a snapshot with an index that's not referenced by any other snapshot anymore. Yes, that's not happening at a high rate, but logging warnings for something that happens under normal operations is not good, especially if we can avoid it (which we can I think).

Copy link
Contributor Author

@original-brownbear original-brownbear Mar 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we use the repository data to determine which indices to delete (that's the whole problem in the first place leading to stale files since we simply stop touching stale index directories) so we won't ever try to delete an index "folder" twice will we?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following your argument. Are you saying that because this is only erroneously logged as warning once (under normal operations) that it is therefore harmless? Or are you saying that under normal operations we will never log this as a warning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ywelsch

Or are you saying that under normal operations we will never log this as a warning?

This should never happen under normal operations imo. If we run into this, we failed to delete all the index files in the above which is clearly a bug (+ every time this happens we leak blobs that are never cleaned up).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will happen under normal operations. Run GoogleCloudStorageBlobStoreRepositoryTests#testSnapshotAndRestore and see for yourself:

[2019-03-20T08:19:50,218][WARN ][o.e.r.b.BlobStoreRepository] [node_sm0] [krciqciev] index [[ftvgcjc/28AyZ3zgRMmGbTEfac6I7Q]] no longer part of any snapshots in the repository, but failed to clean up its index folder.
java.nio.file.NoSuchFileException: Blob [GoogleCloudStorageBlobStoreRepositoryTests/indices/28AyZ3zgRMmGbTEfac6I7Q] does not exist
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.deleteBlob(GoogleCloudStorageBlobStore.java:289) ~[main/:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.deleteBlob(GoogleCloudStorageBlobContainer.java:78) ~[main/:?]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.deleteSnapshot(BlobStoreRepository.java:475) [main/:?]
	at org.elasticsearch.snapshots.SnapshotsService.lambda$deleteSnapshotFromRepository$10(SnapshotsService.java:1314) [main/:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:677) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
[2019-03-20T08:19:50,224][WARN ][o.e.r.b.BlobStoreRepository] [node_sm0] [krciqciev] index [[ddrevic/URmsihElTxi3pfkgeDb4ag]] no longer part of any snapshots in the repository, but failed to clean up its index folder.
java.nio.file.NoSuchFileException: Blob [GoogleCloudStorageBlobStoreRepositoryTests/indices/URmsihElTxi3pfkgeDb4ag] does not exist
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.deleteBlob(GoogleCloudStorageBlobStore.java:289) ~[main/:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.deleteBlob(GoogleCloudStorageBlobContainer.java:78) ~[main/:?]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.deleteSnapshot(BlobStoreRepository.java:475) [main/:?]
	at org.elasticsearch.snapshots.SnapshotsService.lambda$deleteSnapshotFromRepository$10(SnapshotsService.java:1314) [main/:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:677) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦‍♂️ sorry about that, obviously my head didn't think about the blob stores and only about the FS implementation ... adjusted the code accordingly in 5f636ad.

I think the other spots adjusted here are safe though because they already use org.elasticsearch.common.blobstore.BlobContainer#deleteBlobIgnoringIfNotExists for "folder" deletes right?

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let's get this into 7.0 and 6.7 as well

@original-brownbear
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/1

@original-brownbear original-brownbear merged commit a19087b into elastic:master Mar 21, 2019
@original-brownbear original-brownbear deleted the louder-logging-blobstore branch March 21, 2019 13:58
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 22, 2019
)

* Log Warning on Failed Blob Deletes in BlobStoreRepository
* We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 22, 2019
)

* Log Warning on Failed Blob Deletes in BlobStoreRepository
* We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 22, 2019
)

* Log Warning on Failed Blob Deletes in BlobStoreRepository
* We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails
original-brownbear added a commit that referenced this pull request Mar 24, 2019
…40342)

* Log Warning on Failed Blob Deletes in BlobStoreRepository
* We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails
original-brownbear added a commit that referenced this pull request Mar 24, 2019
…40341)

* Log Warning on Failed Blob Deletes in BlobStoreRepository
* We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails
original-brownbear added a commit that referenced this pull request Mar 25, 2019
…40340)

* Log Warning on Failed Blob Deletes in BlobStoreRepository
* We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants