Cleaner Handling of Store Refcount in BlobStoreRepository #47560
Conversation
If a shard gets closed, we properly abort its snapshot before closing it. In that case the async tasks should not throw a confusing exception about trying to increment the reference on an already closed store when the snapshot is already aborted. Also, added an assertion to make sure that aborts are in fact the only situation in which we run into a concurrently closed store.
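For illustration, a minimal sketch of the guard described above; the class name and the snapshotAborted/uploadFile parameters are hypothetical stand-ins, not the actual BlobStoreRepository members:

    // Hedged sketch of the refcount guard described in this PR; only
    // Store.tryIncRef()/decRef() are real API, everything else is illustrative.
    import java.util.function.BooleanSupplier;

    import org.elasticsearch.index.store.Store;

    final class StoreRefGuardSketch {

        static void snapshotOneFile(Store store, BooleanSupplier snapshotAborted, Runnable uploadFile) {
            if (store.tryIncRef() == false) {
                // The shard's store was concurrently closed. Because a snapshot is
                // aborted before its shard is closed, the only legitimate way to get
                // here is an already aborted snapshot: assert that and return quietly
                // instead of surfacing a confusing "already closed" exception.
                assert snapshotAborted.getAsBoolean() : "store closed but snapshot not aborted";
                return;
            }
            try {
                uploadFile.run();
            } finally {
                store.decRef();
            }
        }
    }

The point is simply that a failed tryIncRef is treated as an expected, quiet outcome when the snapshot is already aborted, and anything else trips the assertion.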
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
Diff excerpt under discussion:

    try {
        if (alreadyFailed.get() == false) {
            snapshotFile(snapshotFileInfo, indexId, shardId, snapshotId, snapshotStatus, store);
            if (store.tryIncRef()) {
This whole loop is kinda awkward to begin with ... makes me wonder if we shouldn't just run this on the generic pool and make the parallelism for snapshots configurable explicitly, exactly like we do for recoveries ...
"make the parallelism for snapshots configurable explicitly, exactly like we do for recoveries"

Not sure I follow you :(
Sorry, badly explained :)

I find this whole loop over all the files really strange. We currently create one Runnable for each file to upload individually and then enqueue all of the runnables. That forces us to use the strange alreadyFailed flag to avoid crazy exceptions, and also to increment and decrement the ref count on the store for each file individually.

It seems like it would be more correct/simpler and less hacky to simply have a queue of files and have workers pull from that queue until it's empty. Then each worker can just get that reference once, and we don't have to run all N tasks for N files even if the first file fails uploading.
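For illustration, a rough sketch of that queue-draining idea; all names and the executor wiring are hypothetical, not the actual BlobStoreRepository implementation:

    // Purely illustrative sketch of workers draining a shared queue of files;
    // only Store.tryIncRef()/decRef() are real API, the rest is hypothetical.
    import java.util.Queue;
    import java.util.concurrent.Executor;
    import java.util.concurrent.atomic.AtomicBoolean;

    import org.elasticsearch.index.store.Store;

    final class SnapshotQueueSketch {

        static void snapshotFiles(Store store, Queue<String> filesToUpload, Executor executor, int workers) {
            final AtomicBoolean failed = new AtomicBoolean();
            for (int i = 0; i < workers; i++) {
                executor.execute(() -> {
                    // Each worker acquires the store reference exactly once ...
                    if (store.tryIncRef() == false) {
                        return; // store concurrently closed, i.e. the snapshot was aborted
                    }
                    try {
                        String file;
                        // ... then drains the queue until it is empty or another worker failed,
                        // so a failure on the first file does not force N more no-op tasks.
                        while (failed.get() == false && (file = filesToUpload.poll()) != null) {
                            uploadFile(store, file);
                        }
                    } catch (Exception e) {
                        failed.set(true); // tell the other workers to stop pulling files
                    } finally {
                        store.decRef();
                    }
                });
            }
        }

        private static void uploadFile(Store store, String fileName) {
            // hypothetical placeholder for uploading one file to the blob store
        }
    }

With this shape there is one tryIncRef/decRef pair per worker instead of per file, and the failure flag only has to stop workers from pulling more files rather than suppressing exceptions from tasks that were already enqueued.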
Thanks for explaining, it makes sense, but I don't see this as a requirement to merge this PR. Let's keep it in mind for the rainy, boring days ;)
Yea, this was more of a general comment to justify the weird code :)
Jenkins run elasticsearch-ci/bwc

@tlrx I don't think this has much/any practical impact, but it is spamming the test logs all the time; that's the whole motivation here :)

Jenkins run elasticsearch-ci/bwc

Jenkins run elasticsearch-ci/packaging-sample

Thanks Tanguy!