
Conversation

@original-brownbear
Contributor

If a shard gets closed, we properly abort its snapshot
before closing it. In this case we should make sure not to
throw a confusing exception about trying to increment
the reference on an already-closed shard in the async tasks
if the snapshot is already aborted.
Also added an assertion to make sure that aborts are in
fact the only situation in which we run into a concurrently
closed store.
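
A minimal sketch of the behaviour described above (the names here, such as Store, SnapshotStatus, and uploadFile, are placeholders and not the actual Elasticsearch signatures): if the async task cannot take a reference on the store, it asserts that the snapshot was aborted and returns quietly instead of throwing.

```java
// Illustrative sketch only; Store and SnapshotStatus stand in for the real
// store and shard snapshot status types.
final class SnapshotFileTaskSketch {

    interface Store {
        boolean tryIncRef();
        void decRef();
    }

    interface SnapshotStatus {
        boolean isAborted();
    }

    void run(Store store, SnapshotStatus snapshotStatus, Runnable uploadFile) {
        if (store.tryIncRef() == false) {
            // The store can only have been concurrently closed if the snapshot was
            // aborted first, so bail out quietly instead of throwing an
            // "already closed" exception from the async task.
            assert snapshotStatus.isAborted() : "store closed but snapshot not aborted";
            return;
        }
        try {
            uploadFile.run();
        } finally {
            store.decRef();
        }
    }
}
```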

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

try {
if (alreadyFailed.get() == false) {
snapshotFile(snapshotFileInfo, indexId, shardId, snapshotId, snapshotStatus, store);
if (store.tryIncRef()) {
Contributor Author

This whole loop is kinda awkward to begin with ... makes me wonder if we shouldn't just run this on the generic pool and make the parallelism for snapshots configurable explicitly, exactly like we do for recoveries ...

Member

make the parallelism for snapshots configurable explicitly exactly like we do for recoveries

Not sure I follow you :(

Contributor Author

@original-brownbear original-brownbear Oct 4, 2019

Sorry, badly explained :)

I find this whole loop over all the files really strange. We currently create one Runnable for each file to upload individually and then enqueue all the runnables. That forces us to use the strange alreadyFailed flag to avoid crazy exceptions, and also to increment and decrement the ref count on the store for each file individually.
It seems like it would be more correct, simpler, and less hacky to have a queue of files and have workers pull from that queue until it's empty. Then each worker can just get that reference once, and we don't have to run all N tasks for N files even if the first file fails uploading.
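
A minimal sketch of that queue-based approach (all names here are hypothetical and not the actual BlobStoreRepository code): a fixed number of workers drain a shared queue of files, each worker takes the store reference exactly once, and a worker that cannot take the reference simply stops, so no per-task alreadyFailed flag is needed.

```java
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only; Store, uploadFile, and snapshotFiles are made-up names.
final class QueueBasedShardSnapshotSketch {

    interface Store {
        boolean tryIncRef();
        void decRef();
    }

    void snapshotFiles(Store store, Queue<String> fileNames, int workers) {
        ExecutorService executor = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            executor.execute(() -> {
                if (store.tryIncRef() == false) {
                    return; // store already closed, so the snapshot must have been aborted
                }
                try {
                    String fileName;
                    while ((fileName = fileNames.poll()) != null) {
                        uploadFile(fileName); // a failure here ends this worker's loop
                    }
                } finally {
                    store.decRef(); // one inc/dec pair per worker instead of per file
                }
            });
        }
        executor.shutdown();
    }

    private void uploadFile(String fileName) {
        // upload the given segment file to the repository
    }
}
```

How a failure in one worker should stop the other workers (for example by clearing the queue) is left open in this sketch; that is the part the current alreadyFailed flag handles.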

Member

Thanks for explaining, it makes sense, but I don't see this as a requirement to merge this PR. Let's keep this in mind for the rainy boring days ;)

Contributor Author

Yea, this was more of a general comment to justify the weird code :)

@original-brownbear
Contributor Author

Jenkins run elasticsearch-ci/bwc

@original-brownbear
Contributor Author

@tlrx I don't think this has much/any practical impact, but it is spamming the test logs all the time; that's the whole motivation here :)

@original-brownbear
Contributor Author

Jenkins run elasticsearch-ci/bwc

@original-brownbear
Contributor Author

Jenkins run elasticsearch-ci/packaging-sample

@original-brownbear
Contributor Author

Thanks Tanguy!

@original-brownbear original-brownbear merged commit 9141e05 into elastic:master Oct 4, 2019
@original-brownbear original-brownbear deleted the fix-messy-refcount-handling branch October 4, 2019 17:03
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Oct 4, 2019
)

If a shard gets closed, we properly abort its snapshot
before closing it. In this case we should make sure not to
throw a confusing exception about trying to increment
the reference on an already-closed shard in the async tasks
if the snapshot is already aborted.
Also added an assertion to make sure that aborts are in
fact the only situation in which we run into a concurrently
closed store.
original-brownbear added a commit that referenced this pull request Oct 5, 2019
…47594)

If a shard gets closed, we properly abort its snapshot
before closing it. In this case we should make sure not to
throw a confusing exception about trying to increment
the reference on an already-closed shard in the async tasks
if the snapshot is already aborted.
Also added an assertion to make sure that aborts are in
fact the only situation in which we run into a concurrently
closed store.
@original-brownbear original-brownbear restored the fix-messy-refcount-handling branch August 6, 2020 18:35