Ensure close is called under lock in the case of an engine failure #5800
Conversation
Typo... should be `assertLockIsHeld`
LOL yeah :)
I like it! It makes things cleaner. Left some comments.
should we just catch Throwable here to be safe?
also applies to other places in the code below, if we decide to do it.
Well, this is what the code used to do. I think we are fine here as it is, to be honest...
This is an access to the `indexWriter` without a lock. I think this can lead to an NPE if the shard is closed. I realize it's not part of the change, but I think we should deal with it.
I will use a local variable for the `indexWriter` here...
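
For illustration, a minimal sketch of the local-variable fix being discussed, assuming a volatile `indexWriter` field that a concurrent close may null out; the names are illustrative, not the actual engine code:

```java
import org.apache.lucene.index.IndexWriter;

// Minimal sketch (illustrative names): reading the volatile field twice
// races with a concurrent close() that nulls it; capturing it once in a
// local variable removes the race.
class LocalWriterSketch {
    private volatile IndexWriter indexWriter; // nulled by a concurrent close()

    void waitForMerges() {
        // racy version: if (indexWriter == null) ...; indexWriter.waitForMerges();
        IndexWriter writer = this.indexWriter; // single volatile read
        if (writer == null) {
            throw new IllegalStateException("engine is closed");
        }
        writer.waitForMerges(); // safe: the local reference cannot be nulled
    }
}
```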
@bleskes thanks for the review - I pushed another commit
I think `writer` can be null if `optimizeMutex` is true when the method begins. It seems we have this recurring pattern of calling `ensureOpen` and then getting the `indexWriter` to do something. Perhaps we can change `ensureOpen` to either throw an exception or return a writer that is guaranteed to be non-null. Then this can become `ensureOpen().waitForMerges()`.
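
For illustration, a minimal sketch of what the suggested `ensureOpen` refactoring could look like, again with illustrative names rather than the actual engine code:

```java
import org.apache.lucene.index.IndexWriter;

// Minimal sketch (illustrative names): ensureOpen() either throws or returns
// a writer that is guaranteed non-null, so "check, then use" call sites can
// simply chain off the return value.
class EnsureOpenSketch {
    private volatile IndexWriter indexWriter; // nulled by a concurrent close()

    private IndexWriter ensureOpen() {
        IndexWriter writer = this.indexWriter; // single volatile read
        if (writer == null) {
            throw new IllegalStateException("engine is closed");
        }
        return writer;
    }

    void optimize() {
        ensureOpen().waitForMerges(); // the pattern suggested in the comment
    }
}
```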
Thx Simon. Looking good. Left one last comment. I'm +1 on this otherwise.
I fixed your last suggestion! Thanks for all the reviews @bleskes. I think it's ready; if you don't object I'd like to rebase and push it.
thx. ++1 :)
Until today we closed the engine without acquiring the write lock, since most calls were still holding a read lock. This commit removes the code that holds on to the read lock when failing the engine, which means we can simply call #close()
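
For illustration, a hedged sketch of the locking change described above, using a plain `ReentrantReadWriteLock` and illustrative names rather than the engine's actual lock wrapper. A `ReentrantReadWriteLock` cannot upgrade a read lock to a write lock, which is why callers had to stop holding the read lock before `failEngine` could take the write lock:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hedged sketch (illustrative names): callers no longer hold the read lock
// while failing the engine, so failEngine() can acquire the write lock
// itself and call close() under it.
class FailEngineSketch {
    private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
    private volatile boolean closed;

    void failEngine(Throwable cause) {
        // would deadlock if this thread still held the read lock:
        // a ReentrantReadWriteLock cannot upgrade read -> write
        rwl.writeLock().lock();
        try {
            close(); // close() now always runs under the write lock
        } finally {
            rwl.writeLock().unlock();
        }
    }

    private void close() {
        assert rwl.writeLock().isHeldByCurrentThread(); // cf. assertLockIsHeld
        closed = true; // stands in for releasing the IndexWriter, store, etc.
    }
}
```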
When a replication operation (index/delete/update) fails to be executed properly, we fail the replica and allow the master to allocate a new copy of it. At the moment, the node hosting the primary shard is responsible for notifying the master of a failed replica. However, if the replica shard is initializing (`POST_RECOVERY` state), we have a race condition between the failed-shard message and moving the shard into the `STARTED` state. If the latter happens first, the master will fail to resolve the failed-shard message. This commit builds on #5800 and fails the engine of the replica shard if a replication operation fails. This protects us against the above, as the shard will reject the `STARTED` command from master. It also makes us more resilient to other race conditions in this area. Closes #5847
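
For illustration, a highly simplified, hypothetical sketch of why failing the shard protects against the race: once failed, the shard rejects the `STARTED` transition regardless of message ordering:

```java
// Hypothetical, highly simplified sketch: a shard whose engine has failed
// refuses the STARTED command, so the race with the failed-shard message
// cannot leave a broken replica running.
class ReplicaShardSketch {
    enum State { POST_RECOVERY, STARTED, FAILED }

    private volatile State state = State.POST_RECOVERY;

    // called when a replication operation (index/delete/update) fails
    void onReplicaOperationFailed(Exception cause) {
        state = State.FAILED; // fail the engine; master allocates a new copy
    }

    // called when the STARTED command from master arrives
    void moveToStarted() {
        if (state != State.POST_RECOVERY) {
            throw new IllegalStateException("cannot move to STARTED from " + state);
        }
        state = State.STARTED;
    }
}
```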