Skip to content

Conversation

@s1monw
Copy link
Contributor

@s1monw s1monw commented May 19, 2017

Today when we get a metadata snapshot from the index shard we ensure
that if there is no engine started on the shard that we lock the index
writer before we go and fetch the store metadata. Yet, if we concurrently
recover that shard, recovery finalization might fail since it can't acquire
the IW lock on the directory. This is mainly due to the wrong order of acquiring
the IW lock and the metadata lock. Fetching store metadata without a started engine
should block on the metadata lock in Store.java but since IndexShard locks the writer
first we get into a failed recovery dance especially in test. In production
this is less of an issue since we rarely get into this situation if at all.

Closes #24481

Today when we get a metadata snapshot from the index shard we ensure
that if there is no engine started on the shard that we lock the index
writer before we go and fetch the store metadata. Yet, if we concurrently
recover that shard, recovery finalization might fail since it can't acquire
the IW lock on the directory. This is mainly due to the wrong order of aquiring
the IW lock and the metadata lock. Fetching store metadata without a started engine
should block on the metadata lock in Store.java but since IndexShard locks the writer
first we get into a failed recovery dance especially in test. In production
this is less of an issue since we rarely get into this siutation if at all.

Closes elastic#24481
@s1monw s1monw added :Internal :Distributed Indexing/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. >bug v5.4.2 v5.5.0 v6.0.0 labels May 19, 2017
@s1monw s1monw requested a review from bleskes May 19, 2017 09:04
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

@s1monw s1monw merged commit b17d23d into elastic:master May 19, 2017
@s1monw s1monw deleted the issues/24481 branch May 19, 2017 13:36
s1monw added a commit that referenced this pull request May 19, 2017
Today when we get a metadata snapshot from the index shard we ensure
that if there is no engine started on the shard that we lock the index
writer before we go and fetch the store metadata. Yet, if we concurrently
recover that shard, recovery finalization might fail since it can't acquire
the IW lock on the directory. This is mainly due to the wrong order of aquiring
the IW lock and the metadata lock. Fetching store metadata without a started engine
should block on the metadata lock in Store.java but since IndexShard locks the writer
first we get into a failed recovery dance especially in test. In production
this is less of an issue since we rarely get into this siutation if at all.

Closes #24481
s1monw added a commit that referenced this pull request May 19, 2017
Today when we get a metadata snapshot from the index shard we ensure
that if there is no engine started on the shard that we lock the index
writer before we go and fetch the store metadata. Yet, if we concurrently
recover that shard, recovery finalization might fail since it can't acquire
the IW lock on the directory. This is mainly due to the wrong order of aquiring
the IW lock and the metadata lock. Fetching store metadata without a started engine
should block on the metadata lock in Store.java but since IndexShard locks the writer
first we get into a failed recovery dance especially in test. In production
this is less of an issue since we rarely get into this siutation if at all.

Closes #24481
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Dec 12, 2017
Today when we get a metadata snapshot directly from a store directory,
we acquire a metadata lock, then acquire an IW lock. However, we create
a CheckIndex in IndexShard without acquiring the metadata lock first.
This causes a recovery failed because the IW lock can be still held by
`snapshotStoreMetadata`. This commit makes sure to create a CheckIndex
under the metadata lock.

Closes elastic#24481
Relates elastic#24787
dnhatn added a commit that referenced this pull request Dec 20, 2017
Today when we get a metadata snapshot directly from a store directory, 
we acquire a metadata lock, then acquire an IndexWriter lock. However,
we create a CheckIndex in IndexShard without acquiring the metadata lock 
first. This causes a recovery failed because the IndexWriter lock can be
still held by method snapshotStoreMetadata. This commit makes sure to
create a CheckIndex under the metadata lock.

Closes #24481
Closes #27731
Relates #24787
dnhatn added a commit that referenced this pull request Dec 20, 2017
Today when we get a metadata snapshot directly from a store directory,
we acquire a metadata lock, then acquire an IndexWriter lock. However,
we create a CheckIndex in IndexShard without acquiring the metadata lock
first. This causes a recovery failed because the IndexWriter lock can be
still held by method snapshotStoreMetadata. This commit makes sure to
create a CheckIndex under the metadata lock.

Closes #24481
Closes #27731
Relates #24787
@clintongormley clintongormley added :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Distributed Indexing/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v5.4.1 v5.5.0 v6.0.0-beta1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants