@dnhatn
Member

@dnhatn dnhatn commented Sep 21, 2019

If peer recovery happens after indexing, and indexing flushes a shard at the end, then the explicit flush in the test is a no-op. The replicas then hold uncommitted translog operations, which were transferred during peer recovery even though all of those operations are already in the commit. If such a replica becomes primary after the cluster is restarted, it has translog to replay and the test fails. I can reproduce this failure in 0ced108.
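The sequence above can be sketched with a toy model. This is not Elasticsearch code: `ShardCopy`, `peer_recover`, and `retained_ops` are hypothetical names, and the model assumes that a non-forced flush does nothing when there are no uncommitted changes, and that peer recovery copies the commit and then replays the operations the primary still retains.

```python
from dataclasses import dataclass, field


@dataclass
class ShardCopy:
    """Toy stand-in for one shard copy (hypothetical, not Elasticsearch code)."""
    committed: set = field(default_factory=set)    # seq nos in the last commit
    uncommitted: set = field(default_factory=set)  # indexed but not yet committed
    translog: list = field(default_factory=list)   # seq nos still held in the translog

    def index(self, seq_no):
        # Every operation is appended to the translog, even if the commit
        # already contains it.
        self.translog.append(seq_no)
        if seq_no not in self.committed:
            self.uncommitted.add(seq_no)

    def flush(self, force=False):
        # Assumption: a non-forced flush is a no-op when nothing is uncommitted.
        if not self.uncommitted and not force:
            return False
        self.committed |= self.uncommitted
        self.uncommitted.clear()
        self.translog.clear()
        return True

    def ops_to_replay_on_restart(self):
        return len(self.translog)


def peer_recover(primary, retained_ops):
    """Copy the primary's commit, then replay the operations it still retains."""
    replica = ShardCopy(committed=set(primary.committed))
    for seq_no in retained_ops:
        replica.index(seq_no)
    return replica


primary = ShardCopy()
for seq_no in range(5):
    primary.index(seq_no)
primary.flush()                       # indexing flushes the shard at the end

replica = peer_recover(primary, retained_ops=range(5))
replica.flush()                       # the test's explicit flush: nothing to commit
print(replica.ops_to_replay_on_restart())
```

In this model the replica still has five operations to replay after the plain flush, while `replica.flush(force=True)` would empty its translog. Whether the actual change uses a forced flush is not stated in the description above.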

Another issue in this test is that synced_flush is not a replication action, so the global checkpoint on replicas might not be up to date. We need to either wait for the global checkpoint to be synced or call a replication action to sync it.
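The "wait for the global checkpoint to be synced" option could look roughly like the polling sketch below. It is an illustration only: `wait_until` and `checkpoints_in_sync` are hypothetical helpers, and the per-copy dictionaries with `global_checkpoint` and `max_seq_no` keys are an assumed shape for whatever the test reads from shard-level stats.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.1):
    """Poll predicate() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def checkpoints_in_sync(copies):
    """True when every shard copy's global checkpoint has caught up.

    `copies` is a list of dicts with 'global_checkpoint' and 'max_seq_no'
    keys (an assumed shape, for illustration only).
    """
    return all(c["global_checkpoint"] == c["max_seq_no"] for c in copies)
```

A test would pass a predicate that re-reads the stats on each call, for example `wait_until(lambda: checkpoints_in_sync(fetch_copies()))`, where `fetch_copies` is again a hypothetical function.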

Closes #46712

@dnhatn dnhatn added >test Issues or PRs that are addressing/adding tests :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v8.0.0 v7.5.0 v6.8.4 v7.4.1 v7.3.3 labels Sep 21, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Contributor

@ywelsch ywelsch left a comment

LGTM

@dnhatn
Member Author

dnhatn commented Sep 22, 2019

Thanks @ywelsch.

@dnhatn dnhatn merged commit 38277fd into elastic:master Sep 22, 2019
@dnhatn dnhatn deleted the fix-recovery-test branch September 22, 2019 20:55
dnhatn added a commit that referenced this pull request Sep 22, 2019
If peer recovery happens after indexing, and indexing flushes some shard
at the end, then the explicit flush in the test will be a noop. Then
replicas will have some uncommitted translog, which is transferred in
peer recovery, although all of these operations are in the commit
already. If that replica becomes primary (after we restarted the
cluster), it will have translog to replay and the test will fail.

Another issue in this test is that synced_flush is not a replication
action, so the global checkpoint on replicas might not be up to date.
We need to either wait for the global checkpoint to be synced or call a
replication action to sync it.

Closes #46712
dnhatn added a commit that referenced this pull request Sep 23, 2019
dnhatn added a commit that referenced this pull request Sep 23, 2019
dnhatn added a commit that referenced this pull request Sep 24, 2019
@colings86 colings86 added v7.4.0 and removed v7.4.1 labels Sep 25, 2019
dnhatn added a commit that referenced this pull request Oct 2, 2019
The pattern in the latest failure is similar to the cause fixed in #46956
but relates to synced flush. If peer recovery happens after indexing,
and indexing flushes some shard at the end, then a synced flush in the
test will not roll or commit the translog.

Closes #46712
dnhatn added a commit that referenced this pull request Oct 2, 2019
dnhatn added a commit that referenced this pull request Oct 3, 2019
dnhatn added a commit that referenced this pull request Oct 3, 2019
Labels

:Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. >test Issues or PRs that are addressing/adding tests v6.8.4 v7.3.3 v7.4.0 v7.5.0 v8.0.0-alpha1

Development

Successfully merging this pull request may close these issues.

:qa:full-cluster-restart FullClusterRestartIT.testRecovery fails

5 participants