Skip to content

Conversation

@DaveCTurner
Copy link
Contributor

The change in #44433 introduces a state in which the cluster has no relocating
shards but still has a pending reroute task which might start a shard
relocation. TransportSearchFailuresIT failed on a PR build seemingly because
it did not wait for this pending task to complete too, reporting more active shards
than expected:

2> java.lang.AssertionError:
  Expected: <9>
       but: was <10>
      at __randomizedtesting.SeedInfo.seed([4057CA4301FE95FA:207EC88573747235]:0)
      at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
      at org.junit.Assert.assertThat(Assert.java:956)
      at org.junit.Assert.assertThat(Assert.java:923)
      at org.elasticsearch.search.basic.TransportSearchFailuresIT.testFailedSearchWithWrongQuery(TransportSearchFailuresIT.java:97)

This commit addresses this failure by waiting until there are neither pending
tasks nor shard relocations in progress.

The change in elastic#44433 introduces a state in which the cluster has no relocating
shards but still has a pending reroute task which might start a shard
relocation. `TransportSearchFailuresIT` failed on a PR build seemingly because
it did not wait for this pending task to complete too, reporting more active shards
than expected:

    2> java.lang.AssertionError:
      Expected: <9>
           but: was <10>
          at __randomizedtesting.SeedInfo.seed([4057CA4301FE95FA:207EC88573747235]:0)
          at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
          at org.junit.Assert.assertThat(Assert.java:956)
          at org.junit.Assert.assertThat(Assert.java:923)
          at org.elasticsearch.search.basic.TransportSearchFailuresIT.testFailedSearchWithWrongQuery(TransportSearchFailuresIT.java:97)

This commit addresses this failure by waiting until there are neither pending
tasks nor shard relocations in progress.
@DaveCTurner DaveCTurner added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 labels Jul 18, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM2

@DaveCTurner DaveCTurner merged commit fd5b885 into elastic:master Jul 18, 2019
@DaveCTurner DaveCTurner deleted the 2019-07-18-fix-TransportSearchFailuresIT branch July 18, 2019 10:13
DaveCTurner added a commit that referenced this pull request Jul 18, 2019
Today we reroute the cluster as part of the process of starting a shard, which
runs at `URGENT` priority. In large clusters, rerouting may take some time to
complete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.

However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at `NORMAL` priority, avoiding starvation of higher-priority tasks.

Backport of #44433 and #44543.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >test Issues or PRs that are addressing/adding tests v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants