
Conversation

@original-brownbear
Contributor

@original-brownbear original-brownbear commented Dec 14, 2018

Deterministic Cluster State Tests for Snapshots

  • Use a single pointer to the current cluster state (TestClusterState) and the DeterministicTaskQueue infrastructure, with no real networking, so that every step of the state updates and snapshot task execution can be iterated through in a reproducible manner (see the drive-loop sketch below)
  • Run a single successful snapshot

* Use `DeterministicTaskQueue` infrastructure to reproduce elastic#32265
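
For orientation, a hedged sketch of the drive loop this infrastructure enables. The DeterministicTaskQueue method names are taken from the existing test infrastructure, but the loop itself is illustrative rather than the PR's exact code:

```java
// Everything asynchronous runs through one shared DeterministicTaskQueue, so the test can
// step through cluster state updates and snapshot tasks in an order that is reproducible
// from the random seed.
final DeterministicTaskQueue deterministicTaskQueue = new DeterministicTaskQueue(
    Settings.builder().put(NODE_NAME_SETTING.getKey(), "shared").build(), random());

// All services are built against deterministicTaskQueue.getThreadPool(); the test then
// drains the queue deterministically:
while (deterministicTaskQueue.hasRunnableTasks() || deterministicTaskQueue.hasDeferredTasks()) {
    if (deterministicTaskQueue.hasRunnableTasks()) {
        deterministicTaskQueue.runRandomTask();   // seeded by random(), hence reproducible
    } else {
        deterministicTaskQueue.advanceTime();     // makes deferred (scheduled) tasks runnable
    }
}
```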
@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v7.0.0 labels Dec 14, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Contributor Author

@original-brownbear original-brownbear left a comment


@ywelsch did my best to clean this up and make it readable now :) Take a look when you have some time, I think this should be much closer now to what you were envisioning :)

@Override
public ScheduledExecutorService scheduler() {
-    throw new UnsupportedOperationException();
+    return new ScheduledExecutorService() {
Contributor Author


Needed to add a dummy return here since this is used by org.elasticsearch.index.shard.IndexShard#IndexShard (for the current test it is only used in the constructor, so no actual implementation is necessary beyond that).
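
A hedged sketch of one way such a dummy could look (illustrative only; the PR itself returns an anonymous ScheduledExecutorService):

```java
// The only requirement for this test is that scheduler() returns something non-null for the
// IndexShard constructor; any attempt to actually schedule work should fail loudly.
@Override
public ScheduledExecutorService scheduler() {
    return new ScheduledThreadPoolExecutor(1) {
        @Override
        public ScheduledFuture<?> schedule(Runnable command, long delay, TimeUnit unit) {
            throw new UnsupportedOperationException("scheduling is not expected in this test");
        }
    };
}
```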

public Cancellable scheduleWithFixedDelay(Runnable command, TimeValue interval, String executor) {
-    throw new UnsupportedOperationException();
+    // TODO: Implement fully like schedule
+    return new Cancellable() {
Contributor Author


Added a dummy return only here for now. This is used to schedule the org.elasticsearch.indices.IndexingMemoryController.ShardsIndicesStatusChecker task in org.elasticsearch.indices.IndexingMemoryController. It is just a dummy for now since that task doesn't seem relevant for this test.

Contributor


Might as well properly implement this; it doesn't look that difficult (I think it's just a matter of calling super.scheduleWithFixedDelay() here), and add a test to DeterministicTaskQueueTests.

Contributor Author


Done in 2fa91b3.
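
For reference, a hedged sketch of the change the suggestion above describes (illustrative; the actual commit may differ in detail):

```java
// Delegate to the ThreadPool base implementation, which builds the fixed-delay loop on top
// of schedule() and therefore goes through the deterministic task queue as well.
@Override
public Cancellable scheduleWithFixedDelay(Runnable command, TimeValue interval, String executor) {
    return super.scheduleWithFixedDelay(command, interval, executor);
}
```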

}
};

TestClusterNode(DiscoveryNode node, DeterministicTaskQueue deterministicTaskQueue) throws IOException {
Contributor Author


Setting up all the real services used for snapshotting here, with these exceptions:

  • MockTransportService that just short-circuits the network.
  • Mock ClusterStatePublisher that just short-circuits the network (I added a TODO here because I wasn't sure if we could maybe use the real thing. It seemed very tricky to do, but maybe it's not, or it's worth the effort?). A rough sketch of the short-circuit idea follows below.
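
A hedged sketch of the short-circuiting publisher idea. The publisher and applier signatures here are assumptions rather than the PR's exact code, and testClusterNodes is a hypothetical map of the test's nodes:

```java
// Instead of publishing over the transport layer, apply the committed state directly on every
// node's applier service and on the single shared TestClusterState pointer.
masterService.setClusterStatePublisher((event, ackListener) -> {
    final ClusterState committedState = event.state();
    currentState.set(committedState);
    testClusterNodes.values().forEach(node ->
        node.clusterService.getClusterApplierService().onNewClusterState(
            "mock publish of [" + event.source() + "]",
            () -> committedState,
            new ClusterApplier.ClusterApplyListener() {
                @Override
                public void onSuccess(String source) {
                }

                @Override
                public void onFailure(String source, Exception e) {
                    throw new AssertionError("failed to apply state on node", e);
                }
            }));
});
```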

new IndexScopedSettings(settings, IndexScopedSettings.BUILT_IN_INDEX_SETTINGS);
indicesService = new IndicesService(
settings,
mock(PluginsService.class),
Contributor Author


Just mocked this one out since it isn't relevant for the test and would otherwise be a bunch of code to get up and running.

shardStateAction,
new NodeMappingRefreshAction(transportService, new MetaDataMappingService(clusterService, indicesService)),
repositoriesService,
mock(SearchService.class),
Contributor Author


Just mocked this one out since it isn't relevant for the test and would otherwise be a bunch of code to get up and running.


private final ClusterService clusterService;

private final RepositoriesService repositoriesService = mock(RepositoriesService.class);
Contributor Author


Just mocked this one out since it isn't that relevant for the test (we really only need it to return the repository; that's the only call we make to it) and it would otherwise be a bunch of code to get up and running.

)
);

runOutstandingTasks();
Contributor Author


Running all tasks so that the index is fully set up when we create the snapshot.
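
A hedged sketch of what such a helper could look like on top of the shared queue (illustrative; the PR's helper may differ):

```java
// Drain every currently runnable task so that index creation, shard allocation and
// shard-started events have all been processed before the snapshot is created.
private void runOutstandingTasks() {
    while (deterministicTaskQueue.hasRunnableTasks()) {
        deterministicTaskQueue.runRandomTask();
    }
}
```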

@original-brownbear
Contributor Author

@ywelsch fixed all the issues we talked about today in e719046:

  • Single task queue for all nodes
  • Real repositories service
  • Only a single call to run all tasks in the queue, chaining everything else via callbacks (roughly sketched below)
  • Run non-blocking transport actions in the task queue (I simply filtered for recovery actions here, since several of these were blocking)
    • The fact that those block means I still have to update the state synchronously on all nodes, see e719046#diff-8216ad578b188d436b9af3c14b23a9d9R498, because we don't have a way of handling the delayed recovery exception that leads to scheduling these blocking actions with a delay in the recovery logic.
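
A hedged sketch of the "single call to run everything" idea. The client calls are the standard admin-client builders, but masterNode.client, repoPath, snapshotCreated and the listeners are assumptions for illustration, not the PR's exact code:

```java
// Kick the scenario off as one task on the shared queue; every subsequent step is chained
// via ActionListener callbacks, so a single drain of the queue runs the whole test.
deterministicTaskQueue.scheduleNow(() ->
    masterNode.client.admin().cluster().preparePutRepository("repo")
        .setType(FsRepository.TYPE)
        .setSettings(Settings.builder().put("location", repoPath))   // repoPath: assumed shared repo dir (String)
        .execute(ActionListener.wrap(
            putRepositoryResponse -> masterNode.client.admin().cluster()
                .prepareCreateSnapshot("repo", "snapshot")
                .execute(ActionListener.wrap(
                    createSnapshotResponse -> snapshotCreated.set(true),   // snapshotCreated: assumed AtomicBoolean
                    e -> { throw new AssertionError(e); })),
            e -> { throw new AssertionError(e); })));

deterministicTaskQueue.runAllRunnableTasks();
assertTrue(snapshotCreated.get());
```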

@original-brownbear
Contributor Author

Jenkins run gradle build tests 2

Contributor

@ywelsch ywelsch left a comment


Looks very good already

repositoriesService = new RepositoriesService(
settings, clusterService, transportService,
Collections.singletonMap(FsRepository.TYPE, metaData -> {
final Repository repository = new FsRepository(metaData, createEnvironment(), xContentRegistry()) {
Contributor


I'm confused. With each node having its own environment, how do the nodes access a shared FS location for writing the snapshot?

Contributor Author


I think the reason this did not fail yet is that we aren't writing any data yet because all the shards are empty?

Regardless, I cleaned this up and made sure all nodes have the same repository path in their settings now :)
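
A hedged sketch of what sharing the repository path could look like (the Environment settings are real, the surrounding variables are illustrative):

```java
// Every test node gets its own home path but the same path.repo, so the FsRepository created
// on each node resolves to one shared on-disk location.
final Path sharedRepoPath = tempDir.resolve("repo");
final Settings nodeSettings = Settings.builder()
    .put(NODE_NAME_SETTING.getKey(), nodeName)   // nodeName: assumed per-node name
    .put(Environment.PATH_HOME_SETTING.getKey(), tempDir.resolve(nodeName).toAbsolutePath().toString())
    .put(Environment.PATH_REPO_SETTING.getKey(), sharedRepoPath.toAbsolutePath().toString())
    .build();
```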

@original-brownbear
Contributor Author

@ywelsch thanks for taking a look! All points addressed I think -> should be good for another review.

tempDir = createTempDir();
deterministicTaskQueue =
new DeterministicTaskQueue(Settings.builder().put(NODE_NAME_SETTING.getKey(), "shared").build(), random());
// TODO: Random number of master nodes and simulate master failover states
Contributor


we're not simulating failovers yet?

Contributor Author


No, not yet. When we last spoke about the steps here, I was under the impression that we wanted to get the simple successful test case in first and then add those things.

Contributor


Yes, I just found the comment confusing here, given that we have no master failovers yet.

final ClusterSettings clusterSettings = new ClusterSettings(settings, ClusterSettings.BUILT_IN_CLUSTER_SETTINGS);
final ThreadPool threadPool = deterministicTaskQueue.getThreadPool();
clusterService = new ClusterService(settings, clusterSettings, threadPool, masterService);
mockTransport = new MockTransport() {
Contributor


perhaps it's simpler to implement DisruptableMockTransport, see CoordinatorTests

Contributor Author


Right, much nicer :) Done in 1544b8a.

// Mock publisher that invokes other cluster change listeners directly
// TODO: Run state updates on the individual nodes out of order, this is currently not possible
// TODO: because it can lead to running the blocking recovery tasks on the deterministicTaskQueue
// TODO: when a DelayRecoveryException is thrown on the transport layer as a result of
Contributor


as far as I understand, the problem is not the DelayRecoveryException, but the general blocking nature of peer recoveries (e.g. PeerRecoveryTargetService blockingly waits on the recovery to complete).

Perhaps we could only have this while allocating shards, but for the duration of the snapshot, while no shards are being allocated, revert to a more randomized mode. Alternatively, we can test without replica shards for now.

Contributor Author


Removed the replicas for now in 237f9e7, which also allows for a simpler mock transport until we have non-blocking replication.
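
A hedged sketch of what testing without replicas amounts to (index name, shard count and the listener are illustrative; the zero-replicas setting is the point):

```java
// With number_of_replicas set to 0 there are no peer recoveries at all, so the test never
// hits the blocking recovery code path.
client.admin().indices().prepareCreate("test-index")
    .setSettings(Settings.builder()
        .put(IndexMetaData.SETTING_NUMBER_OF_SHARDS, randomIntBetween(1, 10))
        .put(IndexMetaData.SETTING_NUMBER_OF_REPLICAS, 0))
    .execute(createIndexListener);   // createIndexListener: assumed callback chained into the rest of the test
```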

});
});
masterService.setClusterStateSupplier(currentState::get);
if (node.isMasterNode()) {
Contributor


is this if-clause necessary?

Contributor Author


Removed in 7259b45


@original-brownbear
Contributor Author

@ywelsch all points addressed :)

Contributor

@ywelsch ywelsch left a comment


LGTM


@original-brownbear
Contributor Author

@ywelsch thanks!

@original-brownbear original-brownbear merged commit 85be9d6 into elastic:master Dec 31, 2018
@original-brownbear original-brownbear deleted the deterministic-snapshot-tests branch December 31, 2018 10:17