-
Notifications
You must be signed in to change notification settings - Fork 25.6k
SNAPSHOT: Deterministic ClusterState Tests #36644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SNAPSHOT: Deterministic ClusterState Tests #36644
Conversation
* Use `DeterministicTaskQueue` infrastructure to reproduce elastic#32265
|
Pinging @elastic/es-distributed |
original-brownbear
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ywelsch did my best to clean this up and make it readable now :) Take a look when you have some time, I think this should be much closer now to what you were envisioning :)
| @Override | ||
| public ScheduledExecutorService scheduler() { | ||
| throw new UnsupportedOperationException(); | ||
| return new ScheduledExecutorService() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed to add a dummy return here since this was used by org.elasticsearch.index.shard.IndexShard#IndexShard (only used in the constructor though for the current test, so no actual implementation necessary otherwise).
| public Cancellable scheduleWithFixedDelay(Runnable command, TimeValue interval, String executor) { | ||
| throw new UnsupportedOperationException(); | ||
| // TODO: Implement fully like schedule | ||
| return new Cancellable() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added dummy return only here for now. This is used to schedule a task org.elasticsearch.indices.IndexingMemoryController.ShardsIndicesStatusChecker in org.elasticsearch.indices.IndexingMemoryController. Just a dummy for now sine that task doesn't seem relevant for this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might as well properly implement this, looks not that difficult (I think it's just to call super. scheduleWithFixedDelay() here) and add a test to DeterministicTaskQueueTests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 2fa91b3.
| } | ||
| }; | ||
|
|
||
| TestClusterNode(DiscoveryNode node, DeterministicTaskQueue deterministicTaskQueue) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting up all the real services used for snapshotting with here with the exceptions:
- MockTransportService that just short-circuits the network.
- Mock ClusterStatePublisher that just short-circuits the network (I added a todo here, because I wasn't sure if we could maybe use the real thing here. It seemed very tricky to do so, but maybe it's not or worth the effort?)
| new IndexScopedSettings(settings, IndexScopedSettings.BUILT_IN_INDEX_SETTINGS); | ||
| indicesService = new IndicesService( | ||
| settings, | ||
| mock(PluginsService.class), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just mocked this one out since it's not relevant for the test and just a bunch of code to get up and running.
| shardStateAction, | ||
| new NodeMappingRefreshAction(transportService, new MetaDataMappingService(clusterService, indicesService)), | ||
| repositoriesService, | ||
| mock(SearchService.class), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just mocked this one out since it's not relevant for the test and just a bunch of code to get up and running.
|
|
||
| private final ClusterService clusterService; | ||
|
|
||
| private final RepositoriesService repositoriesService = mock(RepositoriesService.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just mocked this one out since it's not that relevant for the test (we really only need it to return the repository, that's the only call we make to it) and just a bunch of code to get up and running.
| ) | ||
| ); | ||
|
|
||
| runOutstandingTasks(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running all tasks so that the index is fully set up when we create the snapshot.
|
@ywelsch fixed all issues we talk about today in e719046:
|
|
Jenkins run gradle build tests 2 |
ywelsch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good already
server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java
Outdated
Show resolved
Hide resolved
| repositoriesService = new RepositoriesService( | ||
| settings, clusterService, transportService, | ||
| Collections.singletonMap(FsRepository.TYPE, metaData -> { | ||
| final Repository repository = new FsRepository(metaData, createEnvironment(), xContentRegistry()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused. With each node having their own environment, how do the nodes access a shared FS location for writing the snapshot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reason this did not fail yet is that we aren't writing any data yet because all the shards are empty?
Regardless, I cleaned this up and made sure all nodes have the same repository path in their settings now :)
server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java
Outdated
Show resolved
Hide resolved
|
@ywelsch thanks for taking a look! All points addressed I think -> should be good for another review. |
| tempDir = createTempDir(); | ||
| deterministicTaskQueue = | ||
| new DeterministicTaskQueue(Settings.builder().put(NODE_NAME_SETTING.getKey(), "shared").build(), random()); | ||
| // TODO: Random number of master nodes and simulate master failover states |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're not simulating failovers yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, not yet. I was under the impression that we wanted to get the simple successful test case in first and then add those things when we last spoke about the steps here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I found the comment just confusing here, given that we have no master failovers yet.
| final ClusterSettings clusterSettings = new ClusterSettings(settings, ClusterSettings.BUILT_IN_CLUSTER_SETTINGS); | ||
| final ThreadPool threadPool = deterministicTaskQueue.getThreadPool(); | ||
| clusterService = new ClusterService(settings, clusterSettings, threadPool, masterService); | ||
| mockTransport = new MockTransport() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps it's simpler to implement DisruptableMockTransport, see CoordinatorTests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right much nicer :), done in 1544b8a
| // Mock publisher that invokes other cluster change listeners directly | ||
| // TODO: Run state updates on the individual nodes out of order, this is currently not possible | ||
| // TODO: because it can lead to running the blocking recovery tasks on the deterministicTaskQueue | ||
| // TODO: when a DelayRecoveryException is thrown on the transport layer as a result of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as far as I understand, the problem is not the DelayRecoveryException, but the general blocking nature of peer recoveries (e.g. PeerRecoveryTargetService blockingly waits on the recovery to complete).
Perhaps we could only have this while allocating shards, but for the duration of the snapshot, while no shards are being allocated, revert to a more randomized mode. Alternatively, we can test without replica shards for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the replicas for now in 237f9e7, that also allows for a simpler mock transport until we have non blocking replication.
| }); | ||
| }); | ||
| masterService.setClusterStateSupplier(currentState::get); | ||
| if (node.isMasterNode()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this if-clause necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed in 7259b45
| public Cancellable scheduleWithFixedDelay(Runnable command, TimeValue interval, String executor) { | ||
| throw new UnsupportedOperationException(); | ||
| // TODO: Implement fully like schedule | ||
| return new Cancellable() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might as well properly implement this, looks not that difficult (I think it's just to call super. scheduleWithFixedDelay() here) and add a test to DeterministicTaskQueueTests.
|
@ywelsch all points addressed :) |
ywelsch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| tempDir = createTempDir(); | ||
| deterministicTaskQueue = | ||
| new DeterministicTaskQueue(Settings.builder().put(NODE_NAME_SETTING.getKey(), "shared").build(), random()); | ||
| // TODO: Random number of master nodes and simulate master failover states |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I found the comment just confusing here, given that we have no master failovers yet.
|
@ywelsch thanks! |
Deterministic Cluster State Tests for Snapshots
TestClusterState) andDeterministicTaskQueueinfrastructure to as well as no networking to be able to iterate through every step of state updates and snapshot task execution in a reproducible manner