
Conversation

@andreidan (Contributor) commented Aug 17, 2023

This adds downsampling support to the data stream lifecycle service implementation.
Time series backing indices of a data stream whose lifecycle configures downsampling
will be marked as read-only, downsampled, removed from the data stream and replaced
with the corresponding downsample index, and finally deleted.
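To make the "replaced with the corresponding downsample index" step concrete, here is a minimal, self-contained sketch (not the actual service code; the class and method names are invented for illustration) of swapping a source backing index for its downsample index while keeping its position in the data stream, using the index names from the example further down:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the DataStreamLifecycleService implementation.
public class BackingIndexReplacementSketch {

    static List<String> replaceBackingIndex(List<String> backingIndices, String sourceIndex, String downsampleIndex) {
        List<String> updated = new ArrayList<>(backingIndices);
        int position = updated.indexOf(sourceIndex);
        if (position < 0) {
            throw new IllegalArgumentException(sourceIndex + " is not a backing index of this data stream");
        }
        // the downsample index takes the place of the source index it was generated from
        updated.set(position, downsampleIndex);
        return updated;
    }

    public static void main(String[] args) {
        List<String> before = List.of(".ds-metrics-2023.08.22-000002", ".ds-metrics-2023.08.22-000001");
        System.out.println(replaceBackingIndex(
            before,
            ".ds-metrics-2023.08.22-000001",
            "downsample-10s-.ds-metrics-2023.08.22-000001"
        ));
        // prints: [.ds-metrics-2023.08.22-000002, downsample-10s-.ds-metrics-2023.08.22-000001]
    }
}
```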

Multiple rounds can be configured for a data stream, and the latest matching round
will be the first one to be executed.
If a downsampling operation is already in progress, we wait until it finishes
before starting the next one.
Note that in this scenario a data stream could have the following
backing indices:

[.ds-metrics-2023.08.22-000002, downsample-10s-.ds-metrics-2023.08.22-000001]

If this data stream has multiple rounds of downsampling configured,
the first generation index will subsequently be downsampled again
(and again).
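As a rough, self-contained sketch of the round-selection behaviour described above (the Round record and its fields are assumptions made for illustration, not the production classes): among the rounds whose `after` threshold the index age has reached, the latest one is executed, and the downsample index is named after the round's fixed interval, matching the downsample-10s-... name shown above.

```java
import java.time.Duration;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of picking the latest matching downsampling round.
public class RoundSelectionSketch {

    record Round(Duration after, String fixedInterval) {}

    /** Among the rounds whose 'after' threshold the index age has reached, pick the latest one. */
    static Optional<Round> latestMatchingRound(Duration indexAge, List<Round> rounds) {
        return rounds.stream()
            .filter(round -> indexAge.compareTo(round.after()) >= 0)
            .max(Comparator.comparing(Round::after));
    }

    static String downsampleIndexName(Round round, String sourceIndex) {
        // mirrors the naming visible above: downsample-10s-.ds-metrics-2023.08.22-000001
        return "downsample-" + round.fixedInterval() + "-" + sourceIndex;
    }

    public static void main(String[] args) {
        List<Round> rounds = List.of(
            new Round(Duration.ofDays(1), "10s"),
            new Round(Duration.ofDays(7), "1h")
        );
        Round round = latestMatchingRound(Duration.ofDays(3), rounds).orElseThrow();
        System.out.println(downsampleIndexName(round, ".ds-metrics-2023.08.22-000001"));
        // prints: downsample-10s-.ds-metrics-2023.08.22-000001
    }
}
```

With multiple rounds, the index produced by an earlier round would later satisfy a larger `after` threshold and be picked up for downsampling again, as described above.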

Marking this as a non-issue since we don't yet have REST support for parsing
the downsampling configuration. To keep this already-large PR manageable, the
REST support for parsing the downsample lifecycle will be added in a subsequent
PR, together with documentation.

@andreidan added the WIP, :Data Management/Data streams (Data streams and their lifecycles), and v8.11.0 labels on Aug 17, 2023
@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan requested a review from dakrone on August 22, 2023 19:01
@andreidan added the >non-issue label and removed the WIP label on Aug 22, 2023
@andreidan marked this pull request as ready for review on August 22, 2023 19:10
@elasticsearchmachine added the Team:Data Management label (Meta label for data/management team) on Aug 22, 2023
@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan requested a review from dakrone on August 24, 2023 10:29
@dakrone (Member) left a review

This generally looks good, I left a few more comments. I have a question about testing — is there a way for us to test the "ongoing" downsampling stuff in an integration test, or perhaps force an error and/or a very long downsampling run so that we can ensure that the check-status-for-an-existing-downsampling is working correctly?

affectedDataStreams++;
}
logger.trace(
"Data stream lifecycle service performed operations on [{}] indices, part of [{}] data streams",
@dakrone (Member) commented on the code above:

I'm thinking more about this. We unconditionally add the write index to the list, regardless of whether it was rolled over or not, which means that we'll always have at least 1 operation "performed" even if nothing happens. Should we add a way to decrease this if nothing happened to the write index, or keep it (potentially off-by-one), or remove the logging?

@andreidan (Contributor Author) replied:

> which means that we'll always have at least 1 operation "performed" even if nothing happens

I believe this is correct though? We issued a rollover request, so Data Stream Lifecycle has performed something.

@dakrone (Member) replied:

True, though the rollover might not have actually rolled anything over if no conditions were met. It sounds like it's fine to keep this as-is, then.

// no maintenance needed for previously started downsampling actions and we are on the last matching round so it's time
// to kick off downsampling
affectedIndices.add(index);
DownsampleAction.Request request = new DownsampleAction.Request(indexName, downsampleIndexName, null, round.config());
@dakrone (Member) commented on the code above:

Out of curiosity, is 1 day (the default wait timeout) enough for this? What would happen if the downsampling always took longer than a day? Should we include a way to increase or set this?

@andreidan (Contributor Author) replied:

Ah, this was a good point; it triggered a discussion with @martijnvg and led to opening #98875.
Hopefully the comment in the code explains why.

We retry (through the deduplicator) if we timed out, and also if there is a master failover (in which case DSL starts from scratch).
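For context, here is a minimal, self-contained sketch of the deduplication idea (this is not Elasticsearch's actual deduplicator class or API): identical in-flight requests collapse into a single execution, so re-issuing the downsample request after a timeout, or after a master failover, stays cheap and safe.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of request deduplication; not the Elasticsearch implementation.
public class RequestDeduplicatorSketch<K, V> {

    private final Map<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    /** Runs the action once per key; callers issuing the same key piggyback on the in-flight execution. */
    public CompletableFuture<V> executeOnce(K key, Supplier<CompletableFuture<V>> action) {
        CompletableFuture<V> placeholder = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, placeholder);
        if (existing != null) {
            return existing; // an identical request is already running, so this call is effectively a no-op
        }
        action.get().whenComplete((result, error) -> {
            inFlight.remove(key); // completed (either way), so a later retry will issue a fresh request
            if (error != null) {
                placeholder.completeExceptionally(error);
            } else {
                placeholder.complete(result);
            }
        });
        return placeholder;
    }
}
```

Keyed on the downsample request, something like this means that re-issuing the request on every lifecycle run only ever translates into one actual downsample operation at a time.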

if (sourceIndexMeta != null) {
// both indices exist, let's copy the origination date from the source index to the downsample index
Metadata.Builder newMetaData = Metadata.builder(state.getMetadata());
IndexMetadata updatedDownsampleMetadata = copyDataStreamLifecycleState(
@dakrone (Member) commented on the code above:

Random thought, should we capture the info we need to copy to the new downsampled index when the task is created (inside the task itself) so that if the source index is removed we can still copy the correct lifecycle information into the new downsampled index? (It might be too complicated or introduce a race condition on getting the latest info)

@andreidan (Contributor Author) replied:

Mmmmaybe, do you mind if this is a follow-up? (We'd have to provide the lifecycle custom and the generation time at task creation time, and it's quite a large change, I believe.)

@dakrone (Member) replied:

Yep, totally okay as a follow-up
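For illustration, a minimal sketch of that follow-up idea (all names, and the setting key shown, are assumptions rather than the actual task implementation): snapshot the lifecycle information when the downsample task is created, so it can still be applied to the downsample index even if the source index has been deleted in the meantime.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the actual persistent task or IndexMetadata handling code.
public class CapturedLifecycleInfoSketch {

    /** Snapshot taken at task-creation time, while the source index metadata is guaranteed to exist. */
    record CapturedLifecycleInfo(long originationDateMillis, Map<String, String> lifecycleCustomMetadata) {

        /** Merges the captured values into the (illustrative) settings of the downsample index. */
        Map<String, String> applyTo(Map<String, String> downsampleIndexSettings) {
            Map<String, String> updated = new HashMap<>(downsampleIndexSettings);
            // setting name used for illustration only
            updated.put("index.lifecycle.origination_date", Long.toString(originationDateMillis));
            updated.putAll(lifecycleCustomMetadata);
            return updated;
        }
    }

    static CapturedLifecycleInfo captureAtTaskCreation(long sourceOriginationDate, Map<String, String> sourceLifecycleCustom) {
        return new CapturedLifecycleInfo(sourceOriginationDate, Map.copyOf(sourceLifecycleCustom));
    }
}
```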

@andreidan (Contributor Author) replied:

> This generally looks good, I left a few more comments. I have a question about testing — is there a way for us to test the "ongoing" downsampling stuff in an integration test, or perhaps force an error and/or a very long downsampling run so that we can ensure that the check-status-for-an-existing-downsampling is working correctly?

DataStreamLifecycleDownsampleIT covers this. I bumped the number of documents in the source index to 50_000, and I consistently see the service waiting for the ongoing downsampling operation.

@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan requested a review from dakrone on August 25, 2023 15:17
@dakrone (Member) left a review

Thanks for iterating on this Andrei, it looks good to me. I did run into some problems with the tests, for example, this fails occasionally:

./gradlew :x-pack:plugin:downsample:check

Because it fails with

./gradlew ':x-pack:plugin:downsample:internalClusterTest' --tests "org.elasticsearch.xpack.downsample.DataStreamLifecycleDownsampleDisruptionIT.testDataStreamLifecycleDownsampleRollingRestart {seed=[15142A2ED643D7C1:C28DBD9E804E328F]}" -Dtests.seed=15142A2ED643D7C1 -Dtests.locale=zh -Dtests.timezone=America/Antigua -Druntime.java=17
  2> java.lang.AssertionError: 
    Expected: is <success>
         but: was <started>
        at __randomizedtesting.SeedInfo.seed([15142A2ED643D7C1:C28DBD9E804E328F]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.downsample.DataStreamLifecycleDownsampleDisruptionIT.lambda$testDataStreamLifecycleDownsampleRollingRestart$2(DataStreamLifecycleDownsampleDisruptionIT.java:130)

The tests are pretty long, so I wanted to make sure we don't introduce instability. It fails pretty consistently for me with:

./gradlew ':x-pack:plugin:downsample:internalClusterTest' --tests "org.elasticsearch.xpack.downsample.DataStreamLifecycleDownsampleDisruptionIT.testDataStreamLifecycleDownsampleRollingRestart" -Dtests.seed=BF66A5886C292F85 -Dtests.locale=id-ID -Dtests.timezone=Etc/GMT-4 -Druntime.java=17

Comment on lines 565 to 571
// this request here might seem weird, but hear me out:
// if we triggered a downsample operation, and then had a master failover (so DSL starts from scratch)
// we can't really find out if the downsampling persistent task failed (if it was successful, no worries, the next case
// SUCCESS branch will catch it and we will cruise forward)
// if the downsampling persistent task failed, we will find out only via re-issuing the downsample request (and we will
// continue to re-issue the request until we get SUCCESS)
downsampleIndexOnce(currentRound, indexName, downsampleIndexName);
@dakrone (Member) commented on the code above:

This is a useful comment, but perhaps it should also mention that if the master has not failed over, then this will be a no-op due to the deduplication of the request?

@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan (Contributor Author)

Thanks for the review, Lee ❤️

Regarding the disruption test: the underlying condition that sometimes caused the failure has been fixed in 3ac174f and #98769.

@andreidan merged commit b11d552 into elastic:main on Aug 26, 2023