Initial data stream lifecycle support for downsampling #98609
Conversation
@elasticmachine update branch
dakrone left a comment
This generally looks good, I left a few more comments. I have a question about testing — is there a way for us to test the "ongoing" downsampling stuff in an integration test, or perhaps force an error and/or a very long downsampling run so that we can ensure that the check-status-for-an-existing-downsampling is working correctly?
    affectedDataStreams++;
}
logger.trace(
    "Data stream lifecycle service performed operations on [{}] indices, part of [{}] data streams",
I'm thinking more about this. We unconditionally add the write index to the list, regardless of whether it was rolled over or not, which means that we'll always have at least 1 operation "performed" even if nothing happens. Should we add a way to decrease this if nothing happened to the write index, or keep it (potentially off-by-one), or remove the logging?
> which means that we'll always have at least 1 operation "performed" even if nothing happens
I believe this is correct though? We issued a rollover request so Data Stream Lifecycle has performed something.
True, though the rollover might not have actually happened if no conditions were met. It sounds like it's fine to keep this as-is then.
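For illustration, a minimal standalone sketch of the alternative floated above (all names here are hypothetical, not the PR's actual code): only count the write index as affected when the rollover response reports that a rollover actually happened, which avoids the potential off-by-one.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: count the write index as "affected" only when the
    // rollover actually happened, avoiding the off-by-one discussed above.
    public class AffectedIndexCounting {

        // Stand-in for the relevant bits of a rollover response.
        record RolloverResult(String writeIndex, boolean rolledOver) {}

        static int countAffected(RolloverResult rollover, List<String> otherAffectedIndices) {
            List<String> affected = new ArrayList<>(otherAffectedIndices);
            if (rollover.rolledOver()) {
                // only count the write index when a rollover condition was actually met
                affected.add(rollover.writeIndex());
            }
            return affected.size();
        }

        public static void main(String[] args) {
            RolloverResult noOp = new RolloverResult(".ds-metrics-000003", false);
            System.out.println(countAffected(noOp, List.of())); // prints 0, not 1
        }
    }

As the thread concludes, though, counting the issued rollover request as a performed operation is also defensible, so the logging is kept as-is.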
...treams/src/main/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleService.java (resolved)
// no maintenance needed for previously started downsampling actions and we are on the last matching round so it's time
// to kick off downsampling
affectedIndices.add(index);
DownsampleAction.Request request = new DownsampleAction.Request(indexName, downsampleIndexName, null, round.config());
Out of curiosity, is 1 day (the default wait timeout) enough for this? What would happen if the downsampling always took longer than a day? Should we include a way to increase or set this?
Ah, this was a good point; it prompted a discussion with @martijnvg and opening #98875.
Hopefully the comment in the code explains why.
We retry (through the deduplicator) if we timed out, and on a master failover (in which case DSL starts from scratch).
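To make the timeout point concrete, a standalone sketch of what a configurable wait timeout could look like (hypothetical types throughout; in the real DownsampleAction.Request shown in the diff above, the third constructor argument, null there, appears to be the wait timeout):

    import java.time.Duration;

    // Hypothetical sketch: thread an explicit wait timeout through the downsample
    // request instead of always relying on the one-day default discussed above.
    public class DownsampleWaitTimeout {

        static final Duration DEFAULT_WAIT_TIMEOUT = Duration.ofDays(1);

        // Stand-in for a request carrying an explicit wait timeout.
        record DownsampleRequest(String sourceIndex, String targetIndex, Duration waitTimeout) {}

        static DownsampleRequest requestFor(String source, String target, Duration configured) {
            // fall back to the default when nothing is configured (the `null` in the diff above)
            return new DownsampleRequest(source, target, configured != null ? configured : DEFAULT_WAIT_TIMEOUT);
        }

        public static void main(String[] args) {
            System.out.println(requestFor(".ds-metrics-000001", "downsample-10m-.ds-metrics-000001", null));
            System.out.println(requestFor(".ds-metrics-000001", "downsample-10m-.ds-metrics-000001", Duration.ofDays(7)));
        }
    }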
...treams/src/main/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleService.java (resolved)
...treams/src/main/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleService.java (resolved)
...sticsearch/datastreams/lifecycle/downsampling/ReplaceBackingWithDownsampleIndexExecutor.java (resolved)
if (sourceIndexMeta != null) {
    // both indices exist, let's copy the origination date from the source index to the downsample index
    Metadata.Builder newMetaData = Metadata.builder(state.getMetadata());
    IndexMetadata updatedDownsampleMetadata = copyDataStreamLifecycleState(
Random thought, should we capture the info we need to copy to the new downsampled index when the task is created (inside the task itself) so that if the source index is removed we can still copy the correct lifecycle information into the new downsampled index? (It might be too complicated or introduce a race condition on getting the latest info)
Mmmmaybe, do you mind if this is a follow-up? (We'd have to provide the lifecycle custom and generation time at task creation time, and it's quite a large change, I believe.)
Yep, totally okay as a follow-up
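A rough standalone sketch of the follow-up idea (hypothetical types throughout; the real change would touch ReplaceSourceWithDownsampleIndexTask): snapshot the lifecycle state into the task at creation time so it can still be applied to the downsample index even if the source index has been deleted in the meantime.

    // Hypothetical sketch of a task that owns a snapshot of the lifecycle state
    // instead of re-reading it from the source index at execution time.
    public class ReplaceSourceTaskSketch {

        // Snapshot of the state to carry over to the downsample index.
        record LifecycleSnapshot(long originationDate, Long generationTimeMillis) {}

        record ReplaceSourceWithDownsampleTask(String sourceIndex, String downsampleIndex, LifecycleSnapshot snapshot) {}

        public static void main(String[] args) {
            var snapshot = new LifecycleSnapshot(1_692_000_000_000L, null);
            var task = new ReplaceSourceWithDownsampleTask(".ds-metrics-000001", "downsample-10m-.ds-metrics-000001", snapshot);
            // An executor would apply task.snapshot() to the downsample index metadata
            // whether or not the source index still exists.
            System.out.println(task);
        }
    }

As noted above, such a snapshot could race with later updates to the source index's lifecycle state, which is part of why this was deferred to a follow-up.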
...g/elasticsearch/datastreams/lifecycle/downsampling/ReplaceSourceWithDownsampleIndexTask.java (resolved)
@elasticmachine update branch
dakrone left a comment
Thanks for iterating on this Andrei, it looks good to me. I did run into some problems with the tests, for example, this fails occasionally:
./gradlew :x-pack:plugin:downsample:check
because it hits failures like:
./gradlew ':x-pack:plugin:downsample:internalClusterTest' --tests "org.elasticsearch.xpack.downsample.DataStreamLifecycleDownsampleDisruptionIT.testDataStreamLifecycleDownsampleRollingRestart {seed=[15142A2ED643D7C1:C28DBD9E804E328F]}" -Dtests.seed=15142A2ED643D7C1 -Dtests.locale=zh -Dtests.timezone=America/Antigua -Druntime.java=17
2> java.lang.AssertionError:
Expected: is <success>
but: was <started>
at __randomizedtesting.SeedInfo.seed([15142A2ED643D7C1:C28DBD9E804E328F]:0)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at org.junit.Assert.assertThat(Assert.java:956)
at org.junit.Assert.assertThat(Assert.java:923)
at org.elasticsearch.xpack.downsample.DataStreamLifecycleDownsampleDisruptionIT.lambda$testDataStreamLifecycleDownsampleRollingRestart$2(DataStreamLifecycleDownsampleDisruptionIT.java:130)
The tests are pretty long, so I wanted to make sure we don't introduce instability. It fails pretty consistently for me with:
./gradlew ':x-pack:plugin:downsample:internalClusterTest' --tests "org.elasticsearch.xpack.downsample.DataStreamLifecycleDownsampleDisruptionIT.testDataStreamLifecycleDownsampleRollingRestart" -Dtests.seed=BF66A5886C292F85 -Dtests.locale=id-ID -Dtests.timezone=Etc/GMT-4 -Druntime.java=17
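For context on the failure mode, a standalone sketch of the polling pattern such assertions rely on (the real test presumably uses ESTestCase#assertBusy; all names below are illustrative): with too small a wait budget, a downsample operation that is still running after the rolling restart surfaces as exactly the "Expected: is <success> but: was <started>" failure above.

    import java.util.concurrent.TimeUnit;
    import java.util.function.Supplier;

    // Illustrative sketch: poll the downsample status until SUCCESS or until the
    // wait budget is exhausted.
    public class AwaitDownsampleStatus {

        enum Status { STARTED, SUCCESS }

        static void awaitSuccess(Supplier<Status> statusLookup, long timeout, TimeUnit unit) throws Exception {
            long deadline = System.nanoTime() + unit.toNanos(timeout);
            while (statusLookup.get() != Status.SUCCESS) {
                if (System.nanoTime() > deadline) {
                    // this is the shape of the assertion failure quoted above
                    throw new AssertionError("expected SUCCESS but was " + statusLookup.get());
                }
                Thread.sleep(500); // back off between polls; a rolling restart can take a while
            }
        }

        public static void main(String[] args) throws Exception {
            long start = System.currentTimeMillis();
            // simulate an operation that reaches SUCCESS after two seconds
            awaitSuccess(() -> System.currentTimeMillis() - start > 2_000 ? Status.SUCCESS : Status.STARTED, 30, TimeUnit.SECONDS);
            System.out.println("downsampling reached SUCCESS");
        }
    }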
// this request here might seem weird, but hear me out:
// if we triggered a downsample operation, and then had a master failover (so DSL starts from scratch)
// we can't really find out if the downsampling persistent task failed (if it was successful, no worries, the next case
// SUCCESS branch will catch it and we will cruise forward)
// if the downsampling persistent task failed, we will find out only via re-issuing the downsample request (and we will
// continue to re-issue the request until we get SUCCESS)
downsampleIndexOnce(currentRound, indexName, downsampleIndexName);
This is a useful comment, but perhaps it should also mention that if the master has not failed over, then this will be a no-op due to the deduplication of the request?
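A minimal sketch of the deduplication behaviour that makes the re-issued request a no-op (Elasticsearch has its own deduplication machinery; this hypothetical version only shows the idea): while a downsample request for a given index is still in flight, issuing the same request again does nothing, and completion clears the slot so a timed-out request can be retried.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of request deduplication keyed by index.
    public class RequestDeduplicatorSketch {

        private final Map<String, Boolean> inFlight = new ConcurrentHashMap<>();

        void executeOnce(String requestKey, Runnable sendRequest) {
            // only the first caller for this key actually sends the request
            if (inFlight.putIfAbsent(requestKey, Boolean.TRUE) == null) {
                sendRequest.run();
            }
        }

        void onCompletion(String requestKey) {
            // allow a retry after success, failure, or timeout
            inFlight.remove(requestKey);
        }

        public static void main(String[] args) {
            RequestDeduplicatorSketch dedup = new RequestDeduplicatorSketch();
            dedup.executeOnce("downsample-.ds-metrics-000001", () -> System.out.println("request sent"));
            dedup.executeOnce("downsample-.ds-metrics-000001", () -> System.out.println("never printed")); // no-op
        }
    }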
...sticsearch/datastreams/lifecycle/downsampling/ReplaceBackingWithDownsampleIndexExecutor.java (outdated, resolved)
@elasticmachine update branch
…ed around midnight
Co-authored-by: Lee Hinman <[email protected]>
This adds data stream lifecycle service implementation support
for downsampling.
Time series backing indices for a data stream with a lifecycle
that configures downsampling will be marked as read-only,
downsampled, removed from the data stream, replaced with the
corresponding downsample index, and deleted.
Multiple rounds can be configured for a data stream, and the
latest matching round will be the first one to be executed.
If one downsampling operation is in progress, we wait until it finishes and only then start the next one.
Note that in this scenario a data stream can simultaneously contain
regular backing indices and downsample indices produced by different rounds.
If this data stream has multiple rounds of downsampling configured,
the first generation index will subsequently be downsampled again
(and again).
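To illustrate the "latest matching round wins" rule described above, a standalone sketch (hypothetical types; the real configuration lives in the data stream lifecycle): of all rounds whose "after" age has elapsed for a backing index, the round with the largest "after" is executed first.

    import java.time.Duration;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    // Illustrative sketch of selecting the latest matching downsampling round.
    public class RoundSelectionSketch {

        record Round(Duration after, String fixedInterval) {}

        static Optional<Round> latestMatchingRound(List<Round> rounds, Duration indexAge) {
            return rounds.stream()
                .filter(round -> indexAge.compareTo(round.after()) >= 0)
                .max(Comparator.comparing(Round::after));
        }

        public static void main(String[] args) {
            List<Round> rounds = List.of(
                new Round(Duration.ofDays(1), "10m"),
                new Round(Duration.ofDays(7), "1h")
            );
            // A 10-day-old index matches both rounds, so the 7-day round runs first;
            // an earlier-generation index can later be downsampled again at a coarser interval.
            System.out.println(latestMatchingRound(rounds, Duration.ofDays(10)));
        }
    }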
Marking this as non-issue as we don't yet have REST support to
parse the downsampling configuration. In an attempt to keep this already
large PR manageable, the REST support for parsing the downsample
lifecycle will be added in a subsequent PR with documentation.