
Conversation

@andreidan (Contributor) commented Aug 17, 2023

This adds downsampling support to the data stream lifecycle service implementation.
Time series backing indices of a data stream whose lifecycle configures downsampling
will be marked as read-only, downsampled, removed from the data stream and replaced
with the corresponding downsample index, and finally deleted.
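To make the "replaced with the corresponding downsample index" step concrete, here is a minimal, self-contained sketch (not the actual service code; the class and method names are invented for illustration) of swapping a source backing index for its downsample index while keeping its position in the data stream, using the index names from the example further down:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the DataStreamLifecycleService implementation.
public class BackingIndexReplacementSketch {

    static List<String> replaceBackingIndex(List<String> backingIndices, String sourceIndex, String downsampleIndex) {
        List<String> updated = new ArrayList<>(backingIndices);
        int position = updated.indexOf(sourceIndex);
        if (position < 0) {
            throw new IllegalArgumentException(sourceIndex + " is not a backing index of this data stream");
        }
        // the downsample index takes the place of the source index it was generated from
        updated.set(position, downsampleIndex);
        return updated;
    }

    public static void main(String[] args) {
        List<String> before = List.of(".ds-metrics-2023.08.22-000002", ".ds-metrics-2023.08.22-000001");
        System.out.println(replaceBackingIndex(
            before,
            ".ds-metrics-2023.08.22-000001",
            "downsample-10s-.ds-metrics-2023.08.22-000001"
        ));
        // prints: [.ds-metrics-2023.08.22-000002, downsample-10s-.ds-metrics-2023.08.22-000001]
    }
}
```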

Multiple rounds can be configured for a data stream, and the latest matching round
will be the first one to be executed.
If a downsampling operation is already in progress, we wait until it finishes
before starting the next one.
Note that in this scenario a data stream could have the following
backing indices:

[.ds-metrics-2023.08.22-000002, downsample-10s-.ds-metrics-2023.08.22-000001]

If this data stream has multiple rounds of downsampling configured,
the first generation index will subsequently be downsampled again
(and again).
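As a rough, self-contained sketch of the round-selection behaviour described above (the Round record and its fields are assumptions made for illustration, not the production classes): among the rounds whose `after` threshold the index age has reached, the latest one is executed, and the downsample index is named after the round's fixed interval, matching the downsample-10s-... name shown above.

```java
import java.time.Duration;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of picking the latest matching downsampling round.
public class RoundSelectionSketch {

    record Round(Duration after, String fixedInterval) {}

    /** Among the rounds whose 'after' threshold the index age has reached, pick the latest one. */
    static Optional<Round> latestMatchingRound(Duration indexAge, List<Round> rounds) {
        return rounds.stream()
            .filter(round -> indexAge.compareTo(round.after()) >= 0)
            .max(Comparator.comparing(Round::after));
    }

    static String downsampleIndexName(Round round, String sourceIndex) {
        // mirrors the naming visible above: downsample-10s-.ds-metrics-2023.08.22-000001
        return "downsample-" + round.fixedInterval() + "-" + sourceIndex;
    }

    public static void main(String[] args) {
        List<Round> rounds = List.of(
            new Round(Duration.ofDays(1), "10s"),
            new Round(Duration.ofDays(7), "1h")
        );
        Round round = latestMatchingRound(Duration.ofDays(3), rounds).orElseThrow();
        System.out.println(downsampleIndexName(round, ".ds-metrics-2023.08.22-000001"));
        // prints: downsample-10s-.ds-metrics-2023.08.22-000001
    }
}
```

With multiple rounds, the index produced by an earlier round would later satisfy a larger `after` threshold and be picked up for downsampling again, as described above.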

Marking this as a non-issue since we don't yet have REST support for parsing
the downsampling configuration. To keep this already-large PR manageable, the
REST support for parsing the downsample lifecycle will be added in a subsequent
PR, together with documentation.

@andreidan added the WIP, :Data Management/Data streams (Data streams and their lifecycles), and v8.11.0 labels on Aug 17, 2023
@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan requested a review from dakrone on August 22, 2023 19:01
@andreidan added the >non-issue label and removed the WIP label on Aug 22, 2023
@andreidan marked this pull request as ready for review on August 22, 2023 19:10
@elasticsearchmachine added the Team:Data Management label (Meta label for data/management team) on Aug 22, 2023
@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan requested a review from dakrone on August 24, 2023 10:29
@dakrone (Member) left a review

This generally looks good, I left a few more comments. I have a question about testing — is there a way for us to test the "ongoing" downsampling stuff in an integration test, or perhaps force an error and/or a very long downsampling run so that we can ensure that the check-status-for-an-existing-downsampling is working correctly?

affectedDataStreams++;
}
logger.trace(
"Data stream lifecycle service performed operations on [{}] indices, part of [{}] data streams",
@dakrone (Member) commented on the code above:

I'm thinking more about this. We unconditionally add the write index to the list, regardless of whether it was rolled over or not, which means that we'll always have at least 1 operation "performed" even if nothing happens. Should we add a way to decrease this if nothing happened to the write index, or keep it (potentially off-by-one), or remove the logging?

@andreidan (Contributor Author) replied:

> which means that we'll always have at least 1 operation "performed" even if nothing happens

I believe this is correct though? We issued a rollover request, so Data Stream Lifecycle has performed something.

@dakrone (Member) replied:

True, though the rollover might not have actually rolled anything over if no conditions were met. It sounds like it's fine to keep this as-is, then.

// no maintenance needed for previously started downsampling actions and we are on the last matching round so it's time
// to kick off downsampling
affectedIndices.add(index);
DownsampleAction.Request request = new DownsampleAction.Request(indexName, downsampleIndexName, null, round.config());
@dakrone (Member) commented on the code above:

Out of curiosity, is 1 day (the default wait timeout) enough for this? What would happen if the downsampling always took longer than a day? Should we include a way to increase or set this?

@andreidan (Contributor Author) replied:

Ah, this was a good point; it triggered a discussion with @martijnvg and led to opening #98875.
Hopefully the comment in the code explains why.

We retry (through the deduplicator) if we timed out, and also if there is a master failover (in which case DSL starts from scratch).
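For context, here is a minimal, self-contained sketch of the deduplication idea (this is not Elasticsearch's actual deduplicator class or API): identical in-flight requests collapse into a single execution, so re-issuing the downsample request after a timeout, or after a master failover, stays cheap and safe.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of request deduplication; not the Elasticsearch implementation.
public class RequestDeduplicatorSketch<K, V> {

    private final Map<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    /** Runs the action once per key; callers issuing the same key piggyback on the in-flight execution. */
    public CompletableFuture<V> executeOnce(K key, Supplier<CompletableFuture<V>> action) {
        CompletableFuture<V> placeholder = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, placeholder);
        if (existing != null) {
            return existing; // an identical request is already running, so this call is effectively a no-op
        }
        action.get().whenComplete((result, error) -> {
            inFlight.remove(key); // completed (either way), so a later retry will issue a fresh request
            if (error != null) {
                placeholder.completeExceptionally(error);
            } else {
                placeholder.complete(result);
            }
        });
        return placeholder;
    }
}
```

Keyed on the downsample request, something like this means that re-issuing the request on every lifecycle run only ever translates into one actual downsample operation at a time.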

if (sourceIndexMeta != null) {
// both indices exist, let's copy the origination date from the source index to the downsample index
Metadata.Builder newMetaData = Metadata.builder(state.getMetadata());
IndexMetadata updatedDownsampleMetadata = copyDataStreamLifecycleState(
@dakrone (Member) commented on the code above:

Random thought, should we capture the info we need to copy to the new downsampled index when the task is created (inside the task itself) so that if the source index is removed we can still copy the correct lifecycle information into the new downsampled index? (It might be too complicated or introduce a race condition on getting the latest info)

@andreidan (Contributor Author) replied:

Mmmmaybe, do you mind if this is a follow-up? (We'd have to provide the lifecycle custom and the generation time at task creation time, and it's quite a large change, I believe.)

@dakrone (Member) replied:

Yep, totally okay as a follow-up
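For illustration, a minimal sketch of that follow-up idea (all names, and the setting key shown, are assumptions rather than the actual task implementation): snapshot the lifecycle information when the downsample task is created, so it can still be applied to the downsample index even if the source index has been deleted in the meantime.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the actual persistent task or IndexMetadata handling code.
public class CapturedLifecycleInfoSketch {

    /** Snapshot taken at task-creation time, while the source index metadata is guaranteed to exist. */
    record CapturedLifecycleInfo(long originationDateMillis, Map<String, String> lifecycleCustomMetadata) {

        /** Merges the captured values into the (illustrative) settings of the downsample index. */
        Map<String, String> applyTo(Map<String, String> downsampleIndexSettings) {
            Map<String, String> updated = new HashMap<>(downsampleIndexSettings);
            // setting name used for illustration only
            updated.put("index.lifecycle.origination_date", Long.toString(originationDateMillis));
            updated.putAll(lifecycleCustomMetadata);
            return updated;
        }
    }

    static CapturedLifecycleInfo captureAtTaskCreation(long sourceOriginationDate, Map<String, String> sourceLifecycleCustom) {
        return new CapturedLifecycleInfo(sourceOriginationDate, Map.copyOf(sourceLifecycleCustom));
    }
}
```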

@andreidan (Contributor Author) replied:

> This generally looks good, I left a few more comments. I have a question about testing — is there a way for us to test the "ongoing" downsampling stuff in an integration test, or perhaps force an error and/or a very long downsampling run so that we can ensure that the check-status-for-an-existing-downsampling is working correctly?

DataStreamLifecycleDownsampleIT covers this. I bumped the number of documents in the source index to 50_000, and I consistently see the service waiting for the ongoing downsampling operation.

@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan requested a review from dakrone on August 25, 2023 15:17
@dakrone (Member) left a review

Thanks for iterating on this Andrei, it looks good to me. I did run into some problems with the tests, for example, this fails occasionally:

./gradlew :x-pack:plugin:downsample:check

Because it fails with

./gradlew ':x-pack:plugin:downsample:internalClusterTest' --tests "org.elasticsearch.xpack.downsample.DataStreamLifecycleDownsampleDisruptionIT.testDataStreamLifecycleDownsampleRollingRestart {seed=[15142A2ED643D7C1:C28DBD9E804E328F]}" -Dtests.seed=15142A2ED643D7C1 -Dtests.locale=zh -Dtests.timezone=America/Antigua -Druntime.java=17
  2> java.lang.AssertionError: 
    Expected: is <success>
         but: was <started>
        at __randomizedtesting.SeedInfo.seed([15142A2ED643D7C1:C28DBD9E804E328F]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.downsample.DataStreamLifecycleDownsampleDisruptionIT.lambda$testDataStreamLifecycleDownsampleRollingRestart$2(DataStreamLifecycleDownsampleDisruptionIT.java:130)

The tests are pretty long, so I wanted to make sure we don't introduce instability. It fails pretty consistently for me with:

./gradlew ':x-pack:plugin:downsample:internalClusterTest' --tests "org.elasticsearch.xpack.downsample.DataStreamLifecycleDownsampleDisruptionIT.testDataStreamLifecycleDownsampleRollingRestart" -Dtests.seed=BF66A5886C292F85 -Dtests.locale=id-ID -Dtests.timezone=Etc/GMT-4 -Druntime.java=17

Comment on lines 565 to 571
// this request here might seem weird, but hear me out:
// if we triggered a downsample operation, and then had a master failover (so DSL starts from scratch)
// we can't really find out if the downsampling persistent task failed (if it was successful, no worries, the next case
// SUCCESS branch will catch it and we will cruise forward)
// if the downsampling persistent task failed, we will find out only via re-issuing the downsample request (and we will
// continue to re-issue the request until we get SUCCESS)
downsampleIndexOnce(currentRound, indexName, downsampleIndexName);
@dakrone (Member) commented on the code above:

This is a useful comment, but perhaps it should also mention that if the master has not failed over, then this will be a no-op due to the deduplication of the request?

@andreidan (Contributor Author)

@elasticmachine update branch

@andreidan (Contributor Author)

Thanks for the review, Lee ❤️

Regarding the disruption test: the underlying condition that sometimes caused the failure has been fixed in 3ac174f and #98769.

@andreidan merged commit b11d552 into elastic:main on Aug 26, 2023