Member

@jbaiera jbaiera commented Jul 8, 2022

When an automated snapshot fails, the last failure for a policy is captured and stored in the cluster state. We store the last successful snapshot invocation in the same way. We do not, however, track how many invocations have passed between the last successful snapshot and the most recent failure. That count would be helpful for reporting on SLM policy health.

Snapshot lifecycle policies are scheduled using a cron expression rather than a fixed delay, so the time between snapshot attempts can vary. This makes it difficult to select a time window in which continuous snapshot failure indicates a real problem rather than a transient issue. By including the count of failed invocations since the last success, we can provide health reporting logic that tolerates some transient failures while remaining agnostic of the variable execution times that cron can produce.
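
As a rough illustration of the idea (not the code in this PR), the counter resets on every successful snapshot and increments on every failure, so a health check can compare it against a threshold without knowing anything about the cron schedule. The class name, method names, and the threshold of 3 below are all hypothetical.

```java
// Hypothetical sketch: names and threshold are illustrative, not the actual SLM classes.
class InvocationTracker {
    // Number of failed invocations since the last successful snapshot.
    private long invocationsSinceLastSuccess = 0;

    // A successful snapshot resets the counter.
    void recordSuccess() {
        invocationsSinceLastSuccess = 0;
    }

    // Each failed snapshot invocation increments the counter.
    void recordFailure() {
        invocationsSinceLastSuccess++;
    }

    long getInvocationsSinceLastSuccess() {
        return invocationsSinceLastSuccess;
    }

    // Health is judged by the count of failures, not by elapsed time,
    // so it does not depend on how often the cron expression fires.
    boolean isUnhealthy(long failureThreshold) {
        return invocationsSinceLastSuccess >= failureThreshold;
    }

    public static void main(String[] args) {
        InvocationTracker tracker = new InvocationTracker();
        tracker.recordFailure();
        tracker.recordFailure();
        System.out.println("unhealthy after 2 failures: " + tracker.isUnhealthy(3));
        tracker.recordFailure();
        System.out.println("unhealthy after 3 failures: " + tracker.isUnhealthy(3));
        tracker.recordSuccess();
        System.out.println("count after success: " + tracker.getInvocationsSinceLastSuccess());
    }
}
```

A count-based threshold tolerates a couple of transient failures for both an hourly policy and a weekly one, which a fixed time window could not do.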

@jbaiera jbaiera added >enhancement :Data Management/ILM+SLM Index and Snapshot lifecycle management v8.4.0 labels Jul 8, 2022
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Jul 8, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator

Hi @jbaiera, I've created a changelog YAML for you.

Member

@dakrone dakrone left a comment

This generally looks good to me, but I think we should make it non-null and treat the default missing value as 0 invocations. What do you think?
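
A minimal sketch of what that suggestion could look like, assuming the stored metadata is read from a plain map. The class name and the field name `invocations_since_last_success` are assumptions for illustration and are not taken from this PR.

```java
import java.util.Map;

// Hypothetical sketch: the class and field name are illustrative assumptions.
class InvocationCountReader {
    // Assumed field name for the stored counter.
    static final String FIELD = "invocations_since_last_success";

    // Returns a primitive long so callers never deal with null;
    // a missing (or non-numeric) value is treated as 0 invocations.
    static long read(Map<String, Object> stored) {
        Object raw = stored.get(FIELD);
        return raw instanceof Number ? ((Number) raw).longValue() : 0L;
    }
}
```

With this shape, metadata written before the field existed reads back as 0 rather than null.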

@jbaiera
Member Author

jbaiera commented Jul 11, 2022

@elasticmachine run elasticsearch-ci/docs

Member

@dakrone dakrone left a comment

LGTM

@jbaiera jbaiera merged commit b790256 into elastic:master Jul 12, 2022
@jbaiera jbaiera deleted the slm-add-invocation-counts branch July 12, 2022 15:26
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jul 13, 2022
* upstream/master: (38 commits)
  Simplify map copying (elastic#88432)
  Make DiffableUtils.diff implementation agnostic (elastic#88403)
  Ingest: Start separating Metadata from IngestSourceAndMetadata (elastic#88401)
  Move runtime fields base scripts out of scripting fields api package. (elastic#88488)
  Enable TRACE Logging for test and increase timeout (elastic#88477)
  Mute ReactiveStorageIT#testScaleDuringSplitOrClone (elastic#88480)
  Track the count of failed invocations since last successful policy snapshot (elastic#88398)
  Avoid noisy exceptions on data nodes when aborting snapshots (elastic#88476)
  Fix ReactiveStorageDeciderServiceTests testNodeSizeForDataBelowLowWatermark (elastic#88452)
  INFO logging of snapshot restore and completion (elastic#88257)
  unmute test (elastic#88454)
  Updatable API keys - noop check (elastic#88346)
  Corrected an incomplete sentence. (elastic#86542)
  Use consistent shard map type in IndexService (elastic#88465)
  Stop registering TestGeoShapeFieldMapperPlugin in ESIntegTestCase (elastic#88460)
  TSDB: RollupShardIndexer logging improvements (elastic#88416)
  Audit API key ID when create or grant API keys (elastic#88456)
  Bound random negative size test in SearchSourceBuilderTests#testNegativeSizeErrors (elastic#88457)
  Updatable API keys - logging audit trail event (elastic#88276)
  Polish reworked LoggedExec task (elastic#88424)
  ...

# Conflicts:
#	x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/v2/RollupShardIndexer.java
Labels

:Data Management/ILM+SLM Index and Snapshot lifecycle management >enhancement Team:Data Management Meta label for data/management team v8.4.0
