Removing phases from an ILM policy while some indices are mid-policy can cause log spam

**Elasticsearch version** (`bin/elasticsearch --version`): observed in logs from clusters on versions ranging from 6.8.5 to 7.7.0, reproducible on 7.6.2, there is a different but similar bug ([see below](https://github.com/elastic/elasticsearch/issues/55175#issuecomment-613682808)) in prerelease versions of 7.7.0.

If a policy is changed to remove some phases while it is in use by an index that is partway through the policy, ILM can get "stuck" in a state for that index where it emits a log every poll interval for each index that's "stuck". There's no functionality errors, as this (as far as I'm aware) can only occur when at the end of a policy anyway, just log spam.

The log entries in question look like this:
```
2020-04-14T12:02:50,691][ERROR][o.e.x.i.IndexLifecycleRunner] [Elastic-MBP-4.local] current step [{"phase":"warm","action":"complete","name":"complete"}] for index [test-000001] with policy [my_policy] is not recognized
```

It appears that this occurs when an index is at the end of a phase (e.g. in step `warm/completed/completed`) when the policy is updated to remove phases following where the index is currently.  In the example of an index in the `warm` phase, the JSON definition for the phase is cached, but when there are no subsequent phases, the step will be generated from the cached policy JSON as `completed/completed/completed`, rather than `warm/completed/completed`, which leads to this error.

This bug appears to have been fixed as part of #51631 in 7.7.0, but another bug was introduced ([see below](https://github.com/elastic/elasticsearch/issues/55175#issuecomment-613682808)) which leads to very similar log spam. (Update: Bug only present in prerelease versions of 7.7.0)

Reproduction script (it is recommended to set `indices.lifecycle.history_index_enabled` to `false` to make the ILM logs easier to read):
```
// Set ILM poll interval to 1s to speed things up
PUT _cluster/settings
{
    "transient": {
        "indices.lifecycle.poll_interval": "1s",
        "logger.org.elasticsearch.xpack.ilm": "trace",
        "logger.org.elasticsearch.xpack.core.ilm": "trace"
    }
}

// Create a policy with multiple phases + an index using it
PUT _ilm/policy/my_policy
{
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_docs": 1
                    }
                }
            },
            "warm": {
                "actions": {
                    "set_priority": {
                        "priority": 10
                    }
                }
            },
            "delete": {
                "min_age": "1d",
                "actions": {
                    "delete": {}
                }
            }
        }
    }
}

PUT test-000001
{
    "settings": {
        "index.number_of_shards": 3,
        "index.lifecycle.name": "my_policy",
        "index.lifecycle.rollover_alias": "test-alias"
    },
    "aliases": {
        "test-alias": {
            "is_write_index": true
        }
    }
}

// Once the index is checking rollover conditions, index a doc
POST test-alias/_doc
{
  "foo": "bar"
}

// Wait for the index to finish the warm phase, then update the policy to remove the warm & delete phases
PUT _ilm/policy/my_policy
{
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_docs": 2
                    }
                }
            }
        }
    }
}
```

Following all that, this error should now be emitted every poll interval (`1s` per script above):
```
2020-04-14T12:02:50,691][ERROR][o.e.x.i.IndexLifecycleRunner] [Elastic-MBP-4.local] current step [{"phase":"warm","action":"complete","name":"complete"}] for index [test-000001] with policy [my_policy] is not recognized
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Removing phases from an ILM policy while some indices are mid-policy can cause log spam #55175

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Removing phases from an ILM policy while some indices are mid-policy can cause log spam #55175

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions