Skip to content

Removing phases from an ILM policy while some indices are mid-policy can cause log spam #55175

@gwbrown

Description

@gwbrown

Elasticsearch version (bin/elasticsearch --version): observed in logs from clusters on versions ranging from 6.8.5 to 7.7.0, reproducible on 7.6.2, there is a different but similar bug (see below) in prerelease versions of 7.7.0.

If a policy is changed to remove some phases while it is in use by an index that is partway through the policy, ILM can get "stuck" in a state for that index where it emits a log every poll interval for each index that's "stuck". There's no functionality errors, as this (as far as I'm aware) can only occur when at the end of a policy anyway, just log spam.

The log entries in question look like this:

2020-04-14T12:02:50,691][ERROR][o.e.x.i.IndexLifecycleRunner] [Elastic-MBP-4.local] current step [{"phase":"warm","action":"complete","name":"complete"}] for index [test-000001] with policy [my_policy] is not recognized

It appears that this occurs when an index is at the end of a phase (e.g. in step warm/completed/completed) when the policy is updated to remove phases following where the index is currently. In the example of an index in the warm phase, the JSON definition for the phase is cached, but when there are no subsequent phases, the step will be generated from the cached policy JSON as completed/completed/completed, rather than warm/completed/completed, which leads to this error.

This bug appears to have been fixed as part of #51631 in 7.7.0, but another bug was introduced (see below) which leads to very similar log spam. (Update: Bug only present in prerelease versions of 7.7.0)

Reproduction script (it is recommended to set indices.lifecycle.history_index_enabled to false to make the ILM logs easier to read):

// Set ILM poll interval to 1s to speed things up
PUT _cluster/settings
{
    "transient": {
        "indices.lifecycle.poll_interval": "1s",
        "logger.org.elasticsearch.xpack.ilm": "trace",
        "logger.org.elasticsearch.xpack.core.ilm": "trace"
    }
}

// Create a policy with multiple phases + an index using it
PUT _ilm/policy/my_policy
{
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_docs": 1
                    }
                }
            },
            "warm": {
                "actions": {
                    "set_priority": {
                        "priority": 10
                    }
                }
            },
            "delete": {
                "min_age": "1d",
                "actions": {
                    "delete": {}
                }
            }
        }
    }
}

PUT test-000001
{
    "settings": {
        "index.number_of_shards": 3,
        "index.lifecycle.name": "my_policy",
        "index.lifecycle.rollover_alias": "test-alias"
    },
    "aliases": {
        "test-alias": {
            "is_write_index": true
        }
    }
}

// Once the index is checking rollover conditions, index a doc
POST test-alias/_doc
{
  "foo": "bar"
}

// Wait for the index to finish the warm phase, then update the policy to remove the warm & delete phases
PUT _ilm/policy/my_policy
{
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_docs": 2
                    }
                }
            }
        }
    }
}

Following all that, this error should now be emitted every poll interval (1s per script above):

2020-04-14T12:02:50,691][ERROR][o.e.x.i.IndexLifecycleRunner] [Elastic-MBP-4.local] current step [{"phase":"warm","action":"complete","name":"complete"}] for index [test-000001] with policy [my_policy] is not recognized

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions