[Transform] add the ability to delete documents from the destination index #67916

@hendrikmuhs

Transform provides a persistent view of data, either by pivoting it or by keeping the latest state. In "continuous mode" this view is kept up to date.

However, transform only ever adds new data. You might want to age out old data, or remove data from the persistent view based on other criteria. The gap is most visible for latest, where you might want to delete entities that have not been seen for a long time. For example, if you transform host information, you might want to remove decommissioned hosts.

Overall integration

Retention will be part of the overall transform configuration:

{
    "source": { ... },
    "dest": { ... },

    "pivot": { ... },
OR
    "latest": { ... },
    "retention_policy": {
        "name": {...}
    }
}

Therefore, retention_policy will be available for both pivot and latest.

Nesting the policy under its name at an extra level gives us an extension point for later. The first retention_policy to be implemented is time:

Time based retention

    "retention_policy": {
        "time": {
            "field": "@timestamp",
            "max_age": "30d"
        }
    }

This policy requires you to configure a timestamp field (likely the same field as used for sync) and a max_age. Data that is older than max_age is considered outdated and will be removed as part of checkpointing.

Retention integration into checkpoints

Retention will be implemented as the last step of checkpointing. When a checkpoint is completed, data that should be deleted as defined by the policy is removed from the destination index. Retention is calculated based on the checkpoint time.
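As an illustration of how the cutoff could be derived, here is a minimal Python sketch. It assumes that the cutoff is the checkpoint time minus max_age and that the deletion is expressed as a range query on the configured field; the issue does not state the actual delete mechanism. The helper names `max_age_to_millis` and `retention_query` are hypothetical, and only a subset of time units is handled.

```python
import re

# Hypothetical helper table: covers only a subset of possible time units.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def max_age_to_millis(max_age: str) -> int:
    """Convert a value such as "30d" into milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", max_age)
    if match is None:
        raise ValueError(f"unsupported max_age: {max_age!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


def retention_query(policy: dict, checkpoint_millis: int) -> dict:
    """Build a query body matching documents older than the cutoff.

    Assumption: cutoff = checkpoint time - max_age, as suggested by
    "Retention is calculated based on the checkpoint time".
    """
    time_policy = policy["time"]
    cutoff = checkpoint_millis - max_age_to_millis(time_policy["max_age"])
    return {
        "query": {
            "range": {
                time_policy["field"]: {"lt": cutoff, "format": "epoch_millis"}
            }
        }
    }
```

Using the checkpoint time rather than the wall clock keeps the result independent of how long the checkpoint itself took to run.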

Retention policy updating

Updating the retention policy is supported by _update. If _update is called on a running transform, the update takes effect when the next checkpoint starts. The currently running checkpoint keeps using the policy it started with.
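The update semantics can be pictured with a small Python sketch. It is only a model of the behaviour described here, not the transform implementation; the class `TransformSketch` and its methods are hypothetical names introduced for illustration.

```python
class TransformSketch:
    """Toy model: a checkpoint uses the policy that was configured when it started."""

    def __init__(self, retention_policy):
        self._configured = retention_policy  # latest policy set via update
        self._in_use = None                  # policy of the running checkpoint

    def update(self, retention_policy):
        # Stands in for an _update call: only the stored configuration changes.
        self._configured = retention_policy

    def start_checkpoint(self):
        # The configured policy is picked up when a new checkpoint starts.
        self._in_use = self._configured

    def policy_for_running_checkpoint(self):
        return self._in_use
```

In this model, calling `update` between two `start_checkpoint` calls leaves the running checkpoint on the old policy, and the new policy is used from the following checkpoint on.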

FYI: @elastic/ml-ui it would be good to support retention policy in the update fly-out

Retention policy stats

For measuring the retention policy we add 2 counters to _stats:

documents_deleted

Total number of documents deleted in the transform destination index by this transform.

delete_time_in_ms

Cumulative sum of time spent deleting documents in the transform destination index by this transform.
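A rough Python sketch of how the two counters could accumulate across delete operations follows. The field names match the proposed stats; the `RetentionStats` class, the `timed_delete` helper, and the idea that the delete call returns the number of removed documents are assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RetentionStats:
    documents_deleted: int = 0   # total documents removed by this transform
    delete_time_in_ms: int = 0   # cumulative time spent deleting


def timed_delete(stats: RetentionStats, delete_fn: Callable[[], int]) -> None:
    """Run one delete operation and add its result to the counters.

    Assumption: delete_fn returns the number of documents it removed.
    """
    started = time.monotonic()
    deleted = delete_fn()
    elapsed_ms = int((time.monotonic() - started) * 1000)
    stats.documents_deleted += deleted
    stats.delete_time_in_ms += elapsed_ms
```

Both counters only ever grow, so they behave like the cumulative values described above.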
