Description
A transform provides a persistent view of data, either by pivoting it or by providing the latest state. In "continuous mode" this view is kept up to date.
However, a transform keeps adding new data, and you might want to age out old data or remove data from the persistent view based on other criteria. The lack of functionality is most visible for latest, where you might want to delete entities that haven't been seen for a long period. For example, if you transform host information, you might want to remove decommissioned hosts.
Overall integration
Retention will be part of the overall transform configuration:
{
  "source": { ... },
  "dest": { ... },
  "pivot": { ... },
  OR
  "latest": { ... },
  "retention_policy": {
    "name": {...}
  }
}
Therefore, retention_policy will be available for both pivot and latest.
Nesting the policy one level down, under its type name, gives us an extension point for later. The first retention_policy to be implemented is time.
Time based retention
"retention_policy": {
  "time": {
    "field": "@timestamp",
    "max_age": "30d"
  }
}
This policy requires you to configure a timestamp field (likely the same field as used for sync) and a max_age. Data that is older than max_age is considered outdated and will be removed as part of checkpointing.
Retention integration into checkpoints
Retention will be implemented as the last step of checkpointing. When a checkpoint completes, data that should be deleted as defined by the policy is removed from the destination index. The retention cutoff is calculated based on the checkpoint time.
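To make the cutoff calculation concrete, here is a minimal sketch of how a time policy could be turned into a delete query at the end of a checkpoint. It is only an illustration: the helper names `parse_max_age` and `retention_query` are hypothetical, the unit parser handles just a small subset of Elasticsearch time units, and using a range query for the deletion is an assumption, not something this proposal specifies.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helper: only a subset of time units is handled here.
_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}


def parse_max_age(value: str) -> timedelta:
    """Parse a value such as "30d" or "12h" into a timedelta."""
    match = re.fullmatch(r"(\d+)([dhms])", value)
    if match is None:
        raise ValueError(f"unsupported max_age: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def retention_query(field: str, max_age: str, checkpoint_time: datetime) -> dict:
    """Build a query body matching documents older than max_age.

    The cutoff is relative to the checkpoint time, not to wall-clock "now".
    """
    cutoff = checkpoint_time - parse_max_age(max_age)
    return {"query": {"range": {field: {"lt": cutoff.isoformat()}}}}


checkpoint = datetime(2021, 3, 31, 12, 0, tzinfo=timezone.utc)
body = retention_query("@timestamp", "30d", checkpoint)
print(body)
```

Anchoring the cutoff to the checkpoint time rather than the current time keeps the result deterministic for a given checkpoint, regardless of how long the checkpoint took to run.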
Retention policy updating
Updating the retention policy is supported by _update. If _update is called on a running transform, the update takes effect when the next checkpoint starts. The checkpoint that is currently running keeps using the policy it started with.
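The update semantics can be sketched as a small state holder. This is a toy model, not the transform implementation: the class `TransformState` and its methods are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransformState:
    """Toy model of how a policy update is deferred to the next checkpoint."""

    active_policy: Optional[dict]          # policy used by the running checkpoint
    pending_policy: Optional[dict] = None  # policy set by an update, not yet active
    has_pending: bool = False              # separates "no update" from "update to no policy"

    def update(self, new_policy: Optional[dict]) -> None:
        # An update never touches the policy of the in-flight checkpoint.
        self.pending_policy = new_policy
        self.has_pending = True

    def start_checkpoint(self) -> Optional[dict]:
        # A pending update becomes effective only at a checkpoint boundary.
        if self.has_pending:
            self.active_policy = self.pending_policy
            self.pending_policy = None
            self.has_pending = False
        return self.active_policy


state = TransformState({"time": {"field": "@timestamp", "max_age": "30d"}})
state.start_checkpoint()
state.update({"time": {"field": "@timestamp", "max_age": "7d"}})
in_flight = state.active_policy          # the running checkpoint still sees 30d
next_policy = state.start_checkpoint()   # the next checkpoint picks up 7d
```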
FYI: @elastic/ml-ui it would be good to support retention policy in the update fly-out
Retention policy stats
To measure the effect of the retention policy, we add two counters to _stats:
documents_deleted
Total number of documents deleted in the transform destination index by this transform.
delete_time_in_ms
Cumulative time, in milliseconds, spent deleting documents in the transform destination index by this transform.
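As an illustration of how the two counters accumulate, the sketch below adds up the result of each delete pass. The `RetentionStats` class and its `record` method are hypothetical, and it assumes every delete pass reports a deleted-document count and a duration in milliseconds.

```python
from dataclasses import dataclass


@dataclass
class RetentionStats:
    """Running totals matching the two proposed _stats counters."""

    documents_deleted: int = 0
    delete_time_in_ms: int = 0

    def record(self, deleted: int, took_ms: int) -> None:
        # Both counters are cumulative over the lifetime of the transform.
        self.documents_deleted += deleted
        self.delete_time_in_ms += took_ms


stats = RetentionStats()
stats.record(deleted=120, took_ms=35)  # first checkpoint
stats.record(deleted=0, took_ms=4)     # nothing was outdated this time
print(stats)
```

Note that a pass that deletes nothing still adds to delete_time_in_ms in this model, since the time spent looking for outdated documents counts as delete time.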