[ML-Dataframe] Job configuration storage design discussion for dataframe builder #33952

@hendrikmuhs

Design discussion for ML dataframe builder https://github.com/elastic/elasticsearch/tree/feature/fib

Work in progress.

Intro

A dataframe builder job uses a persistent task that can be created, deleted, started, and stopped. This issue discusses ways to store the configuration of a dataframe builder job.

Configuration payload

A configuration object is supposed to be small. It contains fields such as the source and destination index and a list of aggregation configurations (which makes parsing tightly coupled to aggregations). More options will likely be added in the future, but the object will remain relatively small.
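To make the payload concrete, here is a minimal sketch of what such a configuration could look like. It is written in Python for brevity rather than in Elasticsearch's Java, and every field name (`job_id`, `source_index`, `destination_index`, `aggregations`) is an assumption for illustration, not the actual format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DataframeJobConfig:
    """Hypothetical config shape; all field names are illustrative only."""
    job_id: str
    source_index: str
    destination_index: str
    # Aggregation configs are kept as plain dicts here. In the real feature
    # they would be parsed by the aggregation framework, which is where the
    # tight coupling mentioned above comes from.
    aggregations: list = field(default_factory=list)


config = DataframeJobConfig(
    job_id="sales-by-customer",
    source_index="sales",
    destination_index="sales-pivot",
    aggregations=[{"avg_price": {"avg": {"field": "price"}}}],
)

# The serialized form is what would have to be stored somewhere.
serialized = json.dumps(asdict(config))
```

Even with a handful of aggregations the serialized object stays small, so the storage question is less about size than about what happens when the nested aggregation part can no longer be parsed.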

Problem

The biggest risk is breaking backwards compatibility and/or creating a roadblock for deprecating aggregations.

Solutions

A Do nothing - use cluster state

Use the persistent task to store the configuration, which means the config is stored in the cluster state and needs to be read back on cluster start.

pro:

  • simplest

con:

  • if the configuration becomes invalid, parsing breaks and the node does not start

B Careful parsing

Same as A, but wrap the parsing code with error handling to not cause total failure.

pro:

  • does not fail if the configuration becomes invalid
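A rough sketch of the careful-parsing idea, in Python rather than Java and with hypothetical names throughout (`parse_config`, `load_all_carefully`, and the required keys are assumptions): a strict parser is wrapped so that one broken config is recorded instead of aborting the whole load.

```python
import json


def parse_config(raw: str) -> dict:
    """Strict parser: raises ValueError on malformed or incomplete configs."""
    config = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    for key in ("source", "dest", "aggregations"):  # hypothetical required keys
        if key not in config:
            raise ValueError(f"missing required field: {key}")
    return config


def load_all_carefully(stored: dict) -> tuple:
    """Parse every stored config, collecting failures instead of propagating them."""
    valid, broken = {}, {}
    for job_id, raw in stored.items():
        try:
            valid[job_id] = parse_config(raw)
        except ValueError as err:
            # The job is marked as broken; startup continues for everything else.
            broken[job_id] = str(err)
    return valid, broken
```

With this shape, an invalid config degrades a single job rather than the node, but the parsing work still happens at startup.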

C Deferred parsing

Store the config as a blob to avoid parsing at startup, and parse it only when the persistent job starts.

pro:

  • does not fail if the configuration becomes invalid
  • does not slow down startup

con:

  • ugly blob in cluster state
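Deferred parsing might look roughly like the following sketch. It is Python, not the Java of the actual code base, and `DeferredConfigTask` with its state strings is a hypothetical stand-in for a persistent task: the config is held as opaque bytes and only parsed when the job is started.

```python
import json


class DeferredConfigTask:
    """Hypothetical task that keeps its config as an unparsed blob."""

    def __init__(self, job_id: str, raw_config: bytes):
        self.job_id = job_id
        self.raw_config = raw_config  # opaque blob, never inspected at load time
        self.config = None
        self.state = "stopped"

    def start(self) -> None:
        try:
            # Parsing happens here, at job start, not at cluster start.
            self.config = json.loads(self.raw_config)
        except ValueError as err:
            self.state = "failed"
            raise RuntimeError(f"job {self.job_id}: invalid config") from err
        self.state = "started"
```

Constructing the task never touches the blob, so an invalid config only surfaces as a failed start of that one job.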

D Store configuration in an index, keep only IDs in cluster state

Store only the (unique) job ID in the cluster state and store the configs in a separate private index.

pro:

  • does not fail if the configuration becomes invalid
  • does not slow down startup
  • does not pollute cluster state (size)

con:

  • extra logic, error handling
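The split between cluster state and index could be sketched as below. This is a Python illustration with in-memory stand-ins, not Elasticsearch code: `ConfigIndex`, `ClusterState`, `create_job`, and `start_job` are all hypothetical names, and the dict plays the role of the private index.

```python
import json


class ConfigIndex:
    """In-memory stand-in for a private index, keyed by job ID."""

    def __init__(self):
        self._docs = {}

    def put(self, job_id: str, config: dict) -> None:
        self._docs[job_id] = json.dumps(config)

    def get(self, job_id: str):
        raw = self._docs.get(job_id)
        return None if raw is None else json.loads(raw)


class ClusterState:
    """Stand-in for cluster state: holds job IDs only, never the configs."""

    def __init__(self):
        self.job_ids = set()


def create_job(state: ClusterState, index: ConfigIndex, job_id: str, config: dict) -> None:
    if job_id in state.job_ids:
        raise ValueError(f"job {job_id} already exists")
    index.put(job_id, config)  # write the document first, then register the ID
    state.job_ids.add(job_id)


def start_job(state: ClusterState, index: ConfigIndex, job_id: str) -> dict:
    if job_id not in state.job_ids:
        raise KeyError(job_id)
    config = index.get(job_id)
    if config is None:
        # Part of the "extra logic, error handling": the ID and the document can drift apart.
        raise RuntimeError(f"job {job_id}: config document missing from index")
    return config
```

The missing-document branch in `start_job` is an example of the additional failure mode this option introduces: the cluster state and the index are two stores that have to be kept consistent.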
