Add better support for metric data types (TSDB) #74660

@imotov

Phase 0 - Inception

  • Obtain schemas annotated with dimensions and metrics from the Metrics team (small) @nik9000
  • Prototyping Lucene Data Pull Mechanism (medium) @imotov
  • Prototyping Data Pull Mechanism in Elasticsearch @imotov

Phase 1 - Mappings

Phase 2 - Ingest

Phase 2.1 Ingest follow ups

- [ ] Build the _id from dimension values
- [ ] Investigate moving timestamp to the front of the _id to automatically get an optimization on _id searches. Not sure if worth it - but possible. #84928 could be an alternative

  • Bring back something in the spirit of the append-only optimization but that works for tsdb. That would greatly improve write performance. "Extract append-only optimization from Engine" #84771 is a partial prototype.
  • We store the _id in Lucene stored fields. We could regenerate it from the _source or from doc values for the @timestamp and the _tsid. That'd save some bytes per document.
  • Move IndexRequest#autoGenerateId? It's a bit spooky where it is, but I don't like it in any other place.
  • Improve error messages in _update_by_query when modifying the dimensions or @timestamp
  • On translog replay, on recovery, and on replicas we regenerate the _id and assert that it matches the _id from the primary. Should we? Probably. Let's make sure.
  • Add tsdb benchmarks to the nightlies
    - [ ] Document best practices for using dimensions-based ID generator including how to use this with component templates
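
The idea of building the _id from dimension values and the @timestamp can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the helper name `build_id`, the choice of SHA-256, the 16-byte truncation, and the byte layout are hypothetical and are not the encoding Elasticsearch uses.

```python
import base64
import hashlib
import struct


def build_id(dimensions: dict[str, str], timestamp_millis: int) -> str:
    """Hypothetical helper: derive a deterministic _id from dimensions and @timestamp."""
    hasher = hashlib.sha256()
    # Sort by dimension name so the same set of dimensions always hashes
    # identically, regardless of field order in the incoming document.
    for name, value in sorted(dimensions.items()):
        for part in (name, value):
            encoded = part.encode("utf-8")
            # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
            hasher.update(struct.pack(">I", len(encoded)))
            hasher.update(encoded)
    # Assumed layout: 16 bytes of dimension hash, then an 8-byte big-endian timestamp.
    raw = hasher.digest()[:16] + struct.pack(">q", timestamp_millis)
    # URL-safe base64 without padding keeps the _id compact and printable.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


doc_id = build_id({"host": "web-01", "pod": "a1"}, 1_650_000_000_000)
```

Because the _id depends only on the dimensions and the timestamp, re-indexing the same measurement yields the same _id, which is what would allow a replica or a translog replay to regenerate it and compare it with the primary's value.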

Phase 3.1 QL storage API (Postponed)

Phase 3.2 - Search MVP

Plans for time series support in the _search API are superseded by plans for this in ES|QL.

Phase 3.3 - Rollup / Downsampling

Phase 3.4 - TSID aggs (superseded by tsdb in ES|QL)

~~ - [ ] Update min, max, sum, avg pipeline aggs for intermediate result filtering optimization ~~
~~ - [ ] Sliding window aggregation ~~
~~ - [ ] A way to filter to windows within the sliding window. Like "measurements taken in the last 30 seconds of the window". ~~
~~ - [ ] Open transform issue for newly added time series aggs ~~
~~ - [ ] Benchmarks for the tsid agg ~~

Phase 3.5 - Downsampling follow ups

  • Handling histograms
  • SQL support for downsampling

Phase 4.0 - Compression

Phase 5.0 - Follow-ups and Nice-to-haves

  • Default the setting's value to all of the keyword dimensions
  • Support shard splitting on time_series indices
  • Make an object or interface for _id's values. Right now it's a String that we encode with Uid.encodeId. That was reasonable, and maybe it still is. But it feels complex, especially for tsdb, whose _id is always some bytes. Encoding it also wastes a byte about 1/128 of the time. It's a common prefix byte, so this is probably not really an issue. This would be a big change, but it'd make ES easier to read. It probably wouldn't really improve storage though.
  • Figure out how to specify tsdb settings in component templates. For example, index.routing_path can be specified in a composable index template if the data stream template's index_mode is set to time_series. But if this setting is specified in a component template, then the index.mode index setting must also be set. This feels backwards. @martijnvg
  • In order to retrieve the routing values (defined in index.routing_path), the source needs to be parsed on the coordinating node. However, if an ingest pipeline is executed, the source of the document is then parsed a second time. Ideally the routing values should be extracted when ingest is performed, similar to how the @timestamp field is already retrieved from a document during pipeline execution.
  • In order to determine the backing index a document should be routed to, the timestamp is parsed into an Instant. The format being used is: strict_date_optional_time_nanos||strict_date_optional_time||epoch_millis. This allows the regular date format, the date nanos format, and epoch milliseconds given as a string. We can optimise date parsing if we know the exact format being used. For example, if the data stream had a parameter that indicates the exact date format, we could optimise parsing by using only strict_date_optional_time_nanos, strict_date_optional_time or epoch_millis.
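
The date-parsing point in the last item can be sketched roughly as follows. This is a simplified illustration, not Elasticsearch's parser: the function names, the `parser_for` lookup, and the regular expression are hypothetical, and the ISO pattern accepts only a subset of what strict_date_optional_time_nanos allows. It contrasts a fallback chain, which may attempt several formats per document, with a parser chosen once from a known format.

```python
import re
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Simplified ISO-8601 pattern: date, optional time, optional 1-9 fraction digits.
ISO_PATTERN = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})"
    r"(?:T(\d{2}):(\d{2}):(\d{2})(?:\.(\d{1,9}))?(Z|[+-]\d{2}:\d{2})?)?$"
)


def parse_iso_nanos(value: str) -> int:
    """Parse an ISO-like timestamp into nanoseconds since the epoch."""
    m = ISO_PATTERN.match(value)
    if m is None:
        raise ValueError(f"not an ISO timestamp: {value!r}")
    hour, minute, second = (int(m[i]) if m[i] else 0 for i in (4, 5, 6))
    offset = m[8]
    if offset in (None, "Z"):
        tz = timezone.utc
    else:
        sign = 1 if offset[0] == "+" else -1
        tz = timezone(sign * timedelta(hours=int(offset[1:3]), minutes=int(offset[4:6])))
    whole = datetime(int(m[1]), int(m[2]), int(m[3]), hour, minute, second, tzinfo=tz)
    seconds = (whole - EPOCH) // timedelta(seconds=1)
    # Pad the fraction to nine digits so ".5" means 500,000,000 ns.
    return seconds * 1_000_000_000 + int((m[7] or "").ljust(9, "0"))


def parse_epoch_millis(value: str) -> int:
    """Parse epoch milliseconds given as a string into nanoseconds."""
    return int(value) * 1_000_000


def parse_with_fallback(value: str) -> int:
    """Try each format in turn, as a `a||b` style format chain would."""
    try:
        return parse_iso_nanos(value)
    except ValueError:
        return parse_epoch_millis(value)


def parser_for(known_format: str):
    """Hypothetical: pick one parser up front when the exact format is known."""
    return {"iso": parse_iso_nanos, "epoch_millis": parse_epoch_millis}[known_format]
```

With a known format, the parser is resolved once per data stream rather than per document, and a value in the last format of the chain no longer pays for the failed attempts before it.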
