[Transform] Align transform checkpoint range with date_histogram interval for better performance #74004
48634ef to 944fea1
6a65d22 to dc3a876
Pinging @elastic/ml-core (Team:ML)
hendrikmuhs
left a comment
We should make this an optional feature and add a setting (hello, naming discussion ;-) ). The default should be to write incomplete/interim buckets as we do today. We might identify some cases where we always want to align to bucket boundaries, e.g. if frequency and bucket boundaries are small (e.g. for fixed intervals of less than 60s it might make sense to align by default). To be discussed, we should give users some guidance about when this is useful.
IMO batch transforms should stay as they are, the assumption here is static data and if the user wants it bounded, he can specify a query.
.../src/main/java/org/elasticsearch/xpack/transform/checkpoint/TimeBasedCheckpointProvider.java
przemekwitek
left a comment
> To be discussed, we should give users some guidance about when this is useful.
I was under the impression that we want to free the user from the responsibility of enabling this optimization themselves.
Do you envision situations where this optimization is harmful for the user to the point that they want to disable it?
Or is it more of a safety measure so that the user has a way out if the optimization doesn't work for them for any unforeseen reason?
> IMO batch transforms should stay as they are, the assumption here is static data and if the user wants it bounded, he can specify a query.
Ok, makes sense.
.../src/main/java/org/elasticsearch/xpack/transform/checkpoint/TimeBasedCheckpointProvider.java
dc3a876 to dda0983
5bd2af5 to 6491904
...t-high-level/src/main/java/org/elasticsearch/client/transform/transforms/SettingsConfig.java
Done.
4846ae9 to 785a706
hendrikmuhs
left a comment
Added 2 comments / questions.
I have a suspicion:
The checkpoint provider is passed in as an argument to the indexer and it's also final. That means changing the setting via _update won't be applied; you have to stop and start the transform to force a reload of the checkpoint provider.
If I am right, this is an existing bug. However we could leave it aside for this PR, some re-factoring is probably required. Nevertheless I think we should fix it for this release cycle.

What's the reason for requiring the top grouping?

I was under the impression that the date histogram grouping must be the first (top) one specified in order to split all data into time buckets properly.
If it's not first, then other groupings that come first (e.g. terms) will mix up the date histogram buckets.
Does it make sense?

Ok, never mind.
I've changed it so that it searches for the first date histogram source that matches on field name.
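As a rough illustration of the lookup described above (a sketch, not the code from this PR): the `GroupSource` class, the `findAligningGroup` method, and the idea of matching against a single time field name are hypothetical simplifications. It assumes the `group_by` entries are kept in declaration order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; names and types are illustrative, not the real transform classes.
final class DateHistogramLookup {

    // Simplified stand-in for one entry of the group_by section.
    static final class GroupSource {
        final String type;  // e.g. "terms" or "date_histogram"
        final String field; // the document field the grouping works on

        GroupSource(String type, String field) {
            this.type = type;
            this.field = field;
        }
    }

    // Returns the name of the first date_histogram group (in declaration order)
    // whose field equals timeField, or empty if there is none.
    static Optional<String> findAligningGroup(Map<String, GroupSource> groupBy, String timeField) {
        return groupBy.entrySet().stream()
            .filter(e -> "date_histogram".equals(e.getValue().type))
            .filter(e -> timeField.equals(e.getValue().field))
            .map(Map.Entry::getKey)
            .findFirst();
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves the order in which groups were declared.
        Map<String, GroupSource> groupBy = new LinkedHashMap<>();
        groupBy.put("by_host", new GroupSource("terms", "host"));
        groupBy.put("by_hour", new GroupSource("date_histogram", "@timestamp"));
        System.out.println(findAligningGroup(groupBy, "@timestamp"));
    }
}
```

With this shape, the date histogram no longer has to be the top grouping: a `terms` group declared first is simply skipped.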
.../src/main/java/org/elasticsearch/xpack/transform/checkpoint/TimeBasedCheckpointProvider.java
785a706 to c8c34d7
davidkyle
left a comment
When I read the PR description "Align transform checkpoint range with date_histogram interval for better performance", I understand what that means, but I'm not sure what `interim_results` means in this context. Perhaps `write_partial | incomplete | interim_buckets` or `align_checkpoints`, or just apply the optimisation automatically and remove the config option.
...t-high-level/src/main/java/org/elasticsearch/client/transform/transforms/SettingsConfig.java
...t-high-level/src/main/java/org/elasticsearch/client/transform/transforms/SettingsConfig.java
5a605ea to 721d6cf
przemekwitek
left a comment
> When I read the PR description "Align transform checkpoint range with date_histogram interval for better performance" I understand what that means but I'm not sure what `interim_results` means in this context. Perhaps `write_partial | incomplete | interim_buckets` or `align_checkpoints`
`interim_results` is the setting name we came up with during the discussion, although I agree it is not perfect. We can of course revisit that.
> or just make apply the optimisation automatically and remove the config option.
This is the option we discarded, on the assumption that there will be users who don't want the optimization for some reason.
...t-high-level/src/main/java/org/elasticsearch/client/transform/transforms/SettingsConfig.java
...t-high-level/src/main/java/org/elasticsearch/client/transform/transforms/SettingsConfig.java
hendrikmuhs
left a comment
LGTM
4cc0747 to 2d788ae
run elasticsearch-ci/bwc
4cca62f to a575b77
I'm merging this with |
* master: (868 commits)
  Query API key - Rest spec and yaml tests (elastic#76238)
  Delay shard reassignment from nodes which are known to be restarting (elastic#75606)
  Reenable bwc tests for elastic#76475 (elastic#76576)
  Set version to 7.15 in BWC code (elastic#76577)
  Don't remove warning headers on all failure (elastic#76434)
  Disable bwc tests for elastic#76475 (elastic#76541)
  Re-enable bwc tests (elastic#76567)
  Keep track of data recovered from snapshots in RecoveryState (elastic#76499)
  [Transform] Align transform checkpoint range with date_histogram interval for better performance (elastic#74004)
  EQL: Remove "wildcard" function (elastic#76099)
  Fix 'accept' and 'content_type' fields for search_mvt API
  Add persistent licensed feature tracking (elastic#76476)
  Add system data streams to feature state snapshots (elastic#75902)
  fix the error message for instance methods that don't exist (elastic#76512)
  ILM: Add validation of the number_of_shards parameter in Shrink Action of ILM (elastic#74219)
  remove dashboard only reserved role (elastic#76507)
  Fix Stack Overflow in UnassignedInfo in Corner Case (elastic#76480)
  Add (Extended)KeyUsage KeyUsage, CipherSuite & Protocol to SSL diagnostics (elastic#65634)
  Add recovery from snapshot to tests (elastic#76535)
  Reenable BwC Tests after elastic#76532 (elastic#76534)
  ...
This PR makes the checkpoint boundaries aligned with top-level `date_histogram` bucket boundaries.
The optimization is applied only when the transform config has `date_histogram` as the first group in the `group_by` list.
Relates #62746
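
To make the alignment idea concrete, here is a minimal sketch (not the implementation from this PR). The class `CheckpointAlignment` and the method `alignDown` are hypothetical names. It assumes a fixed-interval `date_histogram` whose buckets start at multiples of the interval since the epoch (UTC, no offset); calendar intervals would need different rounding.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration only; not the code from the PR.
final class CheckpointAlignment {

    // Rounds a checkpoint upper bound down to the start of the fixed-interval
    // bucket that contains it, so the checkpoint range never ends mid-bucket.
    static long alignDown(long timestampMillis, long intervalMillis) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        // floorMod keeps the rounding direction correct for pre-epoch timestamps.
        return timestampMillis - Math.floorMod(timestampMillis, intervalMillis);
    }

    public static void main(String[] args) {
        long interval = Duration.ofHours(1).toMillis();
        Instant now = Instant.parse("2021-06-11T10:47:23Z");
        Instant aligned = Instant.ofEpochMilli(alignDown(now.toEpochMilli(), interval));
        System.out.println(now + " -> " + aligned);
    }
}
```

With an aligned upper bound, each checkpoint covers only complete buckets, so a bucket is written once instead of being rewritten as an interim result on later checkpoints.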