Script: Time series compile and cache evict metrics #79078
Conversation
Collects compilation and cache eviction metrics for each script context. Metrics are available in _nodes/stats in 5m/15m/1d buckets. Refs: elastic#62899
Pinging @elastic/es-core-infra (Team:Core/Infra)
jdconrad
left a comment
Just some initial comments. I need more time to process the time series logic.
Several review threads on server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java (outdated; resolved).
@elasticmachine update branch
Re: 4 above. We'll start with the simplest implementation, which is to triplicate the write. We can always decide to increase implementation complexity later to avoid writing multiple times.
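A minimal sketch of what "triplicating the write" means: each event is written once into an independent counter per reporting window. This is not the PR's actual code; the class and field names here are illustrative, and `LongAdder` stands in for the real per-window time-series counters.

```java
import java.util.concurrent.atomic.LongAdder;

public class TripleWrite {
    // One counter per reporting window (illustrative stand-ins).
    private final LongAdder fiveMinutes = new LongAdder();
    private final LongAdder fifteenMinutes = new LongAdder();
    private final LongAdder twentyFourHours = new LongAdder();

    // The simple approach: one event costs three writes, one per window.
    public void inc() {
        fiveMinutes.increment();
        fifteenMinutes.increment();
        twentyFourHours.increment();
    }

    // Totals in window order: 5m, 15m, 24h.
    public long[] totals() {
        return new long[] { fiveMinutes.sum(), fifteenMinutes.sum(), twentyFourHours.sum() };
    }
}
```

The trade-off the comment describes: a cleverer layout could derive the larger windows from the smallest one and write only once, at the cost of more complex read-side logic.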
Updated the PR based on the feedback above. The following concepts are used:
- There are ASCII diagrams representing increment within a bucket, rolling over to a new bucket, skipping buckets, and moving to a new epoch.
- The API only exposes 24h/15m/5m.
- The active bucket is tracked via
- The internal implementation is called
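To make the bucket/epoch vocabulary above concrete, here is a deliberately simplified ring-buffer counter. It is an assumption-laden sketch, not the PR's `TimeSeriesCounter`: in particular, on any epoch change it clears every bucket, whereas a real implementation would only clear the buckets that were skipped so counts can carry across the rollover.

```java
import java.util.Arrays;

public class RingCounter {
    private final long bucketDuration;   // time covered by one bucket
    private final long[] buckets;        // circular buffer of per-bucket counts
    private long startOfEpoch = -1;      // aligned start time of the current epoch

    public RingCounter(int numBuckets, long bucketDuration) {
        this.buckets = new long[numBuckets];
        this.bucketDuration = bucketDuration;
    }

    // Total time covered by one full pass over the buckets.
    private long duration() { return buckets.length * bucketDuration; }

    public void inc(long t) {
        // Align t down to an epoch boundary via integer division.
        long epoch = (t / duration()) * duration();
        if (epoch != startOfEpoch) {
            // Simplification: treat any epoch change as a fresh start.
            Arrays.fill(buckets, 0);
            startOfEpoch = epoch;
        }
        int bucket = (int) ((t - startOfEpoch) / bucketDuration);
        buckets[bucket]++;
    }

    public long sum() {
        long total = 0;
        for (long b : buckets) total += b;
        return total;
    }
}
```

With 5 buckets of duration 20 (total duration 100), an increment at t=235 lands in the epoch starting at 200, bucket 1.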
colings86
left a comment
@stu-elastic I left some comments. Additionally, could we add documentation to this PR so we have documentation explaining these stats to users?
    import java.io.IOException;
    import java.util.Objects;

    public class TimeSeries implements Writeable, ToXContentFragment {
nit: Could we add a javadoc here explaining that this is the response object and that the metrics are collected by TimeSeriesCounter. This avoids a couple of "find usages" calls in the IDE to link the two if you don't know.
Added
    }

    /**
     * The total number of events for all time covered gby the counters.
Suggested change:
    - * The total number of events for all time covered gby the counters.
    + * The total number of events for all time covered by the counters.
Fixed
    * 300[c]-> 320[f]
    *
    * [a] Beginning of the current epoch
    * startOfEpoch = 200 = (t / duration) * duration = (235 / 100) * 100
nit: can we avoid spacing this out so much so it's easier to read? (applies to below ones too)
Removed excess spacing.
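The `startOfEpoch` formula in the quoted comment relies on integer (floor) division: `(235 / 100) * 100` truncates to `2 * 100 = 200`, aligning the timestamp down to an epoch boundary. A tiny standalone illustration (hypothetical helper name):

```java
public class EpochMath {
    // Align a timestamp down to the start of its epoch; with long operands,
    // t / duration truncates toward zero, so this floors for non-negative t.
    static long startOfEpoch(long t, long duration) {
        return (t / duration) * duration;
    }
}
```

For example, `startOfEpoch(235, 100)` is 200, and a timestamp exactly on a boundary, such as `startOfEpoch(300, 100)`, maps to itself.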
    adder.increment();
    lock.writeLock().lock();
    try {
        if (t < twentyFourHours.earliestTimeInCounter()) {
It's not clear to me why we would ever expect this to happen. Since this counter is always called with now(), I would have thought t would never go backwards between successive calls, and even if there is a race condition, I would not expect it to go back so far.
If the above is true, then I'm not sure the right behaviour is to trash all the current stats and start again if we get an increment from a "long" time in the past. Erroring probably isn't a good option either, since we don't want to stop the compilation or execution of the script (I think? though it might be worth an assert so we catch it in tests), but maybe we should just not increment if this happens?
We're using ThreadPool.absoluteTimeInMillis(). I'll switch to ThreadPool.relativeTimeInMillis.
If we have a very large odd update we have three options:
A) Increment the current bucket assuming it will catch back up soon
B) Ignore the update assuming it will catch back up soon
C) Clear the bucket assuming this is the "new normal"
A & B are better with temporary blips, C is good if there's a one-time adjustment but bad if the weird adjustments keep happening.
I chose C to avoid the odd "getting stuck" possibility.
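Option C above, sketched in isolation (all names here are hypothetical; only the idea of resetting when time jumps far backwards comes from the discussion):

```java
public class BackwardsTimeCounter {
    private long earliest;  // earliest time still covered by the counter
    private long count;

    public BackwardsTimeCounter(long start) {
        this.earliest = start;
    }

    // Option C: if t is before everything the counter covers, treat the
    // clock adjustment as the "new normal": discard stats and restart at t.
    public void inc(long t) {
        if (t < earliest) {
            count = 0;
            earliest = t;
        }
        count++;
    }

    public long count() { return count; }
}
```

The "getting stuck" risk the comment mentions is what options A and B share: if the clock permanently moved backwards, an ignore-or-absorb strategy would keep misattributing events to a window that never advances.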
    */
    public long sum(long end) {
        long start = end - duration;
        if (start >= nextBucketStartTime() || start < 0) {
I'm not sure I follow the thinking on returning 0 if start < 0 here? Below we are saying we will emit incomplete buckets if the start is before the earliest time in the counter so I'm not sure why start < 0 is different?
This was to avoid issues with math on negative time values. In the current version, TimeSeriesCounter.now() ensures time is never negative.
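The clamping described here could be as simple as the following (a hypothetical stand-in; the real `TimeSeriesCounter.now()` wraps the thread pool's clock):

```java
public class ClampedClock {
    // Never return a negative time, so window math like (end - duration)
    // and the start < 0 guard in sum() stay well-defined.
    static long now(long rawMillis) {
        return Math.max(0, rawMillis);
    }
}
```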
    public void testOnePerSecond() {
        long time = now;
        long t;
        long next = randomLongBetween(1, HOUR);
Can we rename this to something like nextAssertCheck so its easier to see that this is just controlling when we run the asserts?
Done.
Two review threads on server/src/test/java/org/elasticsearch/script/TimeSeriesCounterTests.java (resolved).
After a chat with @colings86, here are the next steps:
@jdconrad and @colings86 I've addressed all outstanding comments. Please re-review.
colings86
left a comment
LGTM
jdconrad
left a comment
Thanks for walking me through it again! Changed my review to LGTM.