@stu-elastic
Contributor

Collects compilation and cache eviction metrics for
each script context.

Metrics are available in _nodes/stats in 5m/15m/1d
buckets.

Refs: #62899

@stu-elastic stu-elastic added >feature v7.16.0 >enhancement :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache and removed >feature labels Oct 13, 2021
@stu-elastic stu-elastic marked this pull request as ready for review October 13, 2021 15:02
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Oct 13, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@rjernst rjernst removed their assignment Oct 13, 2021
@rjernst rjernst self-requested a review October 13, 2021 15:07
Contributor

@jdconrad jdconrad left a comment

Just some initial comments. I need more time to process the time series logic.

@stu-elastic
Contributor Author

@elasticmachine update branch

@stu-elastic
Contributor Author

Re: 4 above. We'll start with the simplest implementation, which is to triplicate the write: once to 5m, once to 15m, and once to 24h. This performs worse than writing once, but writing once costs implementation complexity in the current version.

We can always decide to increase implementation complexity later to avoid writing multiple times.
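
For illustration, a minimal sketch of the "triplicate the write" idea. `SimpleWindow` and `TriplicateCounter` are hypothetical names, and the window here stores raw event times (in seconds) rather than buckets purely to keep the sketch short; it is not the PR's implementation.

```java
import java.util.ArrayDeque;

/** Hypothetical sliding window that remembers event times (seconds) instead of buckets. */
final class SimpleWindow {
    private final long durationSeconds;
    private final ArrayDeque<Long> events = new ArrayDeque<>();

    SimpleWindow(long durationSeconds) {
        this.durationSeconds = durationSeconds;
    }

    void inc(long now) {
        events.addLast(now);
    }

    /** Number of events in the window (now - duration, now]. */
    long count(long now) {
        while (!events.isEmpty() && events.peekFirst() <= now - durationSeconds) {
            events.removeFirst(); // drop events that fell out of the window
        }
        return events.size();
    }
}

/** Every increment is written three times, once per window. */
final class TriplicateCounter {
    final SimpleWindow fiveMinutes = new SimpleWindow(5 * 60);
    final SimpleWindow fifteenMinutes = new SimpleWindow(15 * 60);
    final SimpleWindow twentyFourHours = new SimpleWindow(24 * 60 * 60);

    void inc(long now) {
        fiveMinutes.inc(now);
        fifteenMinutes.inc(now);
        twentyFourHours.inc(now);
    }
}
```

The cost is three writes per event; the benefit is that the three windows share no state, so each can be reasoned about and tested on its own.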

@stu-elastic stu-elastic added v8.1.0 and removed v8.0.0 labels Oct 28, 2021
@stu-elastic
Contributor Author

stu-elastic commented Oct 28, 2021

Updated the PR based on the feedback above.

Simplify and clarify terminology.

The following concepts are used: bucket, epoch, earliestTimeInCounter, counterExpired, nextBucketStartTime.

Overall I think the structure needs to be better described, perhaps with ASCII diagrams to explain the relationship of one array being a zoomed-in version of one bucket from another array, and what the rollover and skew behaviors are.

There are ASCII diagrams representing an increment within a bucket, rolling over to a new bucket, skipping buckets, and moving to a new epoch.

Remove flexibility. While it may be that we want to reuse this counter in the future for other cases, right now we have a very specific use case, 24h/15m/5m. I think we should tailor the implementation to work with these constraints. It will simplify the implementation (and testing necessary!).

The API only exposes 24h/15m/5m.

Simplify the implementation. Recalculating the current index is unnecessary if we keep track of our current active bucket within each array. Having the current index has a bunch of advantages...

The active bucket is tracked via curBucket.

Consider using an array per metric. While the high/low precision is interesting, it makes the implementation more difficult to understand. I think it would be more straightforward to have the arrays named based on the metric they are keeping track of. Each of these could even be generalized to a tiny implementation class that wraps the array and current bucket index. This way a lot of these utilities could be implemented directly on this class, like summing the array and advancing the bucket. They can have a "parent" as well, where rolling over propagates to the current bucket of the parent.

The internal implementation is called Counter. Per the discussion on my comment above, this PR does not implement a parent bucket, to keep the implementation as simple as possible.
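
To make the "wrap the array and the current bucket index" idea concrete, here is a rough sketch. `BucketCounter`, `curBucketStart`, and `total()` are hypothetical names for this example only (only `curBucket` is named in the thread), and the rollover rules are an assumption rather than a copy of what `Counter` does.

```java
/** Hypothetical fixed-resolution counter that tracks its active bucket instead of recomputing it. */
final class BucketCounter {
    private final long resolution; // time units covered by one bucket
    private final long[] buckets;
    private int curBucket;         // index of the active bucket
    private long curBucketStart;   // start time of the active bucket

    BucketCounter(long resolution, int numBuckets, long now) {
        this.resolution = resolution;
        this.buckets = new long[numBuckets];
        this.curBucketStart = (now / resolution) * resolution;
        this.curBucket = (int) ((now / resolution) % numBuckets);
    }

    void inc(long now) {
        advanceTo(now);
        buckets[curBucket]++;
    }

    /** Moves the active bucket forward; times in the past are clamped to the active bucket. */
    private void advanceTo(long now) {
        if (now < curBucketStart + resolution) {
            return; // still in (or before) the active bucket
        }
        long steps = (now - curBucketStart) / resolution;
        long toClear = Math.min(steps, buckets.length);
        for (long i = 1; i <= toClear; i++) {
            buckets[(int) ((curBucket + i) % buckets.length)] = 0; // rolled-over and skipped buckets
        }
        curBucket = (int) ((curBucket + steps) % buckets.length);
        curBucketStart += steps * resolution;
    }

    /** Sum of all buckets as of the last increment. */
    long total() {
        long sum = 0;
        for (long b : buckets) {
            sum += b;
        }
        return sum;
    }
}
```

With three buckets of width 100 created at t=235, increments at 235, 250, and 300 give a total of 3; an increment at 520 skips a bucket and clears the oldest one, leaving 2.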

@stu-elastic stu-elastic requested a review from colings86 October 28, 2021 02:26
Contributor

@colings86 colings86 left a comment

@stu-elastic I left some comments. Additionally, could we add documentation to this PR so we have documentation explaining these stats to users?

import java.io.IOException;
import java.util.Objects;

public class TimeSeries implements Writeable, ToXContentFragment {
Contributor

nit: Could we add a javadoc here explaining that this is the response object and that the metrics are collected by TimeSeriesCounter? This avoids a couple of "find usages" calls in the IDE to link the two if you don't know.

Contributor Author

Added

}

/**
* The total number of events for all time covered gby the counters.
Contributor

Suggested change
* The total number of events for all time covered gby the counters.
* The total number of events for all time covered by the counters.

Contributor Author

Fixed

* 300[c]-> 320[f]
*
* [a] Beginning of the current epoch
* startOfEpoch = 200 = (t / duration) * duration = (235 / 100) * 100
Contributor

nit: can we avoid spacing this out so much so it's easier to read? (applies to below ones too)

Contributor Author

Removed excess spacing.
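
For readers without the full Javadoc, the epoch arithmetic quoted above can be checked with a few lines. `startOfEpoch` follows the formula in the comment; `bucketIndex` and the resolution of 20 are assumptions added for illustration and do not appear in the quoted snippet.

```java
/** Worked version of the epoch arithmetic from the quoted Javadoc. */
final class EpochMath {
    /** Integer division truncates, so this rounds t down to a multiple of duration (for t >= 0). */
    static long startOfEpoch(long t, long duration) {
        return (t / duration) * duration;
    }

    /** Assumed helper: which bucket of the epoch t falls into, given a bucket width. */
    static int bucketIndex(long t, long duration, long resolution) {
        return (int) ((t % duration) / resolution);
    }
}
```

For t = 235 and duration = 100 this gives startOfEpoch = 200, matching the comment; with an assumed bucket width of 20, t lands in bucket 1 of that epoch.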

adder.increment();
lock.writeLock().lock();
try {
if (t < twentyFourHours.earliestTimeInCounter()) {
Contributor

It's not clear to me why we would ever expect this to happen. Since this counter is always called with now(), I would have thought that t would never go backwards between successive calls, and even if there is a race condition, I would not expect it to go back so far.

If the above is true, then I'm not sure the right behaviour is to trash all the current stats and start again when we get an increment from a "long" time in the past. Erroring probably isn't a good option here either, since we don't want to stop the compilation or execution of the script (I think; though it might be worth an assert to ensure we catch it in tests). Maybe we should just not increment if this happens?

Contributor Author

We're using ThreadPool.absoluteTimeInMillis(). I'll switch to ThreadPool.relativeTimeInMillis().

If we get an update with a very large, unexpected time jump, we have three options:
A) Increment the current bucket, assuming it will catch back up soon
B) Ignore the update, assuming it will catch back up soon
C) Clear the bucket, assuming this is the "new normal"

A and B are better with temporary blips; C is good if there's a one-time adjustment but bad if the weird adjustments keep happening.

I chose C to avoid the odd "getting stuck" possibility.
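
The trade-off between the three options can be sketched as follows. `StalePolicy` and `StaleAwareCounter` are hypothetical names, and the counter deliberately keeps a single running count with no expiry so that only the policy decision is visible; it is not how the PR stores its data.

```java
/** The three options from the comment above. */
enum StalePolicy { INCREMENT_CURRENT, IGNORE, RESET }

/** Hypothetical counter that only demonstrates how a far-in-the-past time is handled. */
final class StaleAwareCounter {
    private final long duration;
    private final StalePolicy policy;
    private long latest; // most recent time seen; times are assumed non-negative
    private long count;

    StaleAwareCounter(long duration, StalePolicy policy) {
        this.duration = duration;
        this.policy = policy;
    }

    void inc(long t) {
        if (t < latest - duration) { // older than anything the counter still covers
            switch (policy) {
                case INCREMENT_CURRENT: // A: count it as if it happened now
                    count++;
                    return;
                case IGNORE:            // B: drop it
                    return;
                case RESET:             // C: treat t as the new normal
                    count = 1;
                    latest = t;
                    return;
            }
        }
        latest = Math.max(latest, t);
        count++;
    }

    long count() {
        return count;
    }
}
```

If the clock moves back permanently, IGNORE keeps dropping every later event until time catches up with `latest`, which is the "getting stuck" case; RESET loses history once and then counts normally.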

*/
public long sum(long end) {
long start = end - duration;
if (start >= nextBucketStartTime() || start < 0) {
Contributor

I'm not sure I follow the thinking on returning 0 if start < 0 here. Below, we say we will emit incomplete buckets if the start is before the earliest time in the counter, so I'm not sure why start < 0 is different.

Contributor Author

This was to avoid issues with math on negative time values. In the current version, TimeSeriesCounter.now() ensures time is never negative.
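
One plausible example of the "math on negative time values" problem (an assumption; the thread does not spell out which issue was meant) is that Java's `/` and `%` truncate toward zero, so the round-down idiom used for bucket boundaries stops rounding down once time is negative. `NegativeTimeMath` is a hypothetical helper for this illustration.

```java
/** Shows how truncating division behaves differently from floor division for negative times. */
final class NegativeTimeMath {
    /** The (t / duration) * duration idiom: rounds toward zero, not down. */
    static long truncatingStart(long t, long duration) {
        return (t / duration) * duration;
    }

    /** Floor-based alternative that keeps rounding down for negative t. */
    static long flooringStart(long t, long duration) {
        return Math.floorDiv(t, duration) * duration;
    }

    /** The % operator keeps the sign of t, so this can be negative and unusable as an array index. */
    static long offsetInEpoch(long t, long duration) {
        return t % duration;
    }
}
```

For t = -35 and duration = 100, the truncating form yields 0 where a true round-down would yield -100, and the offset is -35. Guaranteeing non-negative time up front avoids having to handle either case.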

public void testOnePerSecond() {
long time = now;
long t;
long next = randomLongBetween(1, HOUR);
Contributor

Can we rename this to something like nextAssertCheck so it's easier to see that this is just controlling when we run the asserts?

Contributor Author

Done.

@stu-elastic
Contributor Author

After a chat with @colings86, here are the next steps:

  • Move timeProvider into TimeSeriesCounter; this makes the public interface clear. TimeSeriesCounter.inc() and TimeSeriesCounter.timeSeries() no longer take a long.
  • In Counter, rename the parameter t of inc to now, and state in the Javadocs that the value of now is treated as metadata: the code assumes the increment happens "now" and uses the parameter to determine how to update the state in response to forward movements in time. Users of Counter should not dump a bunch of existing events in any order and expect a deterministic outcome.
  • TimeSeriesCounter will still reset all counters if it receives a time from timeProvider that is more than 24 hours in the past.
  • Counter will still clamp all events from the past through the current bucket time range to the current bucket.
  • Counter will not handle zero negative times.
  • TimeSeriesCounter will expect to receive System.currentTimeMillis() to avoid requiring Counter to handle negative times.
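
A rough sketch of the interface shape these steps describe, where the counter owns its clock and `inc()` takes no timestamp. `ClockOwningCounter` is a hypothetical stand-in for TimeSeriesCounter, the use of `LongSupplier` for `timeProvider` is an assumption, and the single running count replaces the real bucketed counters.

```java
import java.util.function.LongSupplier;

/** Hypothetical stand-in for TimeSeriesCounter; not the PR's implementation. */
final class ClockOwningCounter {
    private static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    private final LongSupplier timeProvider; // e.g. System::currentTimeMillis
    private long latest;
    private long count;

    ClockOwningCounter(LongSupplier timeProvider) {
        this.timeProvider = timeProvider;
        this.latest = timeProvider.getAsLong();
    }

    /** No timestamp parameter: the increment is assumed to happen "now". */
    synchronized void inc() {
        long now = timeProvider.getAsLong();
        if (now < latest - DAY_MILLIS) {
            count = 0; // clock reported a time more than 24 hours in the past: start over
            latest = now;
        } else {
            latest = Math.max(latest, now);
        }
        count++;
    }

    synchronized long count() {
        return count;
    }
}
```

In production the supplier would be `System::currentTimeMillis`; a test can pass a lambda over a mutable value to control time without touching the public methods.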

@stu-elastic
Contributor Author

@jdconrad and @colings86 I've addressed all outstanding comments. Please re-review.

Contributor

@colings86 colings86 left a comment

LGTM

Contributor

@jdconrad jdconrad left a comment

Thanks for walking me through it again! Changes LGTM.

Labels

:Core/Infra/Scripting Scripting abstractions, Painless, and Mustache >enhancement Team:Core/Infra Meta label for core/infra team v8.1.0

6 participants