-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add parsing for percentiles ranks #23974
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add parsing for percentiles ranks #23974
Conversation
cbuescher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tlrx thanks, I had a first look, I think it looks great given that this is already a very complicated aggregation to parse. I left a few comments, most are just considered to be questions or suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already mentioned somewhere else, but adding again for reference: I think we need to implement the same logic here as in InternalAggregation (either treat null/empty() different here or change this in InternalAggregation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, the type will be constant. I don't think we should make it a stor argument, instead use some constant like PercentileRanks.TYPE_NAME in getType() directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already mentioned above, can we use PercentileRanks.TYPE_NAME here directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe its just me disliking Tuple, but if we could keep two maps (e.g. percentiles, percentilesAsString) that would avoid a lot of the v1(), v2() usage that I find a bit confusing to read later in this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yuck! I think I see now why these response keys make using ObjectParser almost impossible. Nothing to do about this unfortunately, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, didn't know map.compute(), seems to be handy. I still wonder if this logic could be a bit simpler to read if we had two maps instead of Map<Double, Tuple<Double, String>>.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I see the use of the type in the ctor now. I'd still consider individual contants in ParsedHDRPercentileRanks and ParsedTDigestPercentileRanks and overwriting getType() separately as an alternative. Maybe just different tastes, nothing big.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, override getType would be more readable, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of providing the instance parser by overwriting this method in the subtests, can subtest simply provide their own xContentRegistry with the appropriate parser by overwriting getNamedWriteableRegistry() from AbstractWireSerializingTestCase? I might be missing something though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NamedWriteableRegistry and NamedXContentRegistry are two different beasts, so maybe you're suggestion to have an abstract method like getNamedXContentRegistry in each test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my bad. What I meant was overwriting xContentRegistry() from ESTestCase, I mixed that up. That way you can e.g. provide the parser that you need in the test without the need of instanceParser(). I don't mind either way, feel free to use it or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok, I see what you meant.
Yes, that's the way to go, I didn't think about overriding xContentRegistry() and that will help. I usd instanceParser() to mimic instanceReader() we already have, but you suggestion is better. I'll update the pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: reactivate randomization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woops, thanks :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was wrong here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing, it should not have been commited. (I used intellij 2017.1 and it has classpath issues with Gradle)
a0e083d to
f41909e
Compare
|
@cbuescher Thanks a lot for your review. I think I implemented almost all your comments except the one about tests. Please let me know what you think |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this can be provided by overwriting EsTestCase#xContentRegistry().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see. I didn't think about this method as I wanted to mimic the already existing instanceReader() one, but it makes sense.
I'll update the PR.
|
@tlrx thanks a lot, I left one small clarification, I wasn't clear on my part. The rest LGTM from my side. |
|
@cbuescher Thanks! I updated again to use the |
javanna
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some comments, looks great though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh oh what happened here? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intellij 2017.1 with gradle :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: call context name given that it's a simple string which holds the name. I would love not to need the cast here to String but I didn't find a way...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can use the setter that will be added by Christoph's PR once that gets merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are reusing InternalPercentile on purpose here I think. I wonder if it makes sense to rename that class later, as it differs quite a bit from the other internal classes and it can be shared between hl client and es core. Or maybe Percentile should be a class instead of an interface given that it has a single straightforward impl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also thought about it when I wrote this. I think it makes sense to use InternalPercentile here and I'm in favor of merging the implementation with its interface in core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this implement PercentileRanks ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally, I wonder where it has gone :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I saw this in Christoph's PR too. Hopefully you don't need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool! this means that we don't have to write unit tests for each single agg anymore, as this one is generic enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the idea, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we make it final then and remove InternalCardinalityTests#testFromXContent ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we use //norelease instead of TODOs here just to make sure? Though I don't think those fail the build anymore, so maybe it is the same after all. I am a bit paranoid on forgetting these. I would love to merge this branch back with no TODOs left.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if we should have a single final impl for this in the base test class that contains all of the aggs that we can parse. That would be useful later when we will parse nested aggs and bucket aggs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, but I think it is OK for now and should be easy to change later. I'm in favor of keeping it like this until we have many parsable aggregations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would insist on doing it now. Adding a line to an existing method for each agg is going to cost less than having to do it for all aggs afterwards. Any reason why we should postpone this? are we not sure it is the way to go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I wanted to change that in a follow up method to limit the set of changes of this PR but OK, let's do it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with a followup as well. but in the previous comments you said "when we have many parsable aggregations". I would do it earlier than that :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, was the idea to reuse this also for percentiles (InternalHDRPercentiles and InternalTDigestPercentiles)? That looks quite straight-forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was the idea. To be honest I have to give it a try but it should be doable
7a8589a to
4e7b1a8
Compare
|
@javanna @cbuescher Thanks for your reviews. I looked at it again and I found few issues in my previous code, so I updated it again. It now correctly implement the It also mutualize the parsing logic in a That would be awesome if you could have another look. |
cbuescher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update, the additions look good to me. I left a few suggestions and questions, this is really a tough one when it gets to the details but I think its almost there.
| if (percentile == null) { | ||
| return Double.NaN; | ||
| InternalPercentile getPercentile(double percent) { | ||
| for (InternalPercentile percentile : percentiles) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does is make sense to keep a Map in addition to the List to make lookups like this faster? I don't know if its worth it though if the expected number of elements in the list is small.
| } | ||
| } | ||
| return percentile; | ||
| return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this you returnes Double.NaN, was that wrong?
| } | ||
| } | ||
| } | ||
| if (key != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why we don't need the check anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should not have reverted this, sorry.
|
|
||
| private static ObjectParser<ParsedTDigestPercentileRanks, Void> PARSER = | ||
| new ObjectParser<>("ParsedTDigestPercentileRanks", true, ParsedTDigestPercentileRanks::new); | ||
| new ObjectParser<>(ParsedTDigestPercentileRanks.class.getSimpleName(), true, ParsedTDigestPercentileRanks::new); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the NAME constant used in getWritableName as name for the parser. I guess it doesn't matter much, but maybe we want to stay consistent with this? I think the parser name currently is only used for error messages though.
|
|
||
| @Override | ||
| protected void assertFromXContent(T aggregation, ParsedAggregation parsedAggregation) { | ||
| super.assertFromXContent(aggregation, parsedAggregation); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this when the intention is to make super.assertFromXContent abstract?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, I'd remove this line
| for (Percentile percentile : percentileRanks) { | ||
| Double value = percentile.getValue(); | ||
| assertEquals(percentileRanks.percent(value), parsedPercentileRanks.percent(value), 0); | ||
| if (format != DocValueFormat.RAW) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this assertion should also work if the formatter in the original aggregation is DocValueFormat.RAW. The original objects method returns a value formatted with DocValueFormat.RAW, I think the parsed version should do the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks
|
|
||
| public abstract class InternalPercentilesRanksTestCase<T extends InternalAggregation> extends InternalAggregationTestCase<T> { | ||
|
|
||
| private final boolean keyed = randomBoolean(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can probably moved to a local variable in createTestInstance(...)
| public abstract class InternalPercentilesRanksTestCase<T extends InternalAggregation> extends InternalAggregationTestCase<T> { | ||
|
|
||
| private final boolean keyed = randomBoolean(); | ||
| private final DocValueFormat format = randomDocValueFormat(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can probably moved to a local variable in createTestInstance(...) if it is not needed anymore in assertFromXContent() (see comment below)
|
|
||
| private DocValueFormat randomDocValueFormat() { | ||
| if (randomBoolean()) { | ||
| return new DocValueFormat.DateTime(DateFieldMapper.DEFAULT_DATE_TIME_FORMATTER, DateTimeZone.UTC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if in this case we really need the DateTime formatter. Although I think its theoretically possible to use it with percentiles aggregations I doubt that the output would make much sense (I think it truncates any double value from the InternalPercentile to a long and then make a date out of it, but that output doesn't make much sense). Maybe I missed something, but otherwise we could test with DocValueFormat.Decimal() here instead? I implemented equals/hashCode for that formatter in #24085.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, thanks
javanna
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left a few comments, nothing major. LGTM otherwise.
| } | ||
|
|
||
| InternalPercentile getPercentile(double percent) { | ||
| for (InternalPercentile percentile : percentiles) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather use a LinkedHashMap for this, why a list that we have to iterate through when retrieving stuff per key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right that would be better. Not sure why I moved back to a list of internal percentile.
| builder.field(String.valueOf(key), percentile.getValue()); | ||
|
|
||
| String valueAsString = percentileAsString(key); | ||
| if (valueAsString != null && valueAsString.isEmpty() == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strings.hasLength? But when can it happen that the key is empty? isn't that an issue?
| builder.field(CommonFields.KEY.getPreferredName(), key); | ||
| builder.field(CommonFields.VALUE.getPreferredName(), percentile.getValue()); | ||
| String valueAsString = percentileAsString(key); | ||
| if (valueAsString != null && valueAsString.isEmpty() == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we make it final then and remove InternalCardinalityTests#testFromXContent ?
|
|
||
| @Override | ||
| protected void assertFromXContent(T aggregation, ParsedAggregation parsedAggregation) { | ||
| super.assertFromXContent(aggregation, parsedAggregation); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, I'd remove this line
| for (Percentile percentile : percentileRanks) { | ||
| Double value = percentile.getValue(); | ||
| assertEquals(percentileRanks.percent(value), parsedPercentileRanks.percent(value), 0); | ||
| if (format != DocValueFormat.RAW) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would insist on doing it now. Adding a line to an existing method for each agg is going to cost less than having to do it for all aggs afterwards. Any reason why we should postpone this? are we not sure it is the way to go?
| cdfValues[i] = randomCdfValues.get(i); | ||
| } | ||
|
|
||
| TDigestState state = new TDigestState(100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: I see this method moved to this new base class. But is TDigestState also used in HDR percentile ranks? I thought it applied only to standard percentile ranks, maybe I got this wrong though.
Also, if this method is not supposed to be overridden, make it final?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a left over, sorry for the confusion. I marked the method final +1
| double[] cdfValues, boolean keyed, DocValueFormat format); | ||
|
|
||
| @Override | ||
| protected void assertFromXContent(T aggregation, ParsedAggregation parsedAggregation) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make it final?
javanna
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left two minors, LGTM otherwise
|
|
||
| public abstract class AbstractParsedPercentiles extends ParsedAggregation implements Iterable<Percentile> { | ||
|
|
||
| private final Map<Double, Double> percentiles = new LinkedHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: (could also be a followup) is there a test that fails if we switch to the wrong impl? Maybe we should compare the iterator returned but the internal object and the iterator returned by the parsed object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: (could also be a followup) is there a test that fails if we switch to the wrong impl?
No, but I'm not sure to understand what you mean.
Maybe we should compare the iterator returned but the internal object and the iterator returned by the parsed object?
I thought it was the case but I looked again and added more verifications in #24160 - it compares both iterators obtained from the internal aggregation and the parsed one. It raised a bug in the parsed implementations where the Percentile retrieved from iterators where not equals. I looked at the internal implementation again and the percents/values are inverted (compared to the internal percentiles NOT ranks aggregations) so I changed the parsed implementation to behave the same.
the parsing side of things where key & value are "inverted" (compared to the percentiles NOT rank aggregations)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfect, thanks for doing that. always nice when adding tests leads to finding bugs :)
I previously meant that I would like to see a test fail if we lose ordering, by going back to HashMap rather than LinkedHashMap. I think that should be the case with #24160 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, got it. Yes it is now checked in #24160.
| static List<NamedXContentRegistry.Entry> getNamedXContents() { | ||
| Map<String, ContextParser<Object, ? extends Aggregation>> namedXContents = new HashMap<>(); | ||
| namedXContents.put(InternalHDRPercentileRanks.NAME, (p, c) -> ParsedHDRPercentileRanks.fromXContent(p, (String) c)); | ||
| namedXContents.put(InternalTDigestPercentileRanks.NAME, (p, c) -> ParsedTDigestPercentileRanks.fromXContent(p, (String) c)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we have cardinality here too? I hope the cardinality testFromXContent fails otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I added it here. No, the test doesn't fail because it allows the testFromXContent() test to fail without errors for all the aggregations that are not parsable yet (see the try/catch block in the test method)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh right. I think we should make this test a bit smarter otherwise it's trappy. Can we run the test only for known aggs, may using an assume? or ignore the exception only when it's correct to do so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always forgot about assume. This is a good suggestion, thanks. I fixed this in #24160
|
Thanks a lot @cbuescher @javanna |
This commit adds the logic for parsing the percentiles ranks aggregations.
This is an attempt to parse InternalHDRPercentilesRanks and InternalTDigestPercentilesRanks aggregation.