
nik9000 (Member) commented Aug 3, 2020

This way is faster, saving about 8% on the microbenchmark that rounds to
the nearest month. That is in the hot path for `date_histogram`, which is
a very popular aggregation, so it seems worth it to at least try and
speed it up a little.
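
The hot spot is the Gregorian leap-year test used while rounding dates to months. A rough sketch of the shape of the change, assuming the surrounding branches (the review diff further down shows only the last two lines of the new version, and the method name here is mine):

```java
// Sketch only: the method name and the first branch are my reconstruction;
// the review diff further down shows just the last two branches.
private static boolean isLeapYear(int year) {
    if ((year & 3) != 0) {
        return false;                       // not divisible by 4
    }
    if (year % 100 != 0) {
        return true;                        // divisible by 4 but not by 100
    }
    // Old form: return (year % 400) == 0;
    // New form: year is already a multiple of 100 here, so divisibility by
    // 400 is the same as year / 100 being divisible by 4, which a cheap
    // mask can test instead of another modulo.
    return ((year / 100) & 3) == 0;
}
```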

elasticmachine (Collaborator) commented

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

elasticmachine added the Team:Analytics label (meta label for the analytical engine team: ESQL/Aggs/Geo) Aug 3, 2020
DON'T:

* Blindly believe the numbers that your microbenchmark produces but verify them by measuring e.g. with `-prof perfasm`.
* Run more threads than your number of CPU cores (in case you run multi-threaded microbenchmarks).
* Look only at the `Score` column and ignore `Error`. Instead take countermeasures to keep `Error` low / variance explainable.

nik9000 (Member Author) commented

I wrote all this while I was working on this because I couldn't figure out how to get actual Intel assembly instructions. It turns out that if you install the hsdis that they build as part of the OpenJDK you'll just get JVM opcodes. Those are neat, but not useful to me. Instead you need the hsdis from the "Free Code Manipulation Library", which you have to build. And it isn't as simple as `git clone foo && cd foo && ./configure && make && sudo make install`. You need to do what I did.

I can separate this out into a separate change if either change is controversial.

polyfractal (Contributor) replied

Seems helpful to me... I would appreciate instructions like this if I were getting started :)
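
Since the thread above is about verifying microbenchmark numbers (and the `-prof perfasm` profiler that the hsdis instructions enable), here is a minimal JMH sketch of how the two leap-year checks could be compared in isolation. The class, field names, and data setup are mine for illustration; this is not the rounding benchmark that produced the 8% figure.

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class LeapYearBenchmark {
    private final long[] years = new long[1024];

    @Setup
    public void setup() {
        Random random = new Random(17);
        for (int i = 0; i < years.length; i++) {
            years[i] = random.nextInt(4000);   // plausible calendar years
        }
    }

    @Benchmark
    public int modulo400() {
        int leaps = 0;
        for (long year : years) {
            if ((year & 3) == 0 && (year % 100 != 0 || year % 400 == 0)) {
                leaps++;
            }
        }
        return leaps;   // returned so the JIT can't discard the work
    }

    @Benchmark
    public int divideAndMask() {
        int leaps = 0;
        for (long year : years) {
            if ((year & 3) == 0 && (year % 100 != 0 || ((year / 100) & 3) == 0)) {
                leaps++;
            }
        }
        return leaps;
    }
}
```

Appending `-prof perfasm` to the JMH arguments dumps the hot compiled assembly, which is where the hsdis build described above comes in.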

polyfractal (Contributor) left a comment

Have we tried this with different JVM versions and/or processors? This sorta feels like it might be very specific to particular JVMs, or how the JVM optimizes for different processors?



nik9000 (Member Author) commented Aug 3, 2020

> Have we tried this with different JVM versions and/or processors? This sorta feels like it might be very specific to particular JVMs, or how the JVM optimizes for different processors?

Just the bundled one. None of the assembly that it spits out is particularly strange looking, so I expect it'd be ok. I'm not sure that my weekend project is worth a huge amount of digging. If we think it'd turn sour sometimes then I don't think it is worth merging.

polyfractal (Contributor) left a comment

Cool, sounds good to me. Easy enough to revert if it tanks the benchmarks or something unexpected, but it's likely harmless in the worst case, plus some speed in the best case :)

if (year % 100 != 0) {
    return true;
}
return ((year / 100) & 3) == 0;
A member commented on the diff

Something I was thinking about - do we see the difference because of the change from `year % 400` to `(year / 100) & 3`, or because of unrolling the short circuit logic? And if it's only the former, does it make more sense to leave this as a one-liner and let the optimizer short circuit it as necessary? I'm more asking out of curiosity about how the optimizer works than out of a need to see this changed.

nik9000 (Member Author) replied

I think the short circuiting is actually the same here. At least, that is how I read the Java. I believe the speed is entirely from the change from `(year % 400) == 0` to `((year / 100) & 3) == 0`. We certainly could keep it as a one-liner, but I think it is harder to read that way. It's certainly harder for me to track down the disassembly chunks, because the disassembler sticks line numbers in there. Not that you can truly believe them, though.
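
To spell out why the two forms agree: once `year % 100 == 0` is known, `year` is exactly `100 * (year / 100)`, so `(year % 400) == 0` holds precisely when `year / 100` is a multiple of 4, and `& 3` tests that without another division. A throwaway check (hypothetical, not part of the PR):

```java
// Hypothetical throwaway check, not part of the PR: verifies the old and new
// forms agree for every multiple of 100 in a range, including negative years.
public class LeapCheckEquivalence {
    public static void main(String[] args) {
        for (long year = -400_000; year <= 400_000; year += 100) {
            boolean oldForm = (year % 400) == 0;
            boolean newForm = ((year / 100) & 3) == 0;
            if (oldForm != newForm) {
                throw new AssertionError("mismatch at year " + year);
            }
        }
        System.out.println("old and new checks agree");
    }
}
```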

@nik9000 nik9000 merged commit 1af8d9f into elastic:master Aug 5, 2020
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Aug 5, 2020
nik9000 added a commit that referenced this pull request Aug 10, 2020

Co-authored-by: Elastic Machine <[email protected]>

Labels

:Analytics/Aggregations, >non-issue, Team:Analytics, v7.10.0, v8.0.0-alpha1
