Speed up PK lookups at index time. #19856
Conversation
This is basically a revert of #14070. I'm stalling this on making cache keys / closed listeners less trappy.

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@jpountz Is this PR worth reviving?

Absolutely!
At index time Elasticsearch needs to look up the version associated with the `_id` of the document that is being indexed, which is often the bottleneck for indexing. While reviewing the output of the `jfr` telemetry from a Rally benchmark, I saw that significant time was spent in `ConcurrentHashMap#get` and `ThreadLocal#get`. The reason is that we cache lookup objects per thread and segment, and for every indexed document, we first need to look up the cache associated with this segment (`ConcurrentHashMap#get`) and then get a state that is local to the current thread (`ThreadLocal#get`). So if you are indexing N documents per second and have S segments, both these methods will be called N*S times per second.

This commit changes version lookup to use a cache per index reader rather than per segment. While this makes cache entries shorter-lived, we now only need one call to `ConcurrentHashMap#get` and `ThreadLocal#get` per indexed document.
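To make the change concrete, here is a minimal, hypothetical sketch of per-reader caching of per-thread lookup state, written against Lucene 7's `IndexReader.CacheHelper` API. The class and field names (`PerReaderVersionLookupCache`, `PerThreadLookup`, `lookupStates`) are illustrative stand-ins, not the actual Elasticsearch implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.index.IndexReader;

// Stand-in for Elasticsearch's per-segment lookup object; just enough to show the caching.
final class PerThreadLookup {
    final String uidField;

    PerThreadLookup(String uidField) {
        this.uidField = uidField;
    }
}

// Sketch of the per-reader caching idea: one cache entry per top-level reader, so indexing a
// document costs one ConcurrentHashMap#get plus one ThreadLocal#get instead of one of each per segment.
final class PerReaderVersionLookupCache {

    // Keyed by the top-level reader's cache key rather than by per-segment core keys.
    private final Map<IndexReader.CacheKey, ThreadLocal<PerThreadLookup[]>> lookupStates =
            new ConcurrentHashMap<>();

    PerThreadLookup[] getLookupState(IndexReader reader, String uidField) {
        // The real code also needs to handle readers that expose no cache helper.
        IndexReader.CacheHelper cacheHelper = reader.getReaderCacheHelper();
        ThreadLocal<PerThreadLookup[]> ctl = lookupStates.computeIfAbsent(cacheHelper.getKey(), key -> {
            // Evict the entry when the reader is closed so the map does not leak.
            cacheHelper.addClosedListener(closedKey -> lookupStates.remove(closedKey));
            return new ThreadLocal<PerThreadLookup[]>();
        });
        PerThreadLookup[] lookup = ctl.get();
        if (lookup == null) {
            // Still one lookup object per segment, but resolved once per reader and thread.
            lookup = new PerThreadLookup[reader.leaves().size()];
            for (int i = 0; i < lookup.length; i++) {
                lookup[i] = new PerThreadLookup(uidField);
            }
            ctl.set(lookup);
        }
        return lookup;
    }
}
```

A per-segment variant would instead key `lookupStates` on each leaf's core cache key, which is what made `ConcurrentHashMap#get` and `ThreadLocal#get` show up once per segment for every indexed document in the profiles.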
Force-pushed 896fa1f to fc1db9e
I rebased this PR!
s1monw left a comment:
Left a small comment, LGTM otherwise.
```java
private static PerThreadIDVersionAndSeqNoLookup[] getLookupState(IndexReader reader, String uidField) throws IOException {
    // We cache on the top level
    IndexReader.CacheHelper cacheHelper = reader.getCoreCacheHelper();
    CloseableThreadLocal<PerThreadIDVersionAndSeqNoLookup> ctl = lookupStates.get(cacheHelper.getKey());
```
Can you leave a comment explaining why we moved to this? You can also reference this PR for documentation purposes.
+1. I had more comments initially, but they apparently got lost with the rebase.
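For illustration, the kind of explanatory comment being requested might look like the following sketch; the wording is an assumption, not the comment that was actually committed:

```java
// Hypothetical wording only; not the actual committed comment.
private static PerThreadIDVersionAndSeqNoLookup[] getLookupState(IndexReader reader, String uidField) throws IOException {
    // We cache on the top-level reader rather than per segment (see #19856):
    // entries are invalidated whenever a new top-level reader is opened (e.g. on refresh),
    // but each indexed document then needs only one ConcurrentHashMap#get and one
    // ThreadLocal#get instead of one of each per segment.
    IndexReader.CacheHelper cacheHelper = reader.getCoreCacheHelper();
    CloseableThreadLocal<PerThreadIDVersionAndSeqNoLookup> ctl = lookupStates.get(cacheHelper.getKey());
    // ...
}
```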
* master:
  Upgrade icu4j for the ICU analysis plugin to 59.1 (elastic#25243)
  move assertBusy to use CheckException (elastic#25246)
  Use SPI in High Level Rest Client to load XContent parsers (elastic#25098)
  [TEST] test that low level REST client leaves path untouched (elastic#25193)
  Speed up PK lookups at index time. (elastic#19856)
  [Docs] Fix documentation for percentiles bucket aggregation (elastic#25229)
  Upgrade to lucene-7.0.0-snapshot-92b1783. (elastic#25222)
  Build: Add master flag for disabling bwc tests (elastic#25230)
  Scripting: Rename SearchScript.needsScores to needs_score (elastic#25235)
  Support script context stateful factory in Painless. (elastic#25233)
  FastVectorHighlighter should not cache the field query globally (elastic#25197)
  Remove QUERY_AND_FETCH BWC for pre-5.3.0 nodes (elastic#25223)
  Add more missing AggregationBuilder getters (elastic#25198)
  Extract the snapshot/restore full cluster restart tests from the translog full cluster restart tests (elastic#25204)
* master: (44 commits)
  Upgrade icu4j for the ICU analysis plugin to 59.1 (elastic#25243)
  move assertBusy to use CheckException (elastic#25246)
  Use SPI in High Level Rest Client to load XContent parsers (elastic#25098)
  [TEST] test that low level REST client leaves path untouched (elastic#25193)
  Speed up PK lookups at index time. (elastic#19856)
  [Docs] Fix documentation for percentiles bucket aggregation (elastic#25229)
  Upgrade to lucene-7.0.0-snapshot-92b1783. (elastic#25222)
  Build: Add master flag for disabling bwc tests (elastic#25230)
  Scripting: Rename SearchScript.needsScores to needs_score (elastic#25235)
  Support script context stateful factory in Painless. (elastic#25233)
  FastVectorHighlighter should not cache the field query globally (elastic#25197)
  Remove QUERY_AND_FETCH BWC for pre-5.3.0 nodes (elastic#25223)
  Add more missing AggregationBuilder getters (elastic#25198)
  Extract the snapshot/restore full cluster restart tests from the translog full cluster restart tests (elastic#25204)
  Refactor TransportShardBulkAction.executeUpdateRequest and add tests
  Make sure range queries are correctly profiled. (elastic#25108)
  Test: allow setting socket timeout for rest client (elastic#25221)
  Migration docs for elastic#25080 (elastic#25218)
  Remove `discovery.type` BWC layer from the EC2/Azure/GCE plugins elastic#25080
  When stopping via systemd only kill the JVM, not its control group (elastic#25195)
  ...
At index time Elasticsearch needs to look up the version associated with the `_uid` of the document that is being indexed, which is often the bottleneck for indexing. While reviewing the output of the `jfr` telemetry from a Rally benchmark, I saw that significant time was spent in `ConcurrentHashMap#get` and `ThreadLocal#get`. The reason is that we cache lookup objects per thread and segment, and for every indexed document, we first need to look up the cache associated with this segment (`ConcurrentHashMap#get`) and then get a state that is local to the current thread (`ThreadLocal#get`). So if you are indexing N documents per second and have S segments, both these methods will be called N*S times per second.

This commit changes version lookup to use a cache per index reader rather than per segment. While this makes cache entries shorter-lived, we now only need one call to `ConcurrentHashMap#get` and `ThreadLocal#get` per indexed document.

NOTE: I had to remove `IndexReader#getCombinedCoreAndDeletesKey` from the forbidden APIs in order to make it work.
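For reference, forbidden-APIs rules are plain signature lines, so the entry that had to be dropped would look roughly like the sketch below; the exact signatures file and any accompanying message are assumptions, and only the class and method name come from the note above:

```
# Entry removed from the project's forbidden-APIs signatures file (exact file is an assumption):
org.apache.lucene.index.IndexReader#getCombinedCoreAndDeletesKey()
```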
Here are some screenshots of the hot methods as reported when indexing 100M empty documents with Rally, using 1 shard, 0 replicas, and otherwise default settings.


Before:
After:
`ThreadLocal#get` is no longer among the hottest methods after the change, and while `ConcurrentHashMap#get` is still there, its main callers are now the live version map and the circuit breaker, not the state we maintain for version lookups in the index.

Here is the report from `rally compare` on two races that have been run without telemetry (so that it does not affect the benchmark results). The speedup looks real.
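A comparison like this is produced with Rally's `compare` subcommand; a hypothetical invocation, with placeholder race IDs for the baseline (before the change) and contender (after the change) runs:

```
esrally compare --baseline=<baseline-race-id> --contender=<contender-race-id>
```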