Remove shards per gb of heap guidance #86223
Conversation
Pinging @elastic/es-distributed (Team:Distributed)
Pinging @elastic/es-docs (Team:Docs)
This guidance does not apply any longer. The overhead per shard has been significantly reduced in recent versions, and the removed rule of thumb would be too pessimistic in many if not most cases and might be too optimistic in other specific ones. => Replace the guidance with a rule of thumb based on field count for data nodes and a rule of thumb based on index count (which is far more relevant nowadays than shards) for master nodes. relates elastic#77466
Force-pushed from 04106a5 to f9dd093
| Every mapped field also carries some overhead in terms of memory usage and disk
| space. By default {es} will automatically create a mapping for every field in
| space. While the exact amount resource usage of each additional field depends
| on the type of field, an additional 512 bytes in memory use for each additional
512B seems tight, is that a worst-case guarantee? Even in 8.2? I think we can afford to be a bit conservative with the recommendation here.
It's quite conservative for 8.3 I'd say. It's ~4x what a heap dump for Beats indices would suggest in 8.3. 8.2 is probably about twice as expensive (this is only sorta true and depends on the field type).
=> how about 1kb per field then to have a nice round number?
| As a result, the more fields the mappings in your indices contain, the fewer
| indices and shards will each individual data node be able to hold,
| all other things being equal. In addition to the per field overhead you should
| at least have an additional 512Mb of heap available per data node.
nit: bytes not bits
| at least have an additional 512Mb of heap available per data node.
| at least have an additional 512MB of heap available per data node.
Also is that 512MiB enough for everything else the node might need to do, or is that just enough for it to survive? It sounds pretty tight, maybe we need some more caveats here.
On the flip side of that is that running with a few indices with few fields on a 512MB heap with a small workload is perfectly possible too. So either the limit needs to be lower or we need a less tight wording than "at least".
The idea was that the 512B above was chosen conservatively enough that any additional overheads are covered. In the end 512M is plenty for just data node operations IMO if we have 0.5k/field, and especially so if we go with 1k/field.
Maybe add caveats for when the data node is used for ingest and/or coordination? (it's hard to put numbers on either but we could just warn about them in general?)
The circuit breakers that apply to data node operations don't really take the shards-and-fields overhead into account tho, e.g. the fielddata and request breakers just act as if it's ok to use 40% and 60% of the total heap respectively. In the example here, 80% of the heap is already consumed with shards-and-fields overhead, so really (with the real-memory CB) there's only 15% left for doing any useful work.
80% of the heap is already consumed with shards-and-fields overhead
Not really. Now that I made it 1024B in particular, we vastly overestimate the direct per-field overhead. So for larger nodes, we will have about a 50%+ margin here now. If we recommend 4G heap for the fields in the example, then really the steady-state memory usage from them is probably 20% of the heap, tops, leaving enough room for the other operations.
The 512M was mostly meant to add a little more margin for error and to get reasonable numbers for smaller nodes.
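For illustration, here is a minimal back-of-the-envelope sketch of the arithmetic behind the numbers in this thread, assuming the figures proposed above (roughly 1KB of heap per mapped field per shard plus ~0.5GB of base heap per data node); it is an estimate, not an exact accounting of per-field overhead.

```python
# Back-of-the-envelope data-node heap estimate, assuming the figures from this
# thread: ~1 KB of heap per mapped field per shard plus ~0.5 GB of base heap.

def estimate_data_node_heap_gb(num_shards: int, fields_per_shard: int,
                               bytes_per_field: int = 1024,
                               base_heap_gb: float = 0.5) -> float:
    """Rough heap needed for a data node holding the given idle shards."""
    field_overhead_gb = num_shards * fields_per_shard * bytes_per_field / 1e9
    return base_heap_gb + field_overhead_gb

# Example discussed later in the thread: 1000 shards with 4000 fields each.
print(estimate_data_node_heap_gb(1000, 4000))  # ~4.6 GB (~4.5 GB if 1 KB is taken as 1000 bytes)
```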
| depends on various factors. The include but are not limited to the size of its mapping,
| the number of shards per index or whether its mapping is shared with other indices.
| A good rule of thumb is to aim for 3000 indices or fewer per GB of heap on master nodes.
| For example, if your cluster contains 12,000 indices and each master node should have
| For example, if your cluster contains 12,000 indices and each master node should have
| For example, if your cluster contains 12,000 indices then each dedicated master node should have
for non-dedicated masters should we say something about allowing for this much overhead too?
My assumption was that once these numbers start to matter you'd be using dedicated master nodes anyway?
The overhead is definitely quite a bit less if you're on a non-dedicated master node because you share network buffers and such with other operations I'd say. Maybe better to add a line recommending dedicated masters here above say 5k indices or so instead?
Not sure, that seems a pretty low threshold to be going for dedicated masters and quite possibly the wrong variable to choose too. A 3x31GiB cluster with enough CPU to spare should be able to deal with more than 5k indices, and probably better to stick with 3 big nodes rather than add 3 tiny masters on top?
A 3x31GiB cluster with enough CPU to spare should be able to deal with more than 5k indices, and probably better to stick with 3 big nodes rather than add 3 tiny masters on top?
Yea probably in most cases (disks would be my biggest worry here but with much more CS batching nowadays it's less of an issue probably). I think I'd keep it simple then and really just state that the numbers must be added for non-dedicated master nodes even if it's a little pessimistic heap wise if that's ok?
sounds good to me
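To make the master-node rule of thumb concrete, here is a minimal sketch of the calculation (3000 indices or fewer per GB of master heap, added on top of the other requirements for non-dedicated masters, as agreed above); the 4.5GB data-node figure reused at the end is just the example from the earlier thread.

```python
# Sketch of the master-node rule of thumb: aim for 3000 indices or fewer per
# GB of heap on master nodes; for non-dedicated masters, add this on top of
# the heap the node needs for its other roles.

def master_heap_gb(total_indices: int, indices_per_gb: int = 3000) -> float:
    return total_indices / indices_per_gb

# The example from the diff: a cluster with 12,000 indices.
print(master_heap_gb(12_000))        # 4.0 GB per dedicated master node
print(4.5 + master_heap_gb(12_000))  # 8.5 GB if the master is also the data node sized above
```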
henningandersen
left a comment
Thanks Armin, left a few comments.
| As a result, the more fields the mappings in your indices contain, the fewer
| indices and shards will each individual data node be able to hold,
| As a result, the more fields the mappings in your indices contain, the fewer
| indices and shards will each individual data node be able to hold,
| As a result, the more mapped fields per index, the fewer
| indices and shards can be held on each individual data node,
++ added a verb to this though
| heap memory. The exact amount of heap memory each additional index requires
| depends on various factors. The include but are not limited to the size of its mapping,
| the number of shards per index or whether its mapping is shared with other indices.
| A good rule of thumb is to aim for 3000 indices or fewer per GB of heap on master nodes.
Can we put a few preconditions on this rule of thumb, like less than 1000 fields and with most indices created identically from the same template?
... and not using dynamic mapping updates.
I think these days Beats creates mappings with a lot more than 1000 fields tho. I've seen ~5k. IMO at least some of the advice should be applicable to such indices.
These numbers are taken from Beats indices which do indeed come with 4.x k fields on average. The field count really doesn't matter much here, what matters is mostly that mappings are shared. Added a comment to that effect that also mentions the dynamic mapping updates now.
Jenkins run elasticsearch-ci/part-2 (unrelated)
DaveCTurner
left a comment
a few more nits but this looks pretty much good to go IMO, I'll let Henning take a final look too
| Every mapped field also carries some overhead in terms of memory usage and disk
| space. While the exact amount resource usage of each additional field depends
| on the type of field, an additional 1024B in memory use for each additional
| on the type of field, an additional 1024B in memory use for each additional
| on the type of field, an additional 1024B in memory used for each additional
Should this be used instead of use? Also I removed an additional space.
Either is acceptable IMO ("memory use" works here as a nounal phrase)
Let's remove that extra space then.
In the example, we say that 1000 shards each containing 4000 fields means a node with a 4.5GB heap can hold those 1000 shards.
Later on, we say that for non-dedicated master nodes each node should additionally reserve total_indices_in_cluster/3000 GB of heap:
a good rule of thumb is to aim for 3000 indices or fewer per GB of heap on master nodes
For non-dedicated master nodes, the same rule holds and should be added to the heap requirements of other node roles.
Does that mean we should also mention that one should adjust cluster.max_shards_per_node setting based on the heap available on a node?
max_idling_shards_per_node = (per_node_memory_in_GB/2 - 0.5GB - total_indices_in_cluster/3000) x GB_to_KB / number_of_fields_per_shard ...(1), with GB_to_KB = 10^6, i.e. assuming ~1KB of heap per field
total_indices_in_cluster = number_of_data_nodes x max_idling_shards_per_node ...(2)
From (1) and (2), you can derive:
max_idling_shards_per_node = (per_node_memory_in_GB/2 - 0.5GB) x 3000 / (3 x number_of_fields_per_shard/1000 + number_of_data_nodes) ...(3)
The largest node we recommend for Elasticsearch (32GB heap, i.e. 64GB RAM) would be able to hold 7,269 idling shards in a 1-data-node cluster or 4,295 idling shards in a 10-data-node cluster, given 4,000 fields per shard, based on this formula. For a 60-data-node cluster, max_idling_shards_per_node drops further to 1,313. The more data nodes in the cluster, the fewer shards each node can hold.
The effect of field count appears to be roughly linear in a 1-node cluster but becomes almost negligible as the cluster size grows large (>30 nodes).
Later on in the doc, we refer to adjusting the default cluster.max_shards_per_node value from 1,000 to 1,200 temporarily. Do we want to expand this section to include a memory-based max idling shards per node value, perhaps in the "Each index and shard has overhead" section?
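For reference, a small sketch of formula (3), purely to sanity-check the quoted figures; it assumes heap = RAM/2, a 0.5GB base, ~1KB of heap per field per shard, and the 3000-indices-per-GB master overhead being paid on every data node, as in the comment above.

```python
# Sketch of formula (3) above; assumes heap = RAM/2, 0.5 GB base heap, ~1 KB
# per mapped field per shard, and the master-node index overhead (3000 indices
# per GB) counted on every data node.

def max_idling_shards_per_node(ram_gb: float, fields_per_shard: int,
                               data_nodes: int) -> float:
    heap_budget_gb = ram_gb / 2 - 0.5
    return heap_budget_gb * 3000 / (3 * fields_per_shard / 1000 + data_nodes)

# Reproducing the quoted figures for 64 GB RAM (32 GB heap) and 4000 fields/shard:
for n in (1, 10, 60):
    print(n, round(max_idling_shards_per_node(64, 4000, n), 1))
# -> 1: ~7269.2, 10: ~4295.5, 60: 1312.5 (the ~1313 quoted above)
```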
Hi, as this is just a doc change, are we able to merge this back into 8.3.0?
henningandersen
left a comment
Sorry for the delay here. I added a few more comments.
| indices which are created from the same index template and
| <<explicit-mapping,do not use dynamic mapping updates>>.
| If your indices are are mostly created from a small number of templates and do
Can we add a paragraph for the non-happy case, i.e., using dynamic fields with in theory completely different mappings all over the place?
What would we want to put in that paragraph? It's hard to put a number on the dynamic case and we already mention that the rule here doesn't apply when using dynamic mapping updates. Does it really make sense to go into more detail here? (particularly when we try to somewhat discourage using dynamic mapping updates to begin with?)
I think of something pretty simple and conservative. Perhaps just 1-2kb per field?
For dynamic mappings, it can make the impact of those very visible. Like you need a 1GB heap if you do not use them but a 30GB heap (or whatever the calculation says) if you use them. That can be a good motivation for a redesign?
Another case could be applications or search use cases using 100s of specific indices with specific mappings. They fall outside the above and we then leave them with no guidance. I think those cases will definitely also result in pretty small numbers anyway, but providing the guidance seems helpful.
I think it might make sense to defer this until #86639 is addressed - IMO it's too complicated to explain here how dynamic mappings might work with the deduplication mechanism.
I would set it up as a worst-case with no deduplication. I agree that anything in between is hard, but providing a no-sharing guidance should be relatively easy? It puts an upper boundary on the no-sharing required heap-size.
but providing a no-sharing guidance should be relatively easy? It puts an upper boundary on the no-sharing required heap-size.
But this totally depends on the size of the mappings used. Also, the effect of not sharing isn't as large as it is made out to be here I think. A compressed beats mapping for example is only ~6k. So for 3k indices without any sharing whatsoever we're only talking about an 18M outright increase in the direct heap requirements for the master node. There's additional effects here from making stats APIs heavier without dedup, sending larger cluster states etc. but I find it very hard to capture these in a heap size number.
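As a rough check on those numbers, a tiny sketch of the no-deduplication upper bound mentioned here; the ~6KB compressed mapping size and the 3k index count are the figures from this comment, not measurements.

```python
# Worst-case (no mapping deduplication) heap increase on the master, using the
# figures from this comment: a ~6 KB compressed Beats-style mapping and 3000
# indices that share nothing.

compressed_mapping_kb = 6
indices = 3000
print(compressed_mapping_kb * indices / 1000, "MB of extra mappings")  # 18.0 MB
```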
We talked offline about this and it seems we could remove the deduplication assumption here, since it is in effect very little extra data on the masters and is more of a cpu concern than memory.
@elasticmachine run elasticsearch-ci/docs
@henningandersen @DaveCTurner how about 588bc87?
henningandersen
left a comment
LGTM, sorry for the back and forth on the deduplication piece.
DaveCTurner
left a comment
LGTM2
Thanks all!
Are we still ok to backport this to 8.3 and 8.2 as per its labels?
This guidance does not apply any longer. The overhead per shard has been significantly reduced in recent versions, and the removed rule of thumb would be too pessimistic in many if not most cases and might be too optimistic in other specific ones. => Replace the guidance with a rule of thumb based on field count for data nodes and a rule of thumb based on index count (which is far more relevant nowadays than shards) for master nodes. relates elastic#77466 Co-authored-by: David Turner <[email protected]> Co-authored-by: Henning Andersen <[email protected]>
This guidance does not apply any longer.
The overhead per shard has been significantly reduced in recent versions
and this kind of rule of thumb will be too pessimistic in many if not
most cases and might be too optimistic in other specific ones.
-> remove it
relates #77466