
Conversation

@original-brownbear
Contributor

This guidance does not apply any longer.
The overhead per shard has been significantly reduced in recent versions
and this kind of rule of thumb will be too pessimistic in many if not
most cases and might be too optimistic in other specific ones.
-> remove it

relates #77466

@original-brownbear original-brownbear added >docs General docs changes :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.3.0 v8.2.1 labels Apr 27, 2022
@elasticmachine elasticmachine added Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. Team:Docs Meta label for docs team labels Apr 27, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticmachine
Collaborator

Pinging @elastic/es-docs (Team:Docs)

This guidance does not apply any longer.
The overhead per shard has been significantly reduced in recent versions
and the removed rule of thumb will be too pessimistic in many if not
most cases and might be too optimistic in other specific ones.

=> Replace the guidance with a rule of thumb based on field count for data nodes and
a rule of thumb based on index count (which is far more relevant nowadays than
shards) for master nodes.

relates elastic#77466
Every mapped field also carries some overhead in terms of memory usage and disk
space. While the exact amount resource usage of each additional field depends
on the type of field, an additional 512 bytes in memory use for each additional
Contributor

@DaveCTurner DaveCTurner May 9, 2022

512B seems tight, is that a worst-case guarantee? Even in 8.2? I think we can afford to be a bit conservative with the recommendation here.

Contributor Author

It's quite conservative for 8.3 I'd say. It's ~4x what a heap dump for Beats indices would suggest in 8.3. 8.2 is probably about twice as expensive (this is only sorta true and depends on the field type).
=> how about 1kB per field then to have a nice round number?

As a result, the more fields the mappings in your indices contain, the fewer
indices and shards will each individual data node be able to hold,
all other things being equal. In addition to the per field overhead you should
at least have an additional 512Mb of heap available per data node.
Contributor

nit: bytes not bits

Suggested change
at least have an additional 512Mb of heap available per data node.
at least have an additional 512MB of heap available per data node.

Also is that 512MiB enough for everything else the node might need to do, or is that just enough for it to survive? It sounds pretty tight, maybe we need some more caveats here.

Contributor

On the flip side of that is that running with a few indices with few fields on a 512MB heap with a small workload is perfectly possible too. So either the limit needs to be lower or the wording needs to be less tight than "at least".

Contributor Author

The idea was that the 512B above was chosen conservative enough that any additional overheads are covered. In the end 512M is plenty for just data node operations IMO if we have 0.5k/field and especially so if we go with 1k/field.
Maybe add caveats for when the data node is used for ingest and/or coordination? (it's hard to put numbers on either but we could just warn about them in general?)

Contributor

The circuit breakers that apply to data node operations don't really take the shards-and-fields overhead into account tho, e.g. the fielddata and request breakers just act as if it's ok to use 40% and 60% of the total heap respectively. In the example here, 80% of the heap is already consumed with shards-and-fields overhead, so really (with the real-memory CB) there's only 15% left for doing any useful work.

Contributor Author

80% of the heap is already consumed with shards-and-fields overhead

Not really. Now that I made it 1024B in particular, we vastly overestimate the direct per-field overhead. So for larger nodes, we will have about a 50%+ margin here now. If we recommend 4G heap for the fields in the example, then really the steady state memory usage from them is probably 20% of the heap, tops, leaving enough room for the other operations.
The 512M was mostly meant to add a little more margin for error and to get reasonable numbers for smaller nodes.
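
To make the numbers in this thread concrete, here is a minimal sketch of the data-node arithmetic being proposed, assuming the 1kB-per-mapped-field-per-shard figure and the extra 0.5GB of baseline heap discussed above (both constants are still under discussion at this point, not final guidance):

```python
# Sketch of the proposed data-node rule of thumb (assumed constants, not final):
# ~1024B of heap per mapped field per shard, plus ~0.5GB of baseline heap.

BYTES_PER_FIELD = 1024      # conservative per-field overhead proposed above
BASELINE_HEAP_GB = 0.5      # additional heap for everything else on a data node
GIB = 1024 ** 3

def data_node_heap_gb(shards_on_node: int, fields_per_shard: int) -> float:
    """Estimated heap (GB) needed for the mapped-field overhead on one data node."""
    field_overhead_bytes = shards_on_node * fields_per_shard * BYTES_PER_FIELD
    return BASELINE_HEAP_GB + field_overhead_bytes / GIB

# 1000 shards with ~4000 mapped fields each (a Beats-like mapping) works out to
# roughly 4.3GiB, i.e. the ~4.5GB figure used later in this conversation.
print(round(data_node_heap_gb(1000, 4000), 1))  # ~4.3
```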

depends on various factors. These include but are not limited to the size of its mapping,
the number of shards per index or whether its mapping is shared with other indices.
A good rule of thumb is to aim for 3000 indices or fewer per GB of heap on master nodes.
For example, if your cluster contains 12,000 indices and each master node should have
Contributor

Suggested change
For example, if your cluster contains 12,000 indices and each master node should have
For example, if your cluster contains 12,000 indices then each dedicated master node should have

for non-dedicated masters should we say something about allowing for this much overhead too?

Contributor Author

My assumption was that once these numbers start to matter you'd be using dedicated master nodes anyway?
The overhead is definitely quite a bit less if you're on a non-dedicated master node because you share network buffers and such with other operations I'd say. Maybe better to add a line recommending dedicated masters here above say 5k indices or so instead?

Contributor

Not sure, that seems a pretty low threshold to be going for dedicated masters and quite possibly the wrong variable to choose too. A 3x31GiB cluster with enough CPU to spare should be able to deal with more than 5k indices, and probably better to stick with 3 big nodes rather than add 3 tiny masters on top?

Contributor Author

A 3x31GiB cluster with enough CPU to spare should be able to deal with more than 5k indices, and probably better to stick with 3 big nodes rather than add 3 tiny masters on top?

Yea probably in most cases (disks would be my biggest worry here but with much more CS batching nowadays it's less of an issue probably). I think I'd keep it simple then and really just state that the numbers must be added for non-dedicated master nodes even if it's a little pessimistic heap wise if that's ok?

Contributor

sounds good to me
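
For the master-node side, a similarly small sketch of the rule being settled on here, assuming 3000 indices or fewer per GB of heap and, per the agreement just above, that a non-dedicated master simply adds this on top of its data-node heap needs:

```python
# Sketch of the proposed master-node rule of thumb
# (assumption: 3000 indices per GB of heap on master nodes).

INDICES_PER_GB_HEAP = 3000

def master_heap_gb(total_indices_in_cluster: int) -> float:
    """Heap (GB) a dedicated master node should have for this many indices."""
    return total_indices_in_cluster / INDICES_PER_GB_HEAP

# The example from the proposed docs text: 12,000 indices -> at least 4GB of heap
# on each dedicated master node.
print(master_heap_gb(12_000))  # 4.0

# For a non-dedicated master-eligible data node, this amount is added on top of
# the heap it already needs for its own shards and mapped fields.
```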

Contributor

@henningandersen henningandersen left a comment

Thanks Armin, left a few comments.

Comment on lines 74 to 75
As a result, the more fields the mappings in your indices contain, the fewer
indices and shards will each individual data node be able to hold,
Contributor

Suggested change
As a result, the more fields the mappings in your indices contain, the fewer
indices and shards will each individual data node be able to hold,
As a result, the more mapped fields per index, the fewer
indices and shards can be held on each individual data node,

Contributor Author

++ added a verb to this though


heap memory. The exact amount of heap memory each additional index requires
depends on various factors. These include but are not limited to the size of its mapping,
the number of shards per index or whether its mapping is shared with other indices.
A good rule of thumb is to aim for 3000 indices or fewer per GB of heap on master nodes.
Contributor

Can we put a few preconditions on this rule of thumb, like less than 1000 fields and with most indices created identically from the same template?

Contributor

... and not using dynamic mapping updates.

I think these days Beats creates mappings with a lot more than 1000 fields tho. I've seen ~5k. IMO at least some of the advice should be applicable to such indices.

Contributor Author

These numbers are taken from Beats indices which do indeed come with 4.x k fields on average. The field count really doesn't matter much here, what matters is mostly that mappings are shared. Added a comment to that effect that also mentions the dynamic mapping updates now.

@original-brownbear
Contributor Author

Jenkins run elasticsearch-ci/part-2 (unrelated)

Contributor

@DaveCTurner DaveCTurner left a comment

a few more nits but this looks pretty much good to go IMO, I'll let Henning take a final look too


Every mapped field also carries some overhead in terms of memory usage and disk
space. While the exact amount resource usage of each additional field depends
on the type of field, an additional 1024B in memory use for each additional
Contributor

Suggested change
on the type of field, an additional 1024B in memory use for each additional
on the type of field, an additional 1024B in memory used for each additional

Should this be used instead of use? Also I removed an additional space.

Contributor

Either is acceptable IMO ("memory use" works here as a nounal phrase)

Contributor

Let's remove that extra space then.

Contributor

@Leaf-Lin Leaf-Lin left a comment

In the example, we talked about how having 1000 shards that each contain 4000 fields means a node with about 4.5GB of heap can hold those 1000 shards.

Later on we talked about how, for non-dedicated master nodes, each node should reserve an additional total_indices_in_cluster/3000 GB of heap.

a good rule of thumb is to aim for 3000 indices or fewer per GB of heap on master nodes
For non-dedicated master nodes, the same rule holds and should be added to the heap requirements of other node roles.

Does that mean we should also mention that one should adjust cluster.max_shards_per_node setting based on the heap available on a node?

max_idling_shards_per_node = (per_node_memory_in_GB/2 - 0.5GB - total_indices_in_cluster/3000) / number_of_fields_per_shard x GB_to_B           ...(1)

total_indices_in_cluster   = number_of_data_nodes x max_idling_shards_per_node                                                                  ...(2)

From (1) and (2), you can derive:

max_idling_shards_per_node = (per_node_memory_in_GB/2 - 0.5GB) x 3000 / (3 x number_of_fields_per_shard/1000 + number_of_data_nodes)           ...(3)

The largest node we recommend for Elasticsearch (32GB heap, or 64GB RAM) would be able to hold 7,269 idling shards in a 1-data-node cluster or 4,295 idling shards in a 10-data-node cluster, given 4,000 fields per shard, based on this formula. For a 60-data-node cluster, max_idling_shards_per_node drops further to 1,313. The more data nodes in the cluster, the fewer shards each node can hold.

The effect of field count appears to be linear in a 1-node cluster but becomes almost negligible as the cluster size grows large (>30 nodes).

Later on in the doc, we refer to temporarily adjusting the default cluster.max_shards_per_node value from 1,000 to 1,200. Do we want to expand this section to include a memory-based max idling shards per node value?

⚠️ We might want to emphasise idling a bit more in the "Each index and shard has overhead" section?
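
As a sanity check on formula (3), a short Python sketch of the same calculation (assuming heap is half of RAM, a 0.5GB baseline, ~1kB per field, and the 3000-indices-per-GB master rule folded in as above) reproduces the figures quoted in the comment:

```python
# Formula (3) above, spelled out: max idling shards per data node, given that each
# node's heap also has to cover total_indices_in_cluster/3000 GB (non-dedicated masters).

def max_idling_shards_per_node(per_node_memory_gb: float,
                               fields_per_shard: int,
                               data_nodes: int) -> int:
    heap_gb = per_node_memory_gb / 2          # heap assumed to be half of RAM
    available_gb = heap_gb - 0.5              # minus the 0.5GB baseline
    return int(available_gb * 3000 / (3 * fields_per_shard / 1000 + data_nodes))

# 64GB-RAM (32GB-heap) nodes with 4,000 fields per shard:
print(max_idling_shards_per_node(64, 4000, 1))   # 7269
print(max_idling_shards_per_node(64, 4000, 10))  # 4295
print(max_idling_shards_per_node(64, 4000, 60))  # 1312 (1,313 in the comment, which rounds up)
```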

@Leaf-Lin
Contributor

Hi, as this is just a doc change, are we able to merge this back into 8.3.0?

Contributor

@henningandersen henningandersen left a comment

Sorry for the delay here. I added a few more comments.

indices which are created from the same index template and
<<explicit-mapping,do not use dynamic mapping updates>>.

If your indices are mostly created from a small number of templates and do
Contributor

Can we add a paragraph for the non-happy case, i.e., using dynamic fields with in theory completely different mappings all over the place?

Contributor Author

What would we want to put in that paragraph? It's hard to put a number on the dynamic case and we already mention that the rule here doesn't apply when using dynamic mapping updates. Does it really make sense to go into more detail here? (particularly when we try to somewhat discourage using dynamic mapping updates to begin with?)

Contributor

I'm thinking of something pretty simple and conservative. Perhaps just 1-2kB per field?

For dynamic mappings, it can make the impact of those very visible. Like you need a 1GB heap if you do not use them but a 30GB heap (or whatever the calculation says) if you use them. That can be a good motivation for a redesign?

Another case could be applications or search use cases using 100s of specific indices with specific mappings. They fall outside the above and we then leave them with no guidance. I think those cases will definitely also result in pretty small numbers anyway, but providing the guidance seems helpful.

Contributor

I think it might make sense to defer this until #86639 is addressed - IMO it's too complicated to explain here how dynamic mappings might work with the deduplication mechanism.

Contributor

I would set it up as a worst-case with no deduplication. I agree that anything in between is hard, but providing a no-sharing guidance should be relatively easy? It puts an upper boundary on the no-sharing required heap-size.

Contributor Author

@original-brownbear original-brownbear Jun 2, 2022

but providing a no-sharing guidance should be relatively easy? It puts an upper boundary on the no-sharing required heap-size.

But this totally depends on the size of the mappings used. Also, the effect of not sharing isn't as large as it is made out to be here I think. A compressed beats mapping for example is only ~6k. So for 3k indices without any sharing whatsoever we're only talking about an 18M outright increase in the direct heap requirements for the master node. There's additional effects here from making stats APIs heavier without dedup, sending larger cluster states etc. but I find it very hard to capture these in a heap size number.
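
The arithmetic behind that ~18M figure, as a quick sketch (taking the ~6kB compressed mapping size mentioned above at face value, purely for illustration):

```python
# Rough extra master heap if mappings are not deduplicated at all, using the
# figures from the comment above (~6kB per compressed Beats-style mapping).
compressed_mapping_bytes = 6 * 1024
indices = 3000

extra_heap_mb = indices * compressed_mapping_bytes / (1024 ** 2)
print(round(extra_heap_mb))  # ~18 (MB) of direct heap increase
```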

Contributor

We talked offline about this and it seems we could remove the deduplication assumption here, since it is in effect very little extra data on the masters and is more of a cpu concern than memory.

@henningandersen
Contributor

@elasticmachine run elasticsearch-ci/docs

@original-brownbear
Contributor Author

@henningandersen @DaveCTurner how about 588bc87?

Contributor

@henningandersen henningandersen left a comment

LGTM, sorry for the back and forth on the deduplication piece.

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM2

@original-brownbear
Contributor Author

Thanks all!

@original-brownbear original-brownbear merged commit 2a5d65c into elastic:master Jun 9, 2022
@original-brownbear original-brownbear deleted the drop-obviously-broken-docs branch June 9, 2022 13:31
@DaveCTurner
Contributor

Are we still ok to backport this to 8.3 and 8.2 as per its labels?

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jun 29, 2022
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jun 29, 2022
elasticsearchmachine pushed a commit that referenced this pull request Jun 29, 2022
elasticsearchmachine pushed a commit that referenced this pull request Jun 29, 2022
@original-brownbear original-brownbear restored the drop-obviously-broken-docs branch April 18, 2023 20:57