21 changes: 21 additions & 0 deletions docs/reference/how-to/disk-usage.asciidoc
@@ -158,3 +158,24 @@
on disk usage. In particular, integers should be stored using an integer type
stored in a `scaled_float` if appropriate or in the smallest type that fits the
use-case: using `float` over `double`, or `half_float` over `float` will help
save storage.
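
As a sketch, the following mapping (the `sensors` index and its fields are
hypothetical) stores a metric in a `half_float` and a two-decimal price in a
`scaled_float`:

[source,console]
--------------------------------------------------
PUT sensors
{
  "mappings": {
    "properties": {
      "temperature": { "type": "half_float" },
      "price": { "type": "scaled_float", "scaling_factor": 100 }
    }
  }
}
--------------------------------------------------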

[float]
=== Use index sorting to colocate similar documents

When Elasticsearch stores `_source`, it compresses multiple documents at once
in order to improve the overall compression ratio. For instance, it is very
common for documents to share the same field names, and quite common for them
to share some field values, especially on fields that have a low cardinality
or a https://en.wikipedia.org/wiki/Zipf%27s_law[Zipfian] distribution.

By default, documents are compressed together in the order that they are
added to the index. If you enable <<index-modules-index-sorting,index sorting>>,
they are instead compressed in sorted order. Sorting documents with similar
structure, fields, and values together should improve the compression ratio.
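
As a sketch, assuming a hypothetical `logs` index, sorting on a
low-cardinality field like `host` first groups similar documents together:

[source,console]
--------------------------------------------------
PUT logs
{
  "settings": {
    "index": {
      "sort.field": ["host", "@timestamp"],
      "sort.order": ["asc", "desc"]
    }
  },
  "mappings": {
    "properties": {
      "host": { "type": "keyword" },
      "@timestamp": { "type": "date" }
    }
  }
}
--------------------------------------------------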

[float]
=== Put fields in the same order in documents

Because multiple documents are compressed together into blocks, the
compressor is more likely to find longer duplicate strings in those `_source`
documents if fields always occur in the same order.
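
For instance, an application that serializes documents itself can emit keys
in a fixed order, as in this hypothetical sketch, giving the compressor more
repetition to work with:

[source,console]
--------------------------------------------------
PUT logs/_doc/1
{
  "@timestamp": "2099-01-01T12:00:00Z",
  "host": "web-1",
  "message": "connection reset"
}

PUT logs/_doc/2
{
  "@timestamp": "2099-01-01T12:00:01Z",
  "host": "web-1",
  "message": "connection timed out"
}
--------------------------------------------------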
42 changes: 42 additions & 0 deletions docs/reference/how-to/search-speed.asciidoc
@@ -326,3 +326,45 @@
queries, they should be mapped as a `keyword`.
<<index-modules-index-sorting,Index sorting>> can be useful in order to make
conjunctions faster at the cost of slightly slower indexing. Read more about it
in the <<index-modules-index-sorting-conjunctions,index sorting documentation>>.

[float]
=== Use `preference` to optimize cache utilization

There are multiple caches that can help with search performance, such as the
https://en.wikipedia.org/wiki/Page_cache[filesystem cache], the
<<shard-request-cache,request cache>> or the <<query-cache,query cache>>. Yet
all of these caches are maintained at the node level: if you run the same
request twice in a row, have one <<glossary-replica-shard,replica>> or more,
and use https://en.wikipedia.org/wiki/Round-robin_DNS[round-robin], the default
routing algorithm, then those two requests will go to different shard copies,
preventing node-level caches from helping.

Since it is common for users of a search application to run similar requests
one after another, for instance in order to analyze a narrower subset of the
index, using a preference value that identifies the current user or session
could help optimize usage of the caches.
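
As a sketch, assuming a hypothetical `logs` index, the user or session
identifier is passed as the `preference` query-string parameter:

[source,console]
--------------------------------------------------
GET logs/_search?preference=user_123
{
  "query": {
    "match": { "message": "error" }
  }
}
--------------------------------------------------

Requests that carry the same `preference` value are routed to the same shard
copies, so repeated searches from this user keep hitting the same node-level
caches.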

[float]
=== Replicas might help with throughput, but not always

In addition to improving resiliency, replicas can help improve throughput. For
instance, if you have a single-shard index and three nodes, you will need to
set the number of replicas to 2 in order to have 3 copies of your shard in
total, so that all nodes are utilized.
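
With the update index settings API, that looks like the following sketch,
assuming a hypothetical single-shard `logs` index:

[source,console]
--------------------------------------------------
PUT logs/_settings
{
  "index": { "number_of_replicas": 2 }
}
--------------------------------------------------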

Now imagine that you have a two-shard index and two nodes. In one case, the
number of replicas is 0, meaning that each node holds a single shard. In the
second case the number of replicas is 1, meaning that each node holds two
shards. Which setup performs better in terms of search? Usually, the setup
that has fewer shards per node in total will perform better, because it gives
a greater share of the available filesystem cache to each shard, and the
filesystem cache is probably Elasticsearch's number 1 performance factor. At
the same time, beware that a setup without replicas cannot survive a single
node failure, so there is a trade-off between throughput and availability.

So what is the right number of replicas? If you have a cluster with
`num_nodes` nodes and `num_primaries` primary shards _in total_, and you want
to be able to cope with at most `max_failures` node failures at once, then the
right number of replicas for you is
`max(max_failures, ceil(num_nodes / num_primaries) - 1)`.
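
Applied to the earlier example of a single-shard index on three nodes that
should survive one node failure, this gives
`max(1, ceil(3 / 1) - 1) = max(1, 2) = 2` replicas.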