@tlrx tlrx commented Jan 19, 2022

Today the setting indices.recovery.max_bytes_per_sec defaults to different values depending on the node roles, the JVM version and the system total memory that can be detected.

The current logic to set the default value can be summarized as:

  • 40 MB for non-data nodes
  • 40 MB for data nodes that run on a JVM version < 14
  • 40 MB for data nodes that have one of the data_hot, data_warm, data_content or data roles

Nodes with only data_cold and/or data_frozen roles as data roles have a default value that depends on the available memory (see the sketch after this list):

  • with ≤ 4 GB of available memory, the default is 40 MB
  • with more than 4 GB and less than or equal to 8 GB, the default is 60 MB
  • with more than 8 GB and less than or equal to 16 GB, the default is 90 MB
  • with more than 16 GB and less than or equal to 32 GB, the default is 125 MB
  • and above 32 GB, the default is 250 MB
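
For illustration, a minimal sketch of that memory-based tiering in plain Java (thresholds taken from the list above; the method name is illustrative, not the exact Elasticsearch implementation):

// Illustrative sketch of the memory-based default described above.
static long defaultRecoveryMaxBytesPerSec(long totalMemoryBytes) {
    final long MB = 1L << 20;
    final long GB = 1L << 30;
    if (totalMemoryBytes <= 4 * GB) {
        return 40 * MB;
    } else if (totalMemoryBytes <= 8 * GB) {
        return 60 * MB;
    } else if (totalMemoryBytes <= 16 * GB) {
        return 90 * MB;
    } else if (totalMemoryBytes <= 32 * GB) {
        return 125 * MB;
    } else {
        return 250 * MB;
    }
}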

While those defaults have served us well, we want to evaluate whether we can define more appropriate defaults if Elasticsearch knew more about the limits (or properties) of the hardware it is running on - something that Elasticsearch cannot detect by itself but can derive from settings that are provided at startup.

This pull request introduces the following new node settings (an example configuration follows the list):

  • node.bandwidth.recovery.network
  • node.bandwidth.recovery.disk.read
  • node.bandwidth.recovery.disk.write
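
These are plain node settings, so they can be declared in elasticsearch.yml before the node starts; the bandwidth values below are purely illustrative:

# elasticsearch.yml (illustrative values)
node.bandwidth.recovery.network: 1gb
node.bandwidth.recovery.disk.read: 500mb
node.bandwidth.recovery.disk.write: 250mb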

Those settings are not dynamic and must be set before the node starts. When they are set, Elasticsearch determines the minimum of the network, disk read and disk write bandwidths and computes a maximum bytes-per-second limit that is a fraction of that minimum available bandwidth. By default 40% of the minimum bandwidth is used, but the fraction can be configured dynamically by an operator (using the node.bandwidth.recovery.operator.factor setting) or directly by the user (using a different setting, node.bandwidth.recovery.factor).

The limit computed from the available bandwidths is then compared to pre-existing limits such as the one set through the indices.recovery.max_bytes_per_sec setting or the one that Elasticsearch computes from the node's physical memory on dedicated cold/frozen nodes. Elasticsearch uses the highest of those values, while not exceeding an overcommit ratio that is also defined through a node setting (see node.bandwidth.recovery.operator.factor.max_overcommit).

This overcommit ratio prevents the rate limit from being set to a value greater than 100 times (by default) the minimum available bandwidth.

For example, consider a node configured with the following bandwidths:

  • node.bandwidth.recovery.network : 1GB/s
  • node.bandwidth.recovery.disk.read: 500 MB/s
  • node.bandwidth.recovery.disk.write: 250 MB/s

Elasticsearch picks the minimum available bandwidth (250 MB/s) and applies the default factor of 0.4 to compute an available bandwidth of 100 MB/s. This value is then compared with the pre-existing limit, for example the default 40 MB/s, and Elasticsearch uses the highest one, in this case 100 MB/s. In any case the limit cannot exceed 250 MB/s * 100.
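
The end-to-end computation can be sketched as follows (a simplified illustration of the rules described above, with made-up names; not the actual implementation):

// Illustrative sketch of the limit computation described above.
static long recoveryMaxBytesPerSec(
    long networkBytesPerSec,          // node.bandwidth.recovery.network
    long diskReadBytesPerSec,         // node.bandwidth.recovery.disk.read
    long diskWriteBytesPerSec,        // node.bandwidth.recovery.disk.write
    double factor,                    // node.bandwidth.recovery.factor, 0.4 by default
    double maxOvercommit,             // node.bandwidth.recovery.operator.factor.max_overcommit, 100 by default
    long preExistingLimitBytesPerSec  // indices.recovery.max_bytes_per_sec or the memory-based default
) {
    // minimum of the declared bandwidths, e.g. min(1 GB/s, 500 MB/s, 250 MB/s) = 250 MB/s
    long minBandwidth = Math.min(networkBytesPerSec, Math.min(diskReadBytesPerSec, diskWriteBytesPerSec));
    // fraction of the minimum bandwidth, e.g. 250 MB/s * 0.4 = 100 MB/s
    long fromBandwidth = Math.round(minBandwidth * factor);
    // take the highest of the bandwidth-derived limit and the pre-existing one, e.g. max(100 MB/s, 40 MB/s)
    long limit = Math.max(fromBandwidth, preExistingLimitBytesPerSec);
    // but never exceed the overcommit ceiling, e.g. 250 MB/s * 100
    return Math.min(limit, Math.round(minBandwidth * maxOvercommit));
}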

@tlrx tlrx added the >enhancement and :Distributed Indexing/Recovery labels Jan 19, 2022
@henningandersen (Contributor) left a comment

Leaving a bit of initial feedback.

* Scaling factor applied to {@link #INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING} when it is derived from external settings like disk
* and network bandwidths. See {@link #recoveryMaxBytesPerSecBasedOnExternalSettings(Settings)}.
*/
public static final Setting<Double> NODE_RECOVERY_MAX_BYTES_PER_SEC_FACTOR_SETTING = Setting.doubleSetting(
Contributor:
I think it would be nice to have two settings, one that is operator controlled and one that the user can set. These should be dynamic too.

Also, I think we should define a max value of 1, at least for the user-controlled one (but both seem fine to me).

Member Author:

I added a user defined and an operator defined setting for this.


} else {
// computes max_bytes_per_sec from memory (for dedicated cold/frozen nodes)
value = recoveryMaxBytesPerSecBasedOnMemory(settings);
Contributor:

I think we should calculate this also in the case of external settings and use the max value of the two. The existing "from memory settings" is our minimum value since we would not want to lower it at least initially.

Contributor:

This is a tricky one. I think I'd rather stick with this override behaviour and build the existing from-memory behaviour into the numbers provided by the environment. If we took the max here then we wouldn't be able to properly support slower/cheaper cold nodes.

Member Author:

As discussed, I updated the code to use the highest limit among the one computed from the available bandwidth information and the one from indices.recovery.max_bytes_per_sec (or derived from the node memory). I also added a maximum allowed overcommit to avoid situations where a user configures a value that is too high.

assert NODE_ROLES_SETTING.get(settings).stream().anyMatch(DiscoveryNodeRole::canContainData);

final long diskBandwidth = NODE_DISK_AVAILABLE_BANDWIDTH_SETTING.get(settings).getBytes();
assert diskBandwidth > 0L : diskBandwidth;
Contributor:

This assert does not match the minimum value in the setting; I wonder if we can add validation there to avoid it being set to 0? Or maybe, for now, just assert >= 0 here?
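
For context, the two options mentioned above could look roughly like this (a sketch only; the bounded byteSizeSetting overload and ByteSizeValue.ofBytes are assumptions about the available helpers, not a confirmed API):

// (a) relax the assertion to match the setting's current minimum of zero:
assert diskBandwidth >= 0L : diskBandwidth;
// (b) or reject zero at parse time by declaring the setting with a strictly positive lower bound,
//     assuming an overload that takes minimum and maximum bounds:
// Setting.byteSizeSetting(key, defaultValue, ByteSizeValue.ofBytes(1L), ByteSizeValue.ofBytes(Long.MAX_VALUE), Setting.Property.NodeScope);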

* Disk's write bandwidth allocated for this node. When both this setting and {@link #NODE_NETWORK_AVAILABLE_BANDWIDTH_SETTING}
* are defined they are used to adjust the default value for {@link #INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING} per node.
*/
public static final Setting<ByteSizeValue> NODE_DISK_AVAILABLE_BANDWIDTH_SETTING = Setting.byteSizeSetting(
Contributor:

I wonder if we want both read and write bandwidth here?

It does complicate things a bit, since we may then need more factors (a generic, a read and a write factor), but if we have those available, we can better change our defaults moving forward.

Also, I think we could clarify that this number should be for 1/2 MB (512 KiB) reads or writes, matching our default chunk size. I think we may want to add more of those in the future, but it is fine to stick to one for now.

Member Author:

I added read & write bandwidths along with their corresponding factors. Let me know if that's ok.

@DaveCTurner (Contributor) left a comment

If the operator has defined our available bandwidth I think we should not let an explicit indices.recovery.max_bytes_per_sec value completely override it, but that's my understanding of what happens with this implementation. I think we should at least respect both limits (i.e. take the min) but we could even just ignore indices.recovery.max_bytes_per_sec if we're in a bandwidth-defined environment.

@henningandersen (Contributor) commented Jan 28, 2022

I think we should at least respect both limits (i.e. take the min)

Given that this first version will be conservative and somewhat incomplete (e.g., it does not boost vacates), I think we should (for now) let the max_bytes_per_sec setting override completely. Once we have completed more aspects of the project, I would agree to do as you propose, though.

elasticsearchmachine pushed a commit that referenced this pull request Jan 28, 2022
…83261)

`indices.recovery.max_bytes_per_sec` has a default value that depends on
multiple criteria that are well documented but not unit tested. This
pull request introduces unit tests that verify the current behavior so
that future changes like #82819 are less likely to break things.
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Jan 28, 2022
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Jan 28, 2022
tlrx added a commit that referenced this pull request Jan 31, 2022
…83261) (#83283)

Backport of #83261
tlrx added a commit that referenced this pull request Jan 31, 2022
…83261) (#83289)

Backport of #83261
@elasticsearchmachine (Collaborator):
Hi @tlrx, I've created a changelog YAML for you.

@tlrx tlrx marked this pull request as ready for review January 31, 2022 12:23
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) label Jan 31, 2022
@elasticmachine (Collaborator):
Pinging @elastic/es-distributed (Team:Distributed)

@tlrx (Member Author) commented Jan 31, 2022

@DaveCTurner @henningandersen Thanks for your valuable feedback. I've updated the code, hopefully it better matches your expectations. It lacks some unit tests but I think it's good enough to be reviewed. Thanks!

@DaveCTurner (Contributor) left a comment

I left some thoughts and comments inline. I also have a different suggestion for the naming of the settings:

node.bandwidth.recovery.disk.read
node.bandwidth.recovery.disk.write
node.bandwidth.recovery.network
node.bandwidth.recovery.factor
node.bandwidth.recovery.factor.read
node.bandwidth.recovery.factor.write
node.bandwidth.recovery.operator.factor
node.bandwidth.recovery.operator.factor.read
node.bandwidth.recovery.operator.factor.write
node.bandwidth.recovery.operator.factor.max_overcommit

I think it's worthwhile to put all this stuff in its own namespace, and I suggest node.bandwidth here. Then, because we will be benchmarking just a recovery workload (fairly sequential reads/writes in 512 KiB chunks), I suggest putting the node specs in n.b.recovery. Also, dropping the user part from the non-operator setting names.

* Operator-defined factors have a value in (0.0, 1.0]
*/
private static Setting<Double> operatorFactorSetting(String key, double defaultValue) {
return new Setting<>(key, Double.toString(defaultValue), s -> Setting.parseDouble(s, 0d, 1d, key), v -> {
@DaveCTurner (Contributor), Jan 31, 2022:

I think we may want to permit the operator to set this above 1.0.

NB this is not a blocking comment, we can relax this later.

Member Author:

this is not a blocking comment, we can relax this later.

Yes, I don't have a strong opinion on this.

Henning suggested that the user setting should have a max of 1.0, and maybe the operator one too; that's what I picked for simplicity... we already introduce a bunch of settings that fall back to each other, and allowing higher values here would also require checking against the max overcommit setting.

* User-defined factors have a value in (0.0, 1.0] and fall back to a corresponding operator factor setting.
*/
private static Setting<Double> userFactorSetting(String key, Setting<Double> operatorFallback) {
return new Setting<>(key, operatorFallback, s -> Setting.parseDouble(s, 0d, 1d, key), v -> {
Contributor:

... ah but this one should have a max of 1.0 so perhaps we can't just fall back to the operator setting here?


public static final Setting<ByteSizeValue> INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING = Setting.byteSizeSetting(
"indices.recovery.max_bytes_per_sec",
settings -> recoveryMaxBytesPerSecBasedOnMemory(settings).getStringRep(),
Contributor:

Could we revert this change to indices.recovery.max_bytes_per_sec here? I think this setting is now unchanged, right?

Member Author:

Sure, I reverted it in ad2604d

}

private void computeMaxBytesPerSec(Settings settings) {
validateBandwidthsSettings(settings);
Contributor:

We'll run this as a cluster state applier so we can't throw exceptions if we're in an invalid state somehow. We can definitely assert that they're valid, but in production we must apply whatever we have.

Member Author:

Good catch 🤦🏻

I renamed it to validateNodeBandwidthRecoverySettings and moved it into the constructor in 8d71b8b.

} else {
finalMaxBytesPerSec = maxBytesPerSec;
}
logger.fatal("{} ad {}", MAX_BYTES_PER_SEC_USER_FACTOR_SETTING.get(settings), NODE_ROLES_SETTING.get(settings));
Contributor:

Leftover?

Member Author:

Yes, I removed this in c0ff25d. Thanks.

}

final long availableBytesPerSec = Math.round(
Math.min(diskBandwidthBytesPerSec, networkBandwidthBytesPerSec) * MAX_BYTES_PER_SEC_USER_FACTOR_SETTING.get(settings)
Contributor:

I think applying the read/write factor and then the overall factor might lead to surprise, given that we have other settings like cluster.routing.allocation.node_concurrent{,_incoming,_outgoing}_recoveries where the unqualified setting sets both of the qualified ones.

Contributor:

Agreed. Can we instead make node.recovery.max_bytes_per_sec.user_factor.writes default to the default one?

I think we should then calculate the read speed as min(availableDiskReadBandwidth, networkBandwidthBytesPerSec) * read_factor (and likewise for the write speed).

Member Author:

Right, I agree it makes more sense the way you both suggested. I pushed a change in 6f7c47d.

The max. overcommit is calculated as Math.min(read, write) * overcommitFactor (i.e. what the limit would have been if the bandwidths were adjusted by the overcommit ratio we allow). Let me know if you want it calculated differently.
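
A rough sketch of the per-direction calculation discussed in this thread (names are illustrative; this is not the exact code that was pushed):

// Illustrative sketch: per-direction limits, each capped by the network bandwidth.
static long[] perDirectionLimits(
    long networkBytesPerSec,
    long diskReadBytesPerSec,
    long diskWriteBytesPerSec,
    double readFactor,
    double writeFactor,
    double overcommitFactor
) {
    // per-direction bandwidth, each capped by the network bandwidth
    long readBandwidth = Math.min(diskReadBytesPerSec, networkBytesPerSec);
    long writeBandwidth = Math.min(diskWriteBytesPerSec, networkBytesPerSec);
    // the available rate is the lower of the two factored limits
    long availableBytesPerSec = Math.min(Math.round(readBandwidth * readFactor), Math.round(writeBandwidth * writeFactor));
    // ceiling: what the limit would have been with the overcommit ratio applied instead of the factors
    long maxOvercommitBytesPerSec = Math.round(Math.min(readBandwidth, writeBandwidth) * overcommitFactor);
    return new long[] { availableBytesPerSec, maxOvercommitBytesPerSec };
}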

@henningandersen (Contributor) left a comment

A couple of comments; I need a bit more time but wanted to relay these early.

"node.network.allocated_bandwidth"
);

static final double DEFAULT_FACTOR_VALUE = 0.8d;
Contributor:

I think we should lower this to, say, 0.4. 0.8 seems like it could disturb ongoing operations.

Member Author:

I agree, I set it to 0.4 in d9ac148.

DaveCTurner added a commit that referenced this pull request Feb 1, 2022
tlrx added a commit that referenced this pull request Feb 2, 2022

This commit updates the Operator-only functionality doc to 
mention the operator only settings introduced in #82819.

It also adds an integration test for those operator only 
settings that would have caught #83359.
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 2, 2022
Backport of elastic#82819 for 7.17.1
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 2, 2022
…tic#83350)

The setting node.bandwidth.recovery.operator.factor.max_overcommit
wasn't added to the list of cluster settings and to the list of settings to
consume for updates.

Relates elastic#82819
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 2, 2022
tlrx pushed a commit to tlrx/elasticsearch that referenced this pull request Feb 2, 2022
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 2, 2022
…ngs (elastic#82819)

tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 2, 2022
…tic#83350)

tlrx pushed a commit to tlrx/elasticsearch that referenced this pull request Feb 2, 2022
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 2, 2022
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 2, 2022
…stic#83372)

tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 4, 2022
tlrx added a commit that referenced this pull request Feb 4, 2022
elasticsearchmachine pushed a commit that referenced this pull request Feb 9, 2022
…nal settings (#83413)

* Adjust indices.recovery.max_bytes_per_sec according to external settings

Backport of #82819 for 7.17.1

* Add missing max overcommit factor to list of (dynamic) settings (#83350)

* Operator factor settings should have the OperatorDynamic setting property (#83359)

* Add docs for node bandwith settings (#83361)

* Adjust for 7.17.1

* remove draft

* remove docs/changelog/83350.yaml

Co-authored-by: David Turner <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Feb 9, 2022
…al settings (#83414)

* Adjust indices.recovery.max_bytes_per_sec according to external settings (#82819)

* Add missing max overcommit factor to list of (dynamic) settings (#83350)

* Add docs for node bandwith settings (#83361)

* Operator factor settings should have the OperatorDynamic setting property (#83359)

* Document and test operator-only node bandwidth recovery settings (#83372)

* remove draft

* remove docs/changelog/83350.yaml

Co-authored-by: David Turner <[email protected]>

Labels

:Distributed Indexing/Recovery, >enhancement, Team:Distributed (Obsolete), v7.17.1, v8.0.1, v8.1.0
