SearchRequest#allowPartialSearchResults does not handle successful retries #43095
…tries When set to false, the allowPartialSearchResults option does not check whether the shard failures have been reset to null. The atomic array that records shard failures is filled with a null value if a request on a shard succeeds after an earlier attempt on another replica of that shard failed. In this case the atomic array is not empty but contains only null values, so this should not be treated as a failure: all shards are successful (some replicas failed, but the retries on another replica succeeded). This change fixes the bug by checking the content of the atomic array and failing the request only if allowPartialSearchResults is set to false and at least one shard failure is not null. Closes elastic#40743
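The difference between the two checks can be illustrated with a minimal, self-contained sketch. This is not the Elasticsearch code: `AtomicReferenceArray` stands in for the internal atomic array, and the class and method names (`PartialResultsCheck`, `hasFailuresBuggy`, `hasFailuresFixed`) are hypothetical, chosen only for illustration.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical stand-in for the coordinating node's failure bookkeeping.
class PartialResultsCheck {

    // Buggy variant: the mere existence of the failure array is treated as a failure,
    // even if every slot was later reset to null by a successful retry.
    static boolean hasFailuresBuggy(AtomicReferenceArray<Exception> shardFailures) {
        return shardFailures != null && shardFailures.length() > 0;
    }

    // Fixed variant: inspect the content and report a failure only if
    // at least one slot still holds a non-null failure.
    static boolean hasFailuresFixed(AtomicReferenceArray<Exception> shardFailures) {
        if (shardFailures == null) {
            return false;
        }
        for (int i = 0; i < shardFailures.length(); i++) {
            if (shardFailures.get(i) != null) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two shards; the array is created lazily on the first failure.
        AtomicReferenceArray<Exception> failures = new AtomicReferenceArray<>(2);
        // First attempt on shard 0 fails on one replica...
        failures.set(0, new RuntimeException("replica A failed"));
        // ...then the retry on another replica succeeds and resets the slot.
        failures.set(0, null);

        System.out.println(hasFailuresBuggy(failures)); // prints "true"
        System.out.println(hasFailuresFixed(failures)); // prints "false"
    }
}
```

With allowPartialSearchResults set to false, the buggy variant would reject a request in which every shard ultimately succeeded, while the fixed variant rejects it only when a failure is still recorded.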
Pinging @elastic/es-search
markharwood
left a comment
LGTM - test passing here.
There is just some commented-out code in the test class that looks like it needs dealing with:
    }
    //} else {
    //    initializing.add(routing);
    //}
Comments need removing?
javanna
left a comment
LGTM
…tries (#43095) When set to false, the allowPartialSearchResults option does not check whether the shard failures have been reset to null. The atomic array that records shard failures is filled with a null value if a request on a shard succeeds after an earlier attempt on another replica of that shard failed. In this case the atomic array is not empty but contains only null values, so this should not be treated as a failure: all shards are successful (some replicas failed, but the retries on another replica succeeded). This change fixes the bug by checking the content of the atomic array and failing the request only if allowPartialSearchResults is set to false and at least one shard failure is not null. Closes #40743
Before elastic#57042 the max_buckets test would consistently pass because the request would consistently fail. In particular, the request would fail on the data node. After elastic#57042 it only fails on the coordinating node. When the max_buckets test is run in a mixed version cluster, the request consistently fails on *either* the data node or the coordinating node, except when the coordinating node is missing elastic#43095. In that case, if one data node has elastic#57042 and one does not, *and* the one that doesn't gets the request first, it fails the request as expected and the coordinating node then retries the request on the node with elastic#57042. When that happens the request fails mysteriously with "partial shard failures" as the error message but no partial failures reported. This is *exactly* the bug fixed in elastic#43095. This updates the test to be skipped in mixed version clusters without elastic#43095 because they *sometimes* fail the test spuriously. The request fails in those cases, just like we expect, but with a mysterious error message. Closes elastic#57657