-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[Test] More resilience for token invalidation test #75319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Test] More resilience for token invalidation test #75319
Conversation
This PR add more resilience to testExpiredTokensDeletedAfterExpiration by wrapping the empty error assertion into an assertBusy block. In rare cases, it is possible for search to return a deleted but not yet refreshed document. Trying to invalidate this document then result into MissDocumentException. This state is transient and should go away once all shards refresh. Resolves: elastic#67347
|
Pinging @elastic/es-security (Team:Security) |
| assertThat(newResponse.getPreviouslyInvalidatedTokens(), equalTo(0)); | ||
| assertThat(newResponse.getErrors(), empty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was the reviewer for the previous change (#64757). But somehow I didn't realise that we should wrap getErrors() here instead of getPreviouslyInvalidatedTokens() because:
- The original failure (TokenAuthIntegTests testExpiredTokensDeletedAfterExpiration failure #56903) was about
getErrors(). getPreviouslyInvalidatedTokens()should always return0since the refersh token is never invalidated before. If the cluster behaves really weirdly, we could probably expectgetInvalidatedTokens()to be1, but notgetPreviouslyInvalidatedTokens().
...src/internalClusterTest/java/org/elasticsearch/xpack/security/authc/TokenAuthIntegTests.java
Outdated
Show resolved
Hide resolved
| InvalidateTokenResponse invalidateRefreshTokenResponse = restClient.security().invalidateToken( | ||
| invalidateRefreshTokenRequest, SECURITY_REQUEST_OPTIONS); | ||
| assertThat(invalidateRefreshTokenResponse.getInvalidatedTokens(), equalTo(0)); | ||
| assertThat(invalidateRefreshTokenResponse.getErrors(), empty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the problems here is that we have this very complex test, but nothing testing the more normal cases.
As far as I can tell there is nothing to test what happens if you try to invalidate an access token and then invalidate its refresh token. What's that behaviour supposed to be?
And this test is trying to test a behaviour that we mostly don't care about - specifically what happens if you try to invalidate a refresh token after it's been deleted from the index? The answer is it should either behave like a token that never existed, or it should behave like a token that was recently invalidated. We don't really mind which.
I don't think we should ever return errors though - it's not the caller's problem that the cleanup job just ran.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but nothing testing the more normal cases
We can (and should) add more tests. But I'd prefer a separate PR. This one is just to tame the number of CI failures.
And this test is trying to test a behaviour that we mostly don't care about - specifically what happens if you try to invalidate a refresh token after it's been deleted from the index?
If we go by the test name literally, to me, testExpiredTokensDeletedAfterExpiration means it is sufficient to ensure that the expired token document is deleted once the cleanup job runs. It has nothing to do with subsuquent invalidations after the deletion. And like you said, any subsequent invalidation should just behave like the document never exists and we do have a separate test (testInvalidateNotValidRefreshTokens) for that. If this is the case, I'd say the test is sufficient after the second assertBusy block which asserts the document no longer exists after triggering the cleanup job.
So a simpler fix is to just drop all the code after it. We can then have separate PRs to add tests for more common use cases, e.g. invalidate refresh token after access token is invalidated etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of my concern here is that this test says it's OK to return errors for a period of time after the deletion occurs, as long as they eventually go away.
Why is that? That seems like it's just accepting that there is a bug that we don't want to fix.
And due to the weirdness in these tests overall, it's hard to tell whether that's a bug that manifests more generally, or whether it's just an edge case that we don't care about and should stop testing for.
Because it sounds like we don't care, and we're just trying to get the test to run.
Why do these assertions model the behaviour that we actually want from the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a new test for "deleting access and refresh tokens one after the other with randomness" to cover the more realistic use case as you suggested.
As for this test, the elaborated code indeed suggests for a bug. There are also other bugs in TokenService and they are all more or less caused by the same dirty read problem. All of them are quite rare and this one is even less a practical issue. So we didn't get time around to tackle them, plus they all have some complexities. Even for this one, code wise it is likely easy to skip any DocumentMissingException and just returns normally. But I am not sure whether this would be covering up other subtle and likely very rare problems. In my opinion, it is at least safer to report a false error instead of silently skip the error (unless we are perfectly sure). All these sounds like a bit too much for a PR that is meant to reduce our CI backlog. I can raise an issue to track the bug, but I'd prefer to keep it separate from this PR.
…csearch/xpack/security/authc/TokenAuthIntegTests.java Co-authored-by: Tim Vernum <[email protected]>
tvernum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
We need to fix this test. We clearly have some underlying issues around requests that come in at the same time as the cleanup job, but they're more theoretical than of practical concern.
I agree there are more issues to fix. But I think it is more about the code itself rather than the test. The test is a reflect on how the code logic being non-ideal. Once we get around to fix the code, the test should be also be fixed as part of it. |
💔 Backport failed
To backport manually run |
This PR add more resilience to testExpiredTokensDeletedAfterExpiration by wrapping the empty error assertion into an assertBusy block. In rare cases, it is possible for search to return a deleted but not yet refreshed document. Trying to invalidate this document then result into MissDocumentException. This state is transient and should go away once all shards refresh. Resolves: elastic#67347
This PR add more resilience to testExpiredTokensDeletedAfterExpiration by wrapping the empty error assertion into an assertBusy block. In rare cases, it is possible for search to return a deleted but not yet refreshed document. Trying to invalidate this document then result into MissDocumentException. This state is transient and should go away once all shards refresh. Resolves: #67347
This PR adds more resilience to testExpiredTokensDeletedAfterExpiration
by wrapping the empty error assertion into an assertBusy block. In rare
cases, it is possible for search to return a deleted but not yet
refreshed document. Trying to invalidate this document then result into
MissDocumentException. This state is transient and should go away once
all shards refresh.
Resolves: #67347