[ML] Fix master node deadlock during ML daily maintenance #31691
Conversation
Pinging @elastic/ml-core
force-pushed from 1465a92 to 785084e
force-pushed from 785084e to 3d2cdf4
droberts195 left a comment
LGTM
As we discussed, it's not ideal to be introducing yet another QA suite, `ml-native-multi-node-tests`. However, as a follow-up we'll move all the ML native tests into it and get rid of the current `ml-native-tests` suite, so relatively soon we'll be back to the current number of QA suites.
ywelsch left a comment
LGTM. I've not looked at the tests in detail but rather focused on rechecking all `MlDataRemover` implementations to see if there might be other blocking calls there. I have not found any, but noticed that `ExpiredForecastsRemover#findForecastsToDelete` possibly parses up to 10000 docs in one go on the network thread. This is not ideal, but it can be addressed in a follow-up PR and does not need to block this blocker.
```java
private SearchResponse initScroll() {
    LOGGER.trace("ES API CALL: search index {}", index);

    Transports.assertNotTransportThread("BatchedDocumentsIterator makes blocking calls");
```
Preferably put this into the `next()` method instead, so it also covers the other blocking calls in this class. Could you also write it as `assert Transports.assertNotTransportThread(...)`? That saves the extra CPU cost when assertions are disabled.
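For illustration, a minimal sketch of the suggested idiom (assuming, as is the usual Elasticsearch convention for assertion helpers, that `Transports.assertNotTransportThread` returns `true`; the method shape is approximate):

```java
// Sketch only: wrapping the check in `assert` lets the JVM elide it
// completely when assertions are disabled (i.e. in production), so the
// thread-name inspection costs nothing outside of tests.
public Deque<T> next() {
    assert Transports.assertNotTransportThread("BatchedDocumentsIterator makes blocking calls");
    // ... issue the blocking search / scroll continuation ...
}
```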
I wonder if we need this at all. The blocking call to the client runs the same assertion anyway; the real issue was that there was no test coverage. I think this entire transport action could use an ML threadpool instead.
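As a sketch of that suggestion (the pool choice `MachineLearning.UTILITY_THREAD_POOL_NAME` and the method shape are assumptions here, not the PR's final code):

```java
@Override
protected void doExecute(DeleteExpiredDataAction.Request request,
                         ActionListener<DeleteExpiredDataAction.Response> listener) {
    // Fork off the transport thread before making any blocking client
    // calls; the ML utility pool is an assumed choice for this sketch,
    // and deleteExpiredData(request, listener) is a hypothetical overload.
    threadPool.executor(MachineLearning.UTILITY_THREAD_POOL_NAME)
              .execute(() -> deleteExpiredData(request, listener));
}
```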
True, it's just a more helpful error message
Please put it into an assert if you keep it. I’d remove it.
```java
private void deleteExpiredData(Iterator<MlDataRemover> mlDataRemoversIterator,
                               ActionListener<DeleteExpiredDataAction.Response> listener) {
    Transports.assertNotTransportThread("ML Daily Maintenance");
```
This is unnecessary IMO. Can we have a comment here explaining why we fork?
Yes, it was there to fail faster during testing/development. I'll remove both `assertNotTransportThread` calls and add a comment.
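Something along these lines (a sketch; the PR's final comment wording may differ):

```java
private void deleteExpiredData(Iterator<MlDataRemover> mlDataRemoversIterator,
                               ActionListener<DeleteExpiredDataAction.Response> listener) {
    // Note: the removers make blocking calls, so callers fork this work
    // onto an ML thread pool first; it must never run on a transport
    // (network) thread.
    // ... iterate the removers, chaining through the listener ...
}
```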
s1monw left a comment
LGTM
Does this change need to be forward-ported @davidkyle?

@jasontedor Thanks for merging. No, not in this state; I intend to make further changes to the ML integration tests and incorporate @ywelsch's suggestions.

Thanks for clarifying @davidkyle!
This is the implementation of elastic#31691 for master and 6.x. Relates elastic#31683
In multi-node clusters `TransportDeleteExpiredDataAction` can try to execute a blocking search on the transport client thread, which causes the node to stop communicating. The searches in this action should execute in the Machine Learning thread pool.

Symptoms appear after the `MlDailyMaintenanceService` is triggered, which corresponds to the following message in the log file: `[INFO ][o.e.x.m.MlDailyMaintenanceService] triggering scheduled [ML] maintenance tasks`. A work-around is to disable Machine Learning on all nodes in the cluster.

The assertions in `BaseFuture` should have found this in testing, but the tests ran in a single-node cluster. I added a new Gradle project, `ml-native-multi-node-tests`, which executes against a 3-node cluster, and moved `DeleteExpiredDataIT` to it so the test now hits the failure case.

Closes #31683
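To illustrate the failure mode generically (this sketch uses plain JDK primitives, not the actual transport code): a task that blocks waiting on a result which can only be delivered by its own, fully occupied thread pool hangs forever, which is how the node "stops communicating".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Generic illustration of the deadlock class fixed here: a single-threaded
// "transport" pool whose only thread blocks waiting for a response that
// could only ever be delivered by that same pool. This program never exits.
public class TransportDeadlockSketch {
    public static void main(String[] args) {
        ExecutorService transportPool = Executors.newSingleThreadExecutor();
        transportPool.submit(() -> {
            CompletableFuture<String> searchResponse = new CompletableFuture<>();
            // The "response" would be completed by another task on the same
            // pool, but its only thread is busy right here...
            transportPool.submit(() -> searchResponse.complete("search hits"));
            // ...so this blocking wait never returns: the node appears hung.
            return searchResponse.join();
        });
    }
}
```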