[ML] Fix master node deadlock during ML daily maintenance #31691
Conversation
Pinging @elastic/ml-core
force-pushed from 1465a92 to 785084e
force-pushed from 785084e to 3d2cdf4
droberts195 left a comment
LGTM
As we discussed, it's not ideal to be introducing yet another QA suite, `ml-native-multi-node-tests`. However, as a follow-up we'll move all the ML native tests into it and get rid of the current `ml-native-tests` suite, so relatively soon we'll be back to the current number of QA suites.
ywelsch left a comment
LGTM. I've not looked at the tests in detail but rather focused on rechecking all `MlDataRemover` implementations to see if there might be other blocking calls there. I have not found any, but noticed that `ExpiredForecastsRemover#findForecastsToDelete` possibly parses up to 10000 docs in one go on the network thread. This is not ideal, but it can be addressed in a follow-up PR and does not need to block this blocker.
```java
private SearchResponse initScroll() {
    LOGGER.trace("ES API CALL: search index {}", index);

    Transports.assertNotTransportThread("BatchedDocumentsIterator makes blocking calls");
```
Preferably put this into the `next()` method instead, so it also covers the other blocking calls in this class. Could you also write it as `assert Transports.assertNotTransportThread(...)`? That saves the extra CPU cost when assertions are disabled.
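For illustration, a minimal sketch of the suggested idiom (assuming, as is the usual Elasticsearch convention for assertion helpers, that `Transports.assertNotTransportThread` returns `true`; the method shape is approximate):

```java
// Sketch only: wrapping the check in `assert` lets the JVM elide it
// completely when assertions are disabled (i.e. in production), so the
// thread-name inspection costs nothing outside of tests.
public Deque<T> next() {
    assert Transports.assertNotTransportThread("BatchedDocumentsIterator makes blocking calls");
    // ... issue the blocking search / scroll continuation ...
}
```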
I wonder if we need this at all. The blocking call to the client runs the same assertion anyway; the real issue was that there was no test coverage. I think this entire transport action could use an ML threadpool instead.
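As a sketch of that suggestion (the pool choice `MachineLearning.UTILITY_THREAD_POOL_NAME` and the method shape are assumptions here, not the PR's final code):

```java
@Override
protected void doExecute(DeleteExpiredDataAction.Request request,
                         ActionListener<DeleteExpiredDataAction.Response> listener) {
    // Fork off the transport thread before making any blocking client
    // calls; the ML utility pool is an assumed choice for this sketch,
    // and deleteExpiredData(request, listener) is a hypothetical overload.
    threadPool.executor(MachineLearning.UTILITY_THREAD_POOL_NAME)
              .execute(() -> deleteExpiredData(request, listener));
}
```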
True, it's just a more helpful error message
Please put it into an assert if you keep it. I’d remove it.
```java
private void deleteExpiredData(Iterator<MlDataRemover> mlDataRemoversIterator,
                               ActionListener<DeleteExpiredDataAction.Response> listener) {
    Transports.assertNotTransportThread("ML Daily Maintenance");
```
This is unnecessary IMO. Can we have a comment here explaining why we fork?
Yes, it was there to fail faster during testing/development. I'll remove both `assertNotTransportThread` calls and add a comment.
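Something along these lines (a sketch; the PR's final comment wording may differ):

```java
private void deleteExpiredData(Iterator<MlDataRemover> mlDataRemoversIterator,
                               ActionListener<DeleteExpiredDataAction.Response> listener) {
    // Note: the removers make blocking calls, so callers fork this work
    // onto an ML thread pool first; it must never run on a transport
    // (network) thread.
    // ... iterate the removers, chaining through the listener ...
}
```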
s1monw left a comment
LGTM
Does this change need to be forward-ported @davidkyle?

@jasontedor Thanks for merging. No, not in this state; I intend to make further changes to the ML integration tests and incorporate @ywelsch's suggestions.

Thanks for clarifying @davidkyle!
This is the implementation of elastic#31691 for master and 6.x. Relates elastic#31683
In multi-node clusters `TransportDeleteExpiredDataAction` can try to execute a blocking search on the transport client thread, which causes the node to stop communicating. The searches in this action should execute in the Machine Learning thread pool.

Symptoms appear after the `MlDailyMaintenanceService` is triggered, which corresponds to the following message in the log file: `[INFO ][o.e.x.m.MlDailyMaintenanceService] triggering scheduled [ML] maintenance tasks`. A work-around is to disable Machine Learning on all nodes in the cluster.

The assertions in `BaseFuture` should have found this in testing, but the tests ran in a single-node cluster. I added a new Gradle project, `ml-native-multi-node-tests`, which executes against a 3-node cluster, and moved `DeleteExpiredDataIT` to it so the test now hits the failure case.

Closes #31683
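To illustrate the failure mode generically (this sketch uses plain JDK primitives, not the actual transport code): a task that blocks waiting on a result which can only be delivered by its own, fully occupied thread pool hangs forever, which is how the node "stops communicating".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Generic illustration of the deadlock class fixed here: a single-threaded
// "transport" pool whose only thread blocks waiting for a response that
// could only ever be delivered by that same pool. This program never exits.
public class TransportDeadlockSketch {
    public static void main(String[] args) {
        ExecutorService transportPool = Executors.newSingleThreadExecutor();
        transportPool.submit(() -> {
            CompletableFuture<String> searchResponse = new CompletableFuture<>();
            // The "response" would be completed by another task on the same
            // pool, but its only thread is busy right here...
            transportPool.submit(() -> searchResponse.complete("search hits"));
            // ...so this blocking wait never returns: the node appears hung.
            return searchResponse.join();
        });
    }
}
```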