Skip to content

Conversation

@ktoso
Copy link
Member

@ktoso ktoso commented Jan 27, 2020

Implement missing SWIM functionality to move across the unreachable -> alive edge and signal those to the ClusterShell.

Motivation:

Without this the unreachable state was really terminal, and a node would never become reachable again.

Realistically, we'll indeed want to signal unreachable, and immediately signal down in the downing strategy, however that is not how the state is designed -- it shall allow moving back to alive. As only the dead state is "terminal".

More to be done here soon, but this unlocks the "big hardening PR" which was failing because of some SWIM problems that the previous ( this one was important #400 ) and this PR address.

Modifications:

  • Implements marking as "change" thanks to which we can emit an event to the cluster shell only if the change really was "effective".
  • allows for disabling the system provided SWIM -- helpful in the clustered tests where we also drive a SWIM manually -- logs would otherwise contain both the fake and real one, which was confusing
  • provide integration test for the reachability flip/flopping
    • I am struggling with making this test reproduced and clean in the SWIMShellClusteredTests 🆘 ❗️ still a bit sucky, but managed to have it at least not timeout due to Swift deciding to recompile all projects when swift run is hit in the integration test... 0c1df2d
    • pushed attempts today as // FIXME: Can't seem to implement a hardened test like this... func ignored_test_swim_shouldNotifyClusterAboutUnreachableNode_andThenReachableAgain() throws { and need to revisit somehow OR decide that we do not test that specific dance without integration test? (would be a lest down)
  • fix the warnings build setting so it actually works
  • many more tests, though it still needs more about edge cases and reconnecting (!)

Result:

##
## This source file is part of the Swift Distributed Actors open source project
##
## Copyright (c) 2018-2019 Apple Inc. and the Swift Distributed Actors project authors
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: 2020

@ktoso ktoso requested review from drexin and yim-lee January 27, 2020 16:41
@ktoso
Copy link
Member Author

ktoso commented Jan 27, 2020

Whoop, missed a warning; Means tho that the build finally does check for them :)

01:42:08 [947/996] Compiling DistributedActorsTests ActorIsolationFailureHandlingTests+XCTest.swift
01:42:08 /code/Tests/DistributedActorsTests/ActorLeakingTests.swift:162:33: error: result of call to 'spawn(_:of:props:_:)' is unused
01:42:08                     try context.spawn(.anonymous, b)
01:42:08                                 ^    ~~~~~~~~~~~~~~~

📗 fixed...

Copy link
Member

@yim-lee yim-lee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I can't review the intricate parts of the logic too well but there seems to be good test cases to cover the bases. 👍

@ktoso
Copy link
Member Author

ktoso commented Jan 27, 2020

Sorry I can't review the intricate parts of the logic too well but there seems to be good test cases to cover the bases. 👍

Indeed, it's quite tricky and one has to know the paper and our cluster internals well.. Thanks for having a look @yim-lee!

@ktoso
Copy link
Member Author

ktoso commented Jan 28, 2020

Hmm integration test not happy on linux it seems

@ktoso
Copy link
Member Author

ktoso commented Jan 28, 2020

So doing swift run of the two tasks causes them to recompile the project (O M G), so the integration test takes ages due to the recompilation... Happens only on linux really it seems. I can't seem to work around it, so just ensured that the timeouts are waiting for the runs to start.

@ktoso
Copy link
Member Author

ktoso commented Jan 28, 2020

Failure on the known ActorSingletonPluginClusteredTests.test_singletonByClusterLeadership which #376 works towards stabilizing

@ktoso
Copy link
Member Author

ktoso commented Jan 28, 2020

@swift-server-bot test this please

@ktoso
Copy link
Member Author

ktoso commented Jan 28, 2020

Debugging the integration test getting stuck on CI with no logs printed from some point...

@ktoso
Copy link
Member Author

ktoso commented Jan 28, 2020

The problem was using SIGTSTP and not SIGSTOP to suspend the process which made the CI integration test not pass.

There's another task to complete the more "unit/integration" test

}
}
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow up ticket #405

@ktoso ktoso merged commit 3c0bbab into apple:master Jan 28, 2020
@ktoso ktoso deleted the wip-swim-detect-alive-again branch January 28, 2020 07:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SWIMShell: SWIM shell never emits .reachable after a node is found reachable again Build with more strict build settings

2 participants