Description
Currently we do not offer a straightforward way to enable security on a cluster without doing a full cluster restart. This is because a cluster cannot communicate when half of it is plaintext and half of it is TLS.
We can support both TLS and plaintext communications on new nodes that have enabled security. This would allow a rolling restart when enabling security.
The proposed solution is:
- On incoming connections, check if the first incoming bytes are 'E', 'S'. These bytes do not overlap with TLS bytes. If the first bytes are 'E', 'S' then it is a plaintext connection. Otherwise assume it is TLS.
- On outgoing connections, attempt to open a TLS connection. If the other node is plaintext that connection will fail quickly. If the connection fails, attempt a plaintext connection.
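The two rules above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual transport code: the class `DualStackSniffer`, the `Protocol` enum, and both method names are hypothetical. It relies on the fact that a TLS record starts with a content-type byte (0x16 for a handshake), which differs from 'E' (0x45).

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper; names are illustrative and not part of Elasticsearch.
final class DualStackSniffer {
    enum Protocol { PLAINTEXT, TLS, NEED_MORE_DATA }

    private DualStackSniffer() {}

    // Incoming side: classify a connection from the first bytes read so far.
    static Protocol classify(byte[] buf, int length) {
        if (length < 2) {
            return Protocol.NEED_MORE_DATA; // cannot decide yet
        }
        if (buf[0] == 'E' && buf[1] == 'S') {
            return Protocol.PLAINTEXT;      // plaintext transport header
        }
        return Protocol.TLS;                // anything else is assumed to be TLS
    }

    // Outgoing side: try TLS first and fall back to plaintext if that fails.
    // SSLException and ConnectException are both IOExceptions, so a failed
    // handshake or refused connection lands in the catch block.
    static <T> T connectWithFallback(Callable<T> tls, Callable<T> plaintext) throws Exception {
        try {
            return tls.call();
        } catch (IOException e) {
            return plaintext.call();
        }
    }
}
```

In this sketch the caller supplies the two connection attempts, so the fallback logic stays independent of whichever networking library actually opens the channel.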
There are two tricky parts of this issue:
- Normally we do not want to allow dual stack. @jaymode and I discussed a number of potential solutions and struggled to find a good one. We could add some type of dynamic setting that indicates whether dual stack is supported. However, there are a number of failure scenarios here, because a restarted node with security enabled NEEDS to know whether it supports plaintext before it joins the cluster. To ensure that the restarted node knows this, the customer would need to add the setting to each node's yml file, and then update the setting again (once the rolling restart is complete) to disable this functionality.
- We also need some type of reaper process that kills plaintext connections once dual stack is disabled. This is problematic because a connection may be open before we have identified whether it is plaintext. I think in this scenario we just want to kill the connection anyway. We could also check, on every read/write on a dual stack plaintext channel, whether dual stack networking is still enabled.
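One possible shape for the reaper, as a rough sketch only: the class `PlaintextReaper`, its methods, and the use of string channel ids are all hypothetical stand-ins for real channel objects. It combines both ideas above: disabling dual stack closes every channel that is not known to be TLS (including channels that have not been identified yet), and each read/write can call `mayUse` to re-check the flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; channel ids stand in for real channel objects.
final class PlaintextReaper {
    enum Kind { UNKNOWN, PLAINTEXT, TLS }

    private final Map<String, Kind> channels = new ConcurrentHashMap<>();
    private volatile boolean dualStackEnabled = true;

    // A newly accepted channel is UNKNOWN until its first bytes are inspected.
    void register(String channelId) {
        channels.put(channelId, Kind.UNKNOWN);
    }

    // Record the detected protocol; a no-op if the channel was already reaped.
    void identify(String channelId, Kind kind) {
        channels.replace(channelId, kind);
    }

    // Per read/write check: TLS is always allowed, anything else only while
    // dual stack is still enabled.
    boolean mayUse(String channelId) {
        Kind kind = channels.get(channelId);
        if (kind == null) {
            return false; // unknown or already reaped channel
        }
        return kind == Kind.TLS || dualStackEnabled;
    }

    // Disable dual stack and reap every channel not known to be TLS.
    // Returns the ids that the caller should close.
    List<String> disableDualStack() {
        dualStackEnabled = false;
        List<String> toClose = new ArrayList<>();
        for (Map.Entry<String, Kind> entry : channels.entrySet()) {
            Kind kind = entry.getValue();
            if (kind != Kind.TLS && channels.remove(entry.getKey(), kind)) {
                toClose.add(entry.getKey());
            }
        }
        return toClose;
    }
}
```

The per-operation check covers the race where a channel is accepted just after the reaper has scanned the registry, at the cost of one volatile read on each read/write.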
I think we need to discuss the best approach to the first issue. I can probably resolve the second one once we know how we are approaching the enablement of this functionality.