Retry the submit-application request to multiple nodes #69
Conversation
import scala.reflect.ClassTag
import scala.util.Random

private[kubernetes] class MultiServerFeignTarget[T : ClassTag](
This is inspired by http-remoting.
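As a rough illustration of the idea behind `MultiServerFeignTarget` (this is a minimal sketch in plain Scala with no Feign dependency; the class and method names here are stand-ins, not the PR's actual code): each thread keeps its own shuffled copy of the server list, serves requests against the head, rotates on failure, and restores the full list on success.

```scala
import scala.util.Random

// Sketch only: hypothetical names, assuming a fixed list of candidate URIs.
class MultiServerTarget(servers: Seq[String]) {
  require(servers.nonEmpty, "at least one server URI is required")

  // Each thread gets its own independently shuffled view of the servers.
  private val threadLocalShuffledServers: ThreadLocal[Seq[String]] =
    new ThreadLocal[Seq[String]] {
      override def initialValue(): Seq[String] = Random.shuffle(servers)
    }

  // The server the current thread should talk to next.
  def url(): String = threadLocalShuffledServers.get.head

  // On failure, rotate the head to the back so the next attempt hits a
  // different server.
  def markFailure(): Unit = {
    val current = threadLocalShuffledServers.get
    threadLocalShuffledServers.set(current.tail :+ current.head)
  }

  // On success, restore a freshly shuffled full list (analogous to the
  // "reset" behavior discussed in this thread for 2xx responses).
  def reset(): Unit =
    threadLocalShuffledServers.set(Random.shuffle(servers))
}
```

The thread-local list is what makes `url()` safe to call concurrently: two threads failing over at different rates never see each other's rotation state.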
ash211 left a comment:
I'll give this a shot and see if it fixes the issue in my test environment.
val resetTargetHttpClient = new Client {
  override def execute(request: Request, options: Options): Response = {
    val response = baseHttpClient.execute(request, options)
    if (response.status() >= 200 && response.status() < 300) {
Do you mean 2xx responses reset that target? I'd expect non-2xx responses to be what triggers a reset and fails over to another URI in the list.
See remoting. I'm not sure -- what's the right thing to check here, though?
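To make the decision being debated here concrete, this is a sketch (hypothetical function names, not the PR's code) of the semantics the snippet above implies: a 2xx status counts as success and ends the loop, while any other status fails over to the next URI in the list.

```scala
import scala.annotation.tailrec

// Sketch only: `doRequest` stands in for the real HTTP call and returns a
// status code. Try each server once, failing over on non-2xx responses.
def executeWithRetries(
    servers: Seq[String],
    doRequest: String => Int): Option[String] = {

  @tailrec
  def loop(remaining: Seq[String]): Option[String] = remaining match {
    case Nil => None // every candidate server failed
    case uri +: rest =>
      val status = doRequest(uri)
      if (status >= 200 && status < 300) Some(uri) // 2xx: success, stop here
      else loop(rest)                              // non-2xx: fail over
  }

  loop(servers)
}
```

Under these semantics the 2xx check in `resetTargetHttpClient` is the success path, which matches the reading that only non-2xx responses trigger failover.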
override def url(): String = threadLocalShuffledServers.get.head

/**
 * Cloning the target is done on every request, for use on the current
scalastyle is complaining about this -- the Spark project uses javadoc style comments not scaladoc
Hmm, so this works, but failover takes a significant amount of time. With this patch: (43sec gap) (21sec gap) (41sec gap). With a separate patch that just drops the master, logs: (7sec gap). It seems like it takes a significant amount of time to fail over to a different endpoint. Does it make sense to reduce the timeout for the pings when checking for liveness so it fails over faster? Is there something else we can do to reduce the time spent waiting for failover?
Is there a reason we don't combine the two? It seems like we should drop the master from the candidate list in any case before trying to connect to the nodes.
Just picking a single node at random from among the rest should also suffice. Kube-proxy is a critical system pod and should be running on all non-master nodes. If it's not running, there are usually issues on that node, and the administrator would need to step in.
Hm, I think the original implementation this was inspired by assumes that the servers it failed to connect to may become available later. This might not be the same semantics as what we're trying to accomplish, but as a principle it would be neater to not make the
Dropping the master from the node list either way seems like a good idea -- it would reduce the time for failover, at least in my case, and it sounds like kube-proxy on the master node isn't guaranteed anyway. @mccheah what do you think of dropping the master node from the list of URIs?
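The two suggestions in this thread (drop the master before retrying, and pick among the rest at random) can be sketched together as a single filtering step ahead of the retry loop. This is a hypothetical helper, and `isMaster` is a stand-in predicate; the actual check being discussed further down is the node's unschedulable field / taints.

```scala
import scala.util.Random

// Sketch only: drop the master from the candidate list, then shuffle the
// remainder so retries start from a random non-master node.
def candidateNodes(
    allNodes: Seq[String],
    isMaster: String => Boolean): Seq[String] =
  Random.shuffle(allNodes.filterNot(isMaster))
```

Feeding the result into the retry loop gives both behaviors at once: the master is never attempted, and the first attempt lands on a random remaining node.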
If we can filter the nodes at the level above the
We can use the Node's
Edited my above comment: fixed, as I had accidentally inverted the logic.
What normally creates that on a node's spec? I do see a taint on the node -- maybe that could be used as a proxy for whether the nodeport is running on a node?
The node controller is responsible for setting

The safest thing would be to check the taints annotation, and the
I added more logging to the

@foxish - what's the exact annotation key and value I need to check?
@foxish as you requested (hostnames redacted):
That looks odd. The master doesn't have the

Also, this is typical output:

In any case, @mccheah, I recommend just filtering on
Cool, I filed #73 to track further investigation. One remark is that the
* Retry the submit-application request to multiple nodes.
* Fix doc style comment
* Check node unschedulable, log retry failures
Closes #67. Uses a custom Feign Target that can retry requests against multiple servers.