This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

@mccheah commented Feb 1, 2017

Closes #67. Use a custom Feign Target which can try making requests against multiple servers.

import scala.reflect.ClassTag
import scala.util.Random

private[kubernetes] class MultiServerFeignTarget[T : ClassTag](
mccheah (author):

This is inspired by http-remoting.
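For readers without http-remoting context, the core idea can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code, and the class and method names are made up: each thread holds its own shuffled copy of the server list, a failure rotates that list to the next candidate, and a reset reshuffles it.

```scala
import scala.util.Random

// Illustrative sketch of a multi-server failover target. Each thread
// keeps its own shuffled view of the server list so concurrent callers
// spread load and fail over independently.
class MultiServerTarget(servers: Seq[String]) {
  require(servers.nonEmpty, "at least one server is required")

  private val threadLocalShuffledServers: ThreadLocal[Seq[String]] =
    new ThreadLocal[Seq[String]] {
      override def initialValue(): Seq[String] = Random.shuffle(servers)
    }

  // The server the current thread should talk to next.
  def url: String = threadLocalShuffledServers.get.head

  // On a failed request, move the head to the back so the next attempt
  // on this thread targets a different server.
  def markFailure(): Unit = {
    val current = threadLocalShuffledServers.get
    threadLocalShuffledServers.set(current.tail :+ current.head)
  }

  // On success, reshuffle so future requests spread across servers again.
  def reset(): Unit =
    threadLocalShuffledServers.set(Random.shuffle(servers))
}
```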

@ash211 left a comment

I'll give this a shot and see if it fixes the issue in my test environment.

val resetTargetHttpClient = new Client {
override def execute(request: Request, options: Options): Response = {
val response = baseHttpClient.execute(request, options)
if (response.status() >= 200 && response.status() < 300) {
ash211:

Do you mean for 2xx responses to reset that target? I'd expect non-2xx responses to be what triggers a reset and fails over to another URI in the list.

mccheah (author):

See remoting. I'm not sure; what's the right thing to check here, though?
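One reading of the http-remoting behavior being referenced is: a 2xx response resets the rotation, a transport-level failure rotates to the next server, and a non-2xx response is an application error that passes through to the caller without triggering failover. A self-contained sketch under that assumption (names are illustrative, not the PR code):

```scala
// Sketch of reset-on-success semantics. Only transport failures rotate
// the server list; HTTP error statuses do not trigger failover.
sealed trait Outcome
final case class HttpResponse(status: Int) extends Outcome
final case class TransportFailure(message: String) extends Outcome

final class FailoverState(servers: List[String]) {
  require(servers.nonEmpty, "at least one server is required")
  private var current: List[String] = servers

  def activeServer: String = current.head

  def observe(outcome: Outcome): Unit = outcome match {
    // A 2xx response "resets" the target: the server is known healthy,
    // so restore the original preference order.
    case HttpResponse(status) if status >= 200 && status < 300 =>
      current = servers
    // A transport-level failure rotates to the next candidate.
    case TransportFailure(_) =>
      current = current.tail :+ current.head
    // Non-2xx responses are application errors; leave rotation alone.
    case HttpResponse(_) => ()
  }
}
```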

override def url(): String = threadLocalShuffledServers.get.head

/**
* Cloning the target is done on every request, for use on the current
ash211:

scalastyle is complaining about this -- the Spark project uses javadoc style comments not scaladoc

@ash211 commented Feb 1, 2017

Hmm, so this works, but failing over takes a significant amount of time:

With this patch:

2017-02-01 04:09:22 INFO  ServerConnector:266 - Started ServerConnector@769a1df5{HTTP/1.1}{org-apache-spark-examples-sparkpi-1485922157696:7077}
2017-02-01 04:09:22 INFO  Server:379 - Started @1294ms
2017-02-01 04:09:22 INFO  Utils:54 - Successfully started service on port 7077.
2017-02-01 04:09:22 INFO  KubernetesSparkRestServer:54 - Started REST server for submitting applications on port 7077
2017-02-01 04:10:05 INFO  SparkContext:54 - Running Spark version 2.2.0-SNAPSHOT
2017-02-01 04:10:05 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-02-01 04:10:05 INFO  SparkContext:54 - Submitted application: Spark Pi
2017-02-01 04:10:05 INFO  SecurityManager:54 - Changing view acls to: root
2017-02-01 04:10:05 INFO  SecurityManager:54 - Changing modify acls to: root
2017-02-01 04:10:05 INFO  SecurityManager:54 - Changing view acls groups to:

(43sec gap)

2017-02-01 06:05:47 INFO  Server:379 - Started @1358ms
2017-02-01 06:05:47 INFO  Utils:54 - Successfully started service on port 7077.
2017-02-01 06:05:47 INFO  KubernetesSparkRestServer:54 - Started REST server for submitting applications on port 7077
2017-02-01 06:06:08 INFO  SparkContext:54 - Running Spark version 2.2.0-SNAPSHOT
2017-02-01 06:06:08 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-02-01 06:06:08 INFO  SparkContext:54 - Submitted application: Spark Pi
2017-02-01 06:06:08 INFO  SecurityManager:54 - Changing view acls to: root

(21sec gap)

2017-02-01 06:07:39 INFO  Server:379 - Started @1406ms
2017-02-01 06:07:39 INFO  Utils:54 - Successfully started service on port 7077.
2017-02-01 06:07:39 INFO  KubernetesSparkRestServer:54 - Started REST server for submitting applications on port 7077
2017-02-01 06:08:20 INFO  SparkContext:54 - Running Spark version 2.2.0-SNAPSHOT
2017-02-01 06:08:20 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-02-01 06:08:20 INFO  SparkContext:54 - Submitted application: Spark Pi

(41sec gap)

With a separate patch that just drops the master:

diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/kubernetes/Client.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/kubernetes/Client.scala
index fed9334dbb..fa1df14666 100644
--- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/kubernetes/Client.scala
+++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/kubernetes/Client.scala
@@ -541,7 +541,7 @@ private[spark] class Client(
       .getNodePort
     // NodePort is exposed on every node, so just pick one of them.
     // TODO be resilient to node failures and try all of them
-    val node = kubernetesClient.nodes.list.getItems.asScala.head
+    val node = kubernetesClient.nodes.list.getItems.asScala.tail.head
     val nodeAddress = node.getStatus.getAddresses.asScala.head.getAddress
     val urlScheme = if (driverSubmitSslOptions.enabled) {
       "https"

logs:

2017-01-31 22:27:12 INFO  Server:379 - Started @1332ms
2017-01-31 22:27:12 INFO  Utils:54 - Successfully started service on port 7077.
2017-01-31 22:27:12 INFO  KubernetesSparkRestServer:54 - Started REST server for submitting applications on port 7077
2017-01-31 22:27:19 INFO  SparkContext:54 - Running Spark version 2.2.0-SNAPSHOT
2017-01-31 22:27:19 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-01-31 22:27:19 INFO  SparkContext:54 - Submitted application: Spark Pi
2017-01-31 22:27:19 INFO  SecurityManager:54 - Changing view acls to: root

(7sec gap)

It seems to take a significant amount of time to fail over to a different endpoint. Does it make sense to reduce the timeout on the liveness-check pings so failover happens faster? Is there anything else we can do to reduce the time spent waiting for failover?
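One way to shorten the gap, in the spirit of the suggestion above, is to probe candidate endpoints with an explicit, short connect timeout before committing to one, rather than waiting out the HTTP client's default. A rough sketch; the helper name and timeout value are illustrative, not part of the PR:

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

// Return the first host:port that accepts a TCP connection within the
// given timeout. Unreachable candidates cost at most connectTimeoutMillis
// each instead of the client's (much longer) default.
def firstReachable(hostPorts: Seq[(String, Int)],
                   connectTimeoutMillis: Int): Option[(String, Int)] =
  hostPorts.find { case (host, port) =>
    Try {
      val socket = new Socket()
      try socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis)
      finally socket.close()
    }.isSuccess
  }
```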

@foxish commented Feb 1, 2017

Is there a reason we don't combine the two? It seems like we should drop the master from the candidate list in any case before trying to connect to the nodes.

@foxish commented Feb 1, 2017

Just picking a single node at random from among the rest should also suffice. Kube-proxy is a critical system pod and should be running on all non-master nodes. If it's not running, there are usually issues on that node, and the administrator would need to step in.

@mccheah commented Feb 1, 2017

Hm, I think the original implementation this was inspired by assumes that servers it failed to connect to may become available again later. Those might not be the semantics we want here, but as a matter of principle it would be neater for MultiServerFeignTarget not to be opinionated about the HTTP client's particular use case.

@ash211 commented Feb 1, 2017

Dropping the master from the node list either way seems like a good idea -- it would reduce failover time at least in my case, and it sounds like kube-proxy on the master node isn't guaranteed anyway.

@mccheah what do you think of dropping the master node from the list of URIs?

@mccheah commented Feb 1, 2017

If we can filter the nodes at the level above the MultiServerFeignTarget then that would be better than filtering when we see a failed request at the HTTP level. However, this breaks local setup with e.g. minikube since in local mode the master also runs the proxy. It would be better to be able to tell if any given node is not running the proxy - @foxish is there any way to get that more precisely than just filter out the master node?

@foxish commented Feb 1, 2017

We can use the Node's .spec.unschedulable=true|false to decide whether kube-proxy is running on it. This seems to be the preferred way of finding nodes which we can hit. The networking sig is working on a different way to expose kube-proxy status via the node but it hasn't been merged yet. Our best bet I think would be to use the unschedulable field. Typically master nodes will have it set to true. In case of minikube, the master's .spec.unschedulable should not be marked true.

@foxish commented Feb 1, 2017

Edited my comment above to fix it; I had accidentally inverted the logic.

@ash211 commented Feb 1, 2017

On my kubeadm-created cluster I'm not seeing a .spec.unschedulable field on any of the nodes (master or worker). This is with k8s 1.5.1.

$ kubectl get node <HOST> -o yaml | grep -i unschedul

What normally creates that on a node's spec?

I do see a taint on the node -- maybe that could be used as a proxy for whether the NodePort is reachable on a node?

metadata:
  annotations:
    scheduler.alpha.kubernetes.io/taints: '[{"key":"dedicated","value":"master","effect":"NoSchedule"}]'
...

@foxish commented Feb 1, 2017

The node controller is responsible for setting .spec.unschedulable, and I can see that it is set on my 1.5.2 cluster. It is interesting that your master node has only the taint set. The taint has the same effect, but is the new method of marking nodes as unschedulable. I'm not sure what happened with the kubeadm bootstrapping process here and I'll file an issue with them. Could you show me the output of kubectl get nodes to see if the master has been marked as SchedulingDisabled?

The safest thing would be to check both the taints annotation and the .spec.unschedulable field; both have exactly the same effect when Unschedulable is set. We will need to update this in the future as the annotation graduates to beta and then becomes a proper API field. :(
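The combined check being described could look roughly like this. The Node case class below is a stand-in for the real Kubernetes client model, and matching the taint effect by substring is a simplification; a real implementation would parse the annotation's JSON. The annotation key is the one quoted earlier in the thread.

```scala
// Illustrative stand-in for the Kubernetes node model.
final case class Node(
    name: String,
    unschedulable: Boolean,
    annotations: Map[String, String])

val TaintsAnnotation = "scheduler.alpha.kubernetes.io/taints"

// A node is a usable candidate only if it is not cordoned
// (.spec.unschedulable) and carries no NoSchedule taint annotation.
def isSchedulable(node: Node): Boolean = {
  val taintedNoSchedule = node.annotations
    .get(TaintsAnnotation)
    .exists(_.contains("\"effect\":\"NoSchedule\""))
  !node.unschedulable && !taintedNoSchedule
}
```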

@mccheah commented Feb 1, 2017

I added more logging to the Retry code, as I think some errors are being swallowed there as well.

@foxish - what's the exact annotation key and value I need to check? node.getMetadata.getAnnotations returns a java.util.Map<String, String> - presumably I need to find the specific key-value pair to filter out nodes by here.

@ash211 commented Feb 2, 2017

@foxish as you requested (hostnames redacted):

[user@host-766416 ~]$ kubectl get nodes
NAME                 STATUS         AGE
host-766416.domain   Ready,master   8d
host-766417.domain   Ready          8d
host-766418.domain   Ready          8d
host-766419.domain   Ready          8d
[user@host-766416 ~]$
[user@host-766416 ~]$ kubectl get nodes -o yaml | grep -i schedulingdisabled
[user@host-766416 ~]$

@foxish commented Feb 2, 2017

That looks odd. The master doesn't have the SchedulingDisabled label either. I think this is a kubeadm bootstrapping issue. This looks related: kubernetes/kubeadm#140. Some more context is in kubernetes/kubernetes#36272. I'll have a longer discussion about this, but the annotation is about to turn into an API field in 1.6.

Also, this is typical output:

NAME                           STATUS                     AGE
kubernetes-master              Ready,SchedulingDisabled   20d
kubernetes-minion-group-3pvp   Ready                      20d
kubernetes-minion-group-cnk6   Ready                      20d
kubernetes-minion-group-j0qh   Ready                      20d

In any case, @mccheah, I recommend just filtering on unschedulable=true for now, and leaving an issue open. I'll figure out the taints/tolerations and send a PR adding that sometime next week if it's necessary.

@mccheah commented Feb 2, 2017

Cool, I filed #73 to track further investigation.

One remark is that with the MultiServerFeignTarget, a failed application submission will trigger another submission to a different node. This could result in us pushing the jars and files over the wire multiple times, which could be expensive given the size of the payload. I'm not sure how significant this will be in practice, though, especially since we send the data to a different node each time.

@mccheah commented Feb 2, 2017

@foxish @ash211 Anything else to add here?

@ash211 commented Feb 2, 2017

Nope. Looks like we might remove the pinging process entirely in #75 and might more proactively prune out nodes in the MultiServerFeignTarget in #73 but both of those are extensions to this.

Merging

ash211 merged commit 0c3ff11 into k8s-support-alternate-incremental Feb 2, 2017
ash211 deleted the retry-multiple-nodes branch February 2, 2017 18:58
ash211 pushed a commit that referenced this pull request Feb 8, 2017
* Retry the submit-application request to multiple nodes.

* Fix doc style comment

* Check node unschedulable, log retry failures
ash211 pushed a commit that referenced this pull request Mar 8, 2017
foxish pushed a commit that referenced this pull request Jul 24, 2017
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 25, 2019
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019