This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

@mccheah commented Feb 1, 2017

Closes #67. Use a custom Feign Target which can try making requests against multiple servers.

import scala.reflect.ClassTag
import scala.util.Random

private[kubernetes] class MultiServerFeignTarget[T : ClassTag](
mccheah (author):

This is inspired by http-remoting.
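For readers without http-remoting context, the core idea can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code, and the class and method names are made up: each thread holds its own shuffled copy of the server list, a failure rotates that list to the next candidate, and a reset reshuffles it.

```scala
import scala.util.Random

// Illustrative sketch of a multi-server failover target. Each thread
// keeps its own shuffled view of the server list so concurrent callers
// spread load and fail over independently.
class MultiServerTarget(servers: Seq[String]) {
  require(servers.nonEmpty, "at least one server is required")

  private val threadLocalShuffledServers: ThreadLocal[Seq[String]] =
    new ThreadLocal[Seq[String]] {
      override def initialValue(): Seq[String] = Random.shuffle(servers)
    }

  // The server the current thread should talk to next.
  def url: String = threadLocalShuffledServers.get.head

  // On a failed request, move the head to the back so the next attempt
  // on this thread targets a different server.
  def markFailure(): Unit = {
    val current = threadLocalShuffledServers.get
    threadLocalShuffledServers.set(current.tail :+ current.head)
  }

  // On success, reshuffle so future requests spread across servers again.
  def reset(): Unit =
    threadLocalShuffledServers.set(Random.shuffle(servers))
}
```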

@ash211 left a comment

I'll give this a shot and see if it fixes the issue in my test environment.

val resetTargetHttpClient = new Client {
override def execute(request: Request, options: Options): Response = {
val response = baseHttpClient.execute(request, options)
if (response.status() >= 200 && response.status() < 300) {
ash211:

Do you mean for 2xx responses to reset that target? I'd expect non-2xx responses to be what triggers a reset and fails over to another URI in the list.

mccheah (author):

See remoting. I'm not sure; what's the right thing to check here, though?
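One reading of the http-remoting behavior being referenced is: a 2xx response resets the rotation, a transport-level failure rotates to the next server, and a non-2xx response is an application error that passes through to the caller without triggering failover. A self-contained sketch under that assumption (names are illustrative, not the PR code):

```scala
// Sketch of reset-on-success semantics. Only transport failures rotate
// the server list; HTTP error statuses do not trigger failover.
sealed trait Outcome
final case class HttpResponse(status: Int) extends Outcome
final case class TransportFailure(message: String) extends Outcome

final class FailoverState(servers: List[String]) {
  require(servers.nonEmpty, "at least one server is required")
  private var current: List[String] = servers

  def activeServer: String = current.head

  def observe(outcome: Outcome): Unit = outcome match {
    // A 2xx response "resets" the target: the server is known healthy,
    // so restore the original preference order.
    case HttpResponse(status) if status >= 200 && status < 300 =>
      current = servers
    // A transport-level failure rotates to the next candidate.
    case TransportFailure(_) =>
      current = current.tail :+ current.head
    // Non-2xx responses are application errors; leave rotation alone.
    case HttpResponse(_) => ()
  }
}
```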

override def url(): String = threadLocalShuffledServers.get.head

/**
* Cloning the target is done on every request, for use on the current
ash211:

scalastyle is complaining about this -- the Spark project uses javadoc style comments not scaladoc

@ash211 commented Feb 1, 2017

Hmm, so this works, but failing over takes a significant amount of time:

With this patch:

2017-02-01 04:09:22 INFO  ServerConnector:266 - Started ServerConnector@769a1df5{HTTP/1.1}{org-apache-spark-examples-sparkpi-1485922157696:7077}
2017-02-01 04:09:22 INFO  Server:379 - Started @1294ms
2017-02-01 04:09:22 INFO  Utils:54 - Successfully started service on port 7077.
2017-02-01 04:09:22 INFO  KubernetesSparkRestServer:54 - Started REST server for submitting applications on port 7077
2017-02-01 04:10:05 INFO  SparkContext:54 - Running Spark version 2.2.0-SNAPSHOT
2017-02-01 04:10:05 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-02-01 04:10:05 INFO  SparkContext:54 - Submitted application: Spark Pi
2017-02-01 04:10:05 INFO  SecurityManager:54 - Changing view acls to: root
2017-02-01 04:10:05 INFO  SecurityManager:54 - Changing modify acls to: root
2017-02-01 04:10:05 INFO  SecurityManager:54 - Changing view acls groups to:

(43sec gap)

2017-02-01 06:05:47 INFO  Server:379 - Started @1358ms
2017-02-01 06:05:47 INFO  Utils:54 - Successfully started service on port 7077.
2017-02-01 06:05:47 INFO  KubernetesSparkRestServer:54 - Started REST server for submitting applications on port 7077
2017-02-01 06:06:08 INFO  SparkContext:54 - Running Spark version 2.2.0-SNAPSHOT
2017-02-01 06:06:08 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-02-01 06:06:08 INFO  SparkContext:54 - Submitted application: Spark Pi
2017-02-01 06:06:08 INFO  SecurityManager:54 - Changing view acls to: root

(21sec gap)

2017-02-01 06:07:39 INFO  Server:379 - Started @1406ms
2017-02-01 06:07:39 INFO  Utils:54 - Successfully started service on port 7077.
2017-02-01 06:07:39 INFO  KubernetesSparkRestServer:54 - Started REST server for submitting applications on port 7077
2017-02-01 06:08:20 INFO  SparkContext:54 - Running Spark version 2.2.0-SNAPSHOT
2017-02-01 06:08:20 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-02-01 06:08:20 INFO  SparkContext:54 - Submitted application: Spark Pi

(41sec gap)

With a separate patch that just drops the master:

diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/kubernetes/Client.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/kubernetes/Client.scala
index fed9334dbb..fa1df14666 100644
--- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/kubernetes/Client.scala
+++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/kubernetes/Client.scala
@@ -541,7 +541,7 @@ private[spark] class Client(
       .getNodePort
     // NodePort is exposed on every node, so just pick one of them.
     // TODO be resilient to node failures and try all of them
-    val node = kubernetesClient.nodes.list.getItems.asScala.head
+    val node = kubernetesClient.nodes.list.getItems.asScala.tail.head
     val nodeAddress = node.getStatus.getAddresses.asScala.head.getAddress
     val urlScheme = if (driverSubmitSslOptions.enabled) {
       "https"

logs:

2017-01-31 22:27:12 INFO  Server:379 - Started @1332ms
2017-01-31 22:27:12 INFO  Utils:54 - Successfully started service on port 7077.
2017-01-31 22:27:12 INFO  KubernetesSparkRestServer:54 - Started REST server for submitting applications on port 7077
2017-01-31 22:27:19 INFO  SparkContext:54 - Running Spark version 2.2.0-SNAPSHOT
2017-01-31 22:27:19 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-01-31 22:27:19 INFO  SparkContext:54 - Submitted application: Spark Pi
2017-01-31 22:27:19 INFO  SecurityManager:54 - Changing view acls to: root

(7sec gap)

It seems to take a significant amount of time to fail over to a different endpoint. Does it make sense to reduce the timeout on the liveness-check pings so failover happens faster? Is there anything else we can do to reduce the time spent waiting for failover?
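One way to shorten the gap, in the spirit of the suggestion above, is to probe candidate endpoints with an explicit, short connect timeout before committing to one, rather than waiting out the HTTP client's default. A rough sketch; the helper name and timeout value are illustrative, not part of the PR:

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

// Return the first host:port that accepts a TCP connection within the
// given timeout. Unreachable candidates cost at most connectTimeoutMillis
// each instead of the client's (much longer) default.
def firstReachable(hostPorts: Seq[(String, Int)],
                   connectTimeoutMillis: Int): Option[(String, Int)] =
  hostPorts.find { case (host, port) =>
    Try {
      val socket = new Socket()
      try socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis)
      finally socket.close()
    }.isSuccess
  }
```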

@foxish commented Feb 1, 2017

Is there a reason we don't combine the two? It seems like we should drop the master from the candidate list in any case before trying to connect to the nodes.

@foxish commented Feb 1, 2017

Just picking a single node at random from among the rest should also suffice. Kube-proxy is a critical system pod and should be running on all non-master nodes. If it's not running, there are usually issues on that node, and the administrator would need to step in.

@mccheah commented Feb 1, 2017

Hm, I think the original implementation this was inspired by assumes that servers it failed to connect to may become available again later. Those might not be the semantics we want here, but as a matter of principle it would be neater for MultiServerFeignTarget not to be opinionated about the HTTP client's particular use case.

@ash211 commented Feb 1, 2017

Dropping the master from the node list either way seems like a good idea -- it would reduce failover time at least in my case, and it sounds like kube-proxy on the master node isn't guaranteed anyway.

@mccheah what do you think of dropping the master node from the list of URIs?

@mccheah commented Feb 1, 2017

If we can filter the nodes at the level above the MultiServerFeignTarget then that would be better than filtering when we see a failed request at the HTTP level. However, this breaks local setup with e.g. minikube since in local mode the master also runs the proxy. It would be better to be able to tell if any given node is not running the proxy - @foxish is there any way to get that more precisely than just filter out the master node?

@foxish commented Feb 1, 2017

We can use the Node's .spec.unschedulable=true|false to decide whether kube-proxy is running on it. This seems to be the preferred way of finding nodes which we can hit. The networking sig is working on a different way to expose kube-proxy status via the node but it hasn't been merged yet. Our best bet I think would be to use the unschedulable field. Typically master nodes will have it set to true. In case of minikube, the master's .spec.unschedulable should not be marked true.

@foxish commented Feb 1, 2017

Edited my comment above to fix it; I had accidentally inverted the logic.

@ash211 commented Feb 1, 2017

On my kubeadm-created cluster I'm not seeing a .spec.unschedulable field on any of the nodes (master or worker). This is with k8s 1.5.1.

$ kubectl get node <HOST> -o yaml | grep -i unschedul

What normally creates that on a node's spec?

I do see a taint on the node -- maybe that could be used as a proxy for whether the NodePort is reachable on a node?

metadata:
  annotations:
    scheduler.alpha.kubernetes.io/taints: '[{"key":"dedicated","value":"master","effect":"NoSchedule"}]'
...

@foxish commented Feb 1, 2017

The node controller is responsible for setting .spec.unschedulable, and I can see that it is set on my 1.5.2 cluster. It is interesting that your master node has only the taint set. The taint has the same effect, but is the new method of marking nodes as unschedulable. I'm not sure what happened with the kubeadm bootstrapping process here and I'll file an issue with them. Could you show me the output of kubectl get nodes to see if the master has been marked as SchedulingDisabled?

The safest thing would be to check both the taints annotation and the .spec.unschedulable field; both have exactly the same effect when Unschedulable is set. We will need to update this in the future as the annotation graduates to beta and then becomes a proper API field. :(
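The combined check being described could look roughly like this. The Node case class below is a stand-in for the real Kubernetes client model, and matching the taint effect by substring is a simplification; a real implementation would parse the annotation's JSON. The annotation key is the one quoted earlier in the thread.

```scala
// Illustrative stand-in for the Kubernetes node model.
final case class Node(
    name: String,
    unschedulable: Boolean,
    annotations: Map[String, String])

val TaintsAnnotation = "scheduler.alpha.kubernetes.io/taints"

// A node is a usable candidate only if it is not cordoned
// (.spec.unschedulable) and carries no NoSchedule taint annotation.
def isSchedulable(node: Node): Boolean = {
  val taintedNoSchedule = node.annotations
    .get(TaintsAnnotation)
    .exists(_.contains("\"effect\":\"NoSchedule\""))
  !node.unschedulable && !taintedNoSchedule
}
```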

@mccheah commented Feb 1, 2017

I added more logging to the Retry code, as I think some errors are being swallowed there as well.

@foxish - what's the exact annotation key and value I need to check? node.getMetadata.getAnnotations returns a java.util.Map<String, String> - presumably I need to find the specific key-value pair to filter out nodes by here.

@ash211 commented Feb 2, 2017

@foxish as you requested (hostnames redacted):

[user@host-766416 ~]$ kubectl get nodes
NAME                 STATUS         AGE
host-766416.domain   Ready,master   8d
host-766417.domain   Ready          8d
host-766418.domain   Ready          8d
host-766419.domain   Ready          8d
[user@host-766416 ~]$
[user@host-766416 ~]$ kubectl get nodes -o yaml | grep -i schedulingdisabled
[user@host-766416 ~]$

@foxish commented Feb 2, 2017

That looks odd. The master doesn't have the SchedulingDisabled label either. I think this is a kubeadm bootstrapping issue. This looks related: kubernetes/kubeadm#140. Some more context is in kubernetes/kubernetes#36272. I'll have a longer discussion about this, but the annotation is about to turn into an API field in 1.6.

Also, this is typical output:

NAME                           STATUS                     AGE
kubernetes-master              Ready,SchedulingDisabled   20d
kubernetes-minion-group-3pvp   Ready                      20d
kubernetes-minion-group-cnk6   Ready                      20d
kubernetes-minion-group-j0qh   Ready                      20d

In any case, @mccheah, I recommend just filtering on unschedulable=true for now, and leaving an issue open. I'll figure out the taints/tolerations and send a PR adding that sometime next week if it's necessary.

@mccheah commented Feb 2, 2017

Cool, I filed #73 to track further investigation.

One remark is that with the MultiServerFeignTarget, a failed application submission will trigger another submission to a different node. This could result in us pushing the jars and files over the wire multiple times, which could be expensive given the size of the payload. I'm not sure how significant this will be in practice, though, especially since we send the data to a different node each time.

@mccheah commented Feb 2, 2017

@foxish @ash211 Anything else to add here?

@ash211 commented Feb 2, 2017

Nope. Looks like we might remove the pinging process entirely in #75 and might more proactively prune out nodes in the MultiServerFeignTarget in #73 but both of those are extensions to this.

Merging

ash211 merged commit 0c3ff11 into k8s-support-alternate-incremental Feb 2, 2017
ash211 deleted the retry-multiple-nodes branch February 2, 2017 18:58
ash211 pushed a commit that referenced this pull request Feb 8, 2017
* Retry the submit-application request to multiple nodes.

* Fix doc style comment

* Check node unschedulable, log retry failures
ash211 pushed a commit that referenced this pull request Mar 8, 2017
foxish pushed a commit that referenced this pull request Jul 24, 2017
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 25, 2019
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019