TPR Support #284
Conversation
@foxish putting this out there, still hacking through it. Feel free to jump in to suggest immediate changes
Force-pushed fbf8d94 to 8450333
mccheah left a comment
Seeing that this is a work in progress, I took a preliminary scan over the changes so far and made some early suggestions.
We likely want this class and the class below to be in separate files.
Can we be more specific with this variable name?
We usually use

    try {
      ...
    } catch {
      case e: SparkException =>
    }

I don't think the Spark codebase uses Scala's Try(...) match idiom. However, I actually think the Try(...) match idiom makes sense here, so if there are other examples of this being done in the code then it's also appropriate to do so here.
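For readers unfamiliar with the two idioms being compared, here is a hedged, self-contained sketch; parsePort and the fallback value are illustrative inventions, not code from this PR.

```scala
import scala.util.{Failure, Success, Try}

// The Try(...) match idiom discussed above.
def parsePortTryMatch(s: String): Int = Try(s.toInt) match {
  case Success(port) => port
  case Failure(_)    => 4040 // illustrative fallback
}

// The equivalent try/catch style more common in the Spark codebase.
def parsePortTryCatch(s: String): Int =
  try {
    s.toInt
  } catch {
    case _: NumberFormatException => 4040
  }
```

Both behave the same; the Try form avoids a mutable escape from the catch block, which is why it can read better when the failure case produces a value.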
Likely should be marked private[spark].
Indentation seems off here. Look at v2/Client.scala to see how we handle long argument lists.
Remove whitespace.
This class seems particularly important, so some Javadoc on it would be good. I would also lean away from using an abstract class here. If we need these fields across multiple implementations, consider making this a concrete class that packages up the common operations and fields, and having different implementations contain an instance of that concrete class. In other words, let's try to have a design that favors composition over inheritance.
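A minimal sketch of the composition-over-inheritance suggestion above; the names ResourceCommon and HttpSparkJobResourceController are hypothetical, not from the PR.

```scala
// Shared fields and operations live in a concrete helper...
class ResourceCommon(val namespace: String, val apiVersion: String) {
  def resourcePath(name: String): String =
    s"/apis/$apiVersion/namespaces/$namespace/sparkjobs/$name"
}

// ...and each implementation holds an instance of the helper
// instead of extending an abstract base class.
class HttpSparkJobResourceController(common: ResourceCommon) {
  def requestPath(jobName: String): String = common.resourcePath(jobName)
}
```

This keeps the common state testable on its own and lets implementations swap it out without touching an inheritance hierarchy.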
Thanks for looking at this, I'll address your comments.
conf/kubernetes-resource.yaml
Outdated
spark-job.apache.org
apache.org
We need to put URLs here for the driver and executor UIs as accessible via the API server proxy. I'd suggest relative URLs of the form /api/v1/..../spark-driver:port, because that way we can get to them from the kubectl dashboard.
sounds good.
Force-pushed 1a5231e to b519c3e
Removed the watch (because I think we can leave it out for now, considering that TPRs themselves are considered alpha) TODO:
rerun unit tests please
I'll add the doc changes and some small changes to make the update portion take one or multiple requests at once.
Force-pushed 6766ca2 to 17ee380
rerun unit tests please
Force-pushed 17ee380 to a3bbf6a
@varunkatta @kimoonkim Jenkins fell over again, I think
rerun all tests please
Force-pushed 093f5f5 to 97ca241
rerun unit tests please
Force-pushed ad1daf7 to 91ffdd1
rerun unit tests please
@mccheah could you please help take a quick look at this unit test failure? I'm finding it hard to understand why the "run with dependency uploader" test passes and the "Run without dependency uploader" test fails on the verify step.
@iyanuobidele rebased and fixed conflicts with the other PR that just merged. No other changes.
docs/running-on-kubernetes.md
Outdated
Prerequisites
We probably want to use a specific formatter here.
Calendar is a really old class that's no longer good practice to use. Can we use joda classes or the Java 8 time classes instead?
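A minimal sketch of what the java.time replacement could look like; the helper name currentTimestamp is illustrative, not from the PR.

```scala
import java.time.Instant
import java.time.format.DateTimeFormatter

// ISO-8601 UTC timestamp via java.time, replacing Calendar-based formatting.
def currentTimestamp(): String =
  DateTimeFormatter.ISO_INSTANT.format(Instant.now())
```

ISO_INSTANT is thread-safe and needs no explicit pattern or time zone handling, unlike Calendar with SimpleDateFormat.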
Move sparkConf.set(...) down one line.
Instead of using this, look at what is done here: https://github.com/apache-spark-on-k8s/spark/pull/305/files#diff-801d4c840d0e60f5521c12f9389598b9R30
Essentially we can create a subclass of TypeReference for the enumeration's type, and then use @JsonScalaEnumeration wherever we are embedding an instance of the enumeration as a field of another class.
Inject the trait instead of the impl.
the naming here between the object and the class doesn't match. I like the name SparkJobResourceController better than TPRCrudCalls. Maybe you can rename the class?
Does the fact that this variable is not localized to this method mean that when we watch multiple objects we can lose the watch source? What about if two callers concurrently invoke this method?
The resolved value of this should be in the Kubernetes client object.
The controller class itself should probably be logging its success or failure.
Yep, and that might eliminate this method entirely by moving everything into the updateJobObject call
Name the Scala file after the trait and not the impl.
Based on some of the usages of this, it would be nice to have a "batch update" method that can update multiple fields at once. In doing so, we can encode updates with a POJO that's more expressive than just String fields:

    case class SparkJobPatch(field1: Option[String] = None, field2: Option[String] = None)

Since the fields are all optional with default values, we can be as selective as we want in what items to update. We can accomplish something similar with an updateJobObject method signature that takes in all of the possible updatable fields with default values, but I like encapsulating this in a single variadic argument type, so to speak.
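A hypothetical expansion of the SparkJobPatch idea: only the fields set to Some(...) become update operations. The field names here are illustrative, not the PR's actual schema.

```scala
case class SparkJobPatch(
    completionTime: Option[String] = None,
    numExecutors: Option[Int] = None,
    sparkUiUrl: Option[String] = None)

// Collapse a patch into (fieldName, value) pairs; unset fields are skipped.
def toFieldUpdates(patch: SparkJobPatch): Seq[(String, String)] =
  Seq(
    patch.completionTime.map("completionTime" -> _),
    patch.numExecutors.map(n => "numExecutors" -> n.toString),
    patch.sparkUiUrl.map("sparkUiUrl" -> _)
  ).flatten
```

A caller can then do toFieldUpdates(SparkJobPatch(numExecutors = Some(3))) and send all resulting operations in one request.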
I agree
Can we use a more specific type than a map? If we always want the same fields, use a case class with the specific fields enumerated.
Can we define the status as a case class as opposed to an arbitrary Map?
ash211 left a comment
Good work @iyanuobidele ! Did a thorough review and left a bunch of comments.
One of the bigger questions is around the purpose of TPR. Right now it's mainly doing status updating (timestamps, executor counts, Spark UI URL). Do you imagine growing this to have more reporting here as well?
As written it seems like a "status updater" that happens to report to a TPR, but could go to other places as well (see e.g. the existing Spark event log). Possibly even TPR is one of many implementations of this.
This also ties into how similar it is to the existing Spark events infrastructure, which is how Spark reports status through the driver back to a user application, and also how it writes logs to the Spark event log for post-application diagnosis.
If you registered a SparkListenerInterface could you listen to all the events (and lay the groundwork for future task/stage/metrics updating) instead of injecting method reporting calls throughout the main logic?
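A hedged sketch of the listener-driven design suggested above: status reporting reacts to events rather than being injected into the main logic. The event and sink types below are illustrative stand-ins defined locally, not Spark's SparkListenerInterface API.

```scala
sealed trait JobEvent
case class ExecutorsChanged(count: Int) extends JobEvent
case class JobEnded(succeeded: Boolean) extends JobEvent

// A sink could be a TPR updater, the event log, or anything else.
trait StatusSink {
  def update(field: String, value: String): Unit
}

class StatusReportingListener(sink: StatusSink) {
  def onEvent(event: JobEvent): Unit = event match {
    case ExecutorsChanged(n) => sink.update("numExecutors", n.toString)
    case JobEnded(ok)        => sink.update("state", if (ok) "FINISHED" else "FAILED")
  }
}
```

With this shape, a TPR-backed sink becomes one of several interchangeable implementations, which matches the "status updater that happens to report to a TPR" observation above.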
docs/running-on-kubernetes.md
Outdated
put the version requirement information here. starting in what version of k8s will the provided yaml file work?
Calendar is a really old class that's no longer good practice to use. Can we use joda classes or the Java 8 time classes instead?
what's the 10min lag you refer to here? is this something intrinsic about TPRs? can you link to something in comments?
That's my bad. It's 10s. Here's a comment on an issue opened on TPRs
Make these two config items entries in config.scala with the other kubernetes-relevant config
I also don't think the current naming reflects what this does. Right now it describes the state it expects (the job resource has been set on the cluster) rather than what actions the code takes when the flag is set (create a TPR and report status to it as the job progresses). Maybe name something like spark.kubernetes.statusReporting.enabled=true and spark.kubernetes.statusReporting.resourceName=<asdf> ?
what if the cluster admin killed the pods via kubectl -- what would that show up as?
does this assume the user is accessing the SparkUI via a kubectl proxy on localhost:8001? Can we support ingress-based exposure of the Spark UI too?
the naming here between the object and the class doesn't match. I like the name SparkJobResourceController better than TPRCrudCalls. Maybe you can rename the class?
for updateJobObject it seems like there's a caller that expects this to throw a SparkException in a certain way.
Can you document that API on this method in the trait? Same for other methods if they're expected to throw specific exceptions in specific situations.
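A hypothetical sketch of documenting the exception contract on the trait, as requested above. SparkException here is a local stand-in for org.apache.spark.SparkException, and the method signature is illustrative.

```scala
class SparkException(message: String) extends Exception(message)

trait SparkJobResourceController {
  /**
   * Updates one field of the job resource.
   *
   * @throws SparkException if the resource does not exist or the API server
   *                        rejects the update.
   */
  def updateJobObject(name: String, value: String, fieldPath: String): Unit
}
```

Callers that rely on catching SparkException then have the contract stated in one place instead of inferring it from one implementation.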
Yep, and that might eliminate this method entirely by moving everything into the updateJobObject call
this } looks odd to my eyes -- should it be moved to the previous line?
Force-pushed 0fe9b15 to f762f85
### Future work

Kube administrators or users would be able to stop a spark app running in their cluster by simply
Kubernetes cluster administrators or users should be able to stop...
    response,
    Option(Seq(tprObjectName, response.message(), request.toString)))

    response.body().close()
Put close() in finally blocks
    val msg =
      s"Failed to delete resource. ${x.getMessage}."
    logError(msg)
    response.close()
close() should be in a finally block.
Actually, since ResponseBody implements Closeable, look into how we can use Utils.tryWithResource(...) {...}.
      .build()

    logDebug(s"Get Request: $request")
    var response: Response = null
You can use

    val response = try {
      httpClient.newCall(request).execute()
    }
Just found out, as I've mentioned above, that ResponseBody implements Closeable, so we can do this:

    Utils.tryWithResource(httpClient.newCall(request).execute()) { responseBody =>
      // operation with responseBody
    }
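For context, a minimal standalone sketch of the tryWithResource pattern referenced above (Spark provides Utils.tryWithResource; this local version just shows the shape and is not the actual Spark implementation):

```scala
// Run f on a freshly created resource and always close it, even on exceptions.
def tryWithResource[R <: AutoCloseable, T](createResource: => R)(f: R => T): T = {
  val resource = createResource
  try {
    f(resource)
  } finally {
    resource.close()
  }
}
```

Wrapping the OkHttp call this way removes the need for a nullable var response and makes the NullPointerException-on-failed-create problem mentioned below impossible, since close() only runs on a resource that was actually created.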
    val msg =
      s"Failed to get resource $name. ${x.getMessage}."
    logError(msg)
    response.close()
If you close here and the call to create failed, then you'll get a NullPointerException. If you use the paradigm as I've commented a few lines above however then this problem can be avoided.
    logDebug(s"Update Request: $request")
    var response: Response = null
    try {
    val response = try {...

Repeat this for all similar blocks.
    def extractHttpClientFromK8sClient(client: BaseClient): OkHttpClient = {
      val field = classOf[BaseClient].getDeclaredField("httpClient")
      try {
        field.setAccessible(true)
I'm looking at BaseClient and it has a getHttpClient method. Is it therefore necessary to use reflection here?
https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/BaseClient.java#L113
        additionalInfo: Option[Seq[String]] = None): Unit = {

    if (!response.isSuccessful) {
      response.body().close()
Don't close in methods that didn't open the response body.
@iyanuobidele is this PR still active? I'm unsure of the current state of TPRs (since renamed?) so I don't know how much we can reuse from this PR with the new APIs in upcoming Kubernetes releases.
Closing for inactivity -- please feel free to reopen when conflicts are merged and this is ready for more review!
…apache [NOSQUASH] Resync from Apache
Following from the discussions from the SIG to keep the TPR manipulation implementations in this project until we make upstream changes to the k8s client.
This supports the crud+watch on the SparkJobResource using the suggested schema