[SPARK-25021][K8S] Add spark.executor.pyspark.memory limit for K8S #22298
Conversation
@rdblue @holdenk for review. This contains both unit and integration tests that verify [SPARK-25004] for K8S.

      .delete()
  }
  // TODO: [SPARK-25291] This test is flaky with regards to memory of executors
@mccheah This test periodically fails to set the proper memory for the executors. I have filed a JIRA: SPARK-25291
Kubernetes integration test starting

Test build #95519 has finished for PR 22298 at commit

Kubernetes integration test status success
holdenk left a comment
Did a quick first pass over this PR. Really excited to have this support in K8s as well as YARN, but I have some questions, especially around mixed-language pipelines.
Also really excited that the K8s integration tests are now integrated.
  @@ -0,0 +1,47 @@
  #
Is examples the right place for this?
It's the easiest place to put it so that it gets trivially picked up by the integration tests, as I did with pyfiles, but I am open to recommendations.
Shouldn't this be in the python tests (and get it to run only on certain cluster managers)?
That might be a good place for it. But the reason for this being in examples is that the integration tests can access it locally and we can see the success in the Jenkins environment. The K8s integration test suite does not run python run-tests.py, which means this test would not be part of the PRB.
I think the concern here is shipping a test as an example - this is the place where devs will be looking for examples of how to use pyspark, and having a memory test there is a bit strange.
| "Ensure that major Python version is either Python2 or Python3") | ||
| .createWithDefault("2") | ||
|
|
||
| val APP_RESOURCE_TYPE = |
Why this instead of the bools? What about folks who want to make a pipeline which is both R and Python?
The reason is that we are already running binding steps that configure the driver based on the app resource, so I thought we might as well pass the config down into the executors as part of that binding bootstrap step.
Currently, we don't have any Dockerfiles that handle mixed pipelines, so such configurations should be addressed in a follow-up PR, imo. But I am open to suggestions (that are testable).
Yeah, let's do something in a follow-up after 2.4.
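For context, a minimal sketch of how a string-typed, internal config entry like APP_RESOURCE_TYPE is typically declared with Spark's ConfigBuilder DSL; the key name, doc text, and wrapper object here are illustrative assumptions, not a quote of the PR:

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Hedged sketch only: assumes this lives inside the Spark codebase where
// ConfigBuilder is accessible; key and doc text are illustrative.
object KubernetesConfigSketch {
  val APP_RESOURCE_TYPE =
    ConfigBuilder("spark.kubernetes.resource.type")
      .doc("Internal marker for the primary app resource type (java, python, or r), " +
        "used to propagate language-specific settings such as pyspark memory to executors.")
      .internal()
      .stringConf
      .createOptional
}
```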
    .get(DRIVER_MEMORY_OVERHEAD)
    .getOrElse(math.max((conf.get(MEMORY_OVERHEAD_FACTOR) * driverMemoryMiB).toInt,
      MEMORY_OVERHEAD_MIN_MIB))
  // TODO: Have memory limit checks on driverMemory
Can we file a JIRA as well and include it in the comment?
I wanted to get an opinion from people (@mccheah) on whether we want to let the K8S API handle memory limits (via ResourceQuota limit errors) or whether we want to catch it with a Spark exception (if we were to include a configuration for memory limits).
Hm can you elaborate here? We already set the driver memory limit in this step based on the overhead.
Valid point. These are not necessary.
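To make the quoted computation concrete, here is a small worked sketch of the overhead math, assuming the usual 0.1 default for the overhead factor and 384 MiB for MEMORY_OVERHEAD_MIN_MIB; the driver memory value is just an example:

```scala
// Worked example of the overhead calculation in the diff above (illustrative values).
val driverMemoryMiB = 4096                                              // spark.driver.memory = 4g
val memoryOverheadMiB = math.max((0.1 * driverMemoryMiB).toInt, 384)    // max(409, 384) = 409
val driverMemoryWithOverheadMiB = driverMemoryMiB + memoryOverheadMiB   // 4505 MiB requested for the driver pod
```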
      (kubernetesConf.get(MEMORY_OVERHEAD_FACTOR) * executorMemoryMiB).toInt,
      MEMORY_OVERHEAD_MIN_MIB))
  private val executorMemoryWithOverhead = executorMemoryMiB + memoryOverheadMiB
  // TODO: Have memory limit checks on executorMemory
Same as the other TODO.
  // TODO: Have memory limit checks on executorMemory
  private val executorMemoryTotal = kubernetesConf.sparkConf
    .getOption(APP_RESOURCE_TYPE.key).map { res =>
      val additionalPySparkMemory = if (res == "python") {
So this means that we couldn't turn this on in a mixed-language pipeline even if the pipeline author did some fun hacks (since we might need to support Scala/Python/R all at the same time). Is there a reason we did this instead of the YARN approach of isPython?
Also, style-wise I'd perhaps write this as a case statement instead. What do you think?
Well, if there is a mixed pipeline, the binding steps would need to be reconfigured to handle it. But currently, the language is determined only from the appResource.
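For reference, a hedged sketch of the match-based alternative, reusing the names from the quoted diff (executorMemoryWithOverhead, PYSPARK_EXECUTOR_MEMORY, APP_RESOURCE_TYPE); this is illustrative, not the code that was merged:

```scala
// Sketch only: same logic as the if/else in the diff, written as a match.
private val executorMemoryTotal = kubernetesConf.sparkConf
  .getOption(APP_RESOURCE_TYPE.key).map { res =>
    val additionalPySparkMemory = res match {
      case "python" =>
        kubernetesConf.sparkConf
          .get(PYSPARK_EXECUTOR_MEMORY).map(_.toInt).getOrElse(0)
      case _ => 0
    }
    executorMemoryWithOverhead + additionalPySparkMemory
  }.getOrElse(executorMemoryWithOverhead)
```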
  RUN apk add --no-cache R R-dev

  COPY R ${SPARK_HOME}/R
Is this change intentional?
This keeps the docker build from re-running the R and R-dev installation each time the jars are updated. It's a minor change that helps with dev :)
  # Removed the .cache to save space
  rm -r /root/.cache

  COPY python/lib ${SPARK_HOME}/python/lib
Same - is this change intentional?
Same as the R change above ^^
  test("Run PySpark with memory customization", k8sTestTag) {
    sparkAppConf
      .set("spark.kubernetes.container.image", s"${getTestImageRepo}/spark-py:${getTestImageTag}")
nit: mildly confused why there are so many sets here (like the image, etc.) - maybe this would make more sense in a shared test setup func?
Some of this stuff can be factored out, I think; we just haven't done so yet. I wouldn't block a merge of this on such a refactor, but this entire test class could probably use some cleanup with respect to how the code is structured.
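One possible shape for that shared setup, with a hypothetical helper name and only the keys visible in this diff:

```scala
// Hypothetical helper: centralizes the PySpark image/version settings the
// memory tests repeat, so individual tests only set what they customize.
private def setBasePySparkConf(): Unit = {
  sparkAppConf
    .set("spark.kubernetes.container.image", s"${getTestImageRepo}/spark-py:${getTestImageTag}")
    .set("spark.kubernetes.pyspark.pythonVersion", "3")
}
```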
| .set("spark.kubernetes.pyspark.pythonVersion", "3") | ||
| .set("spark.kubernetes.memoryOverheadFactor", s"$memOverheadConstant") | ||
| .set("spark.executor.pyspark.memory", s"${additionalMemory}m") | ||
| .set("spark.python.worker.reuse", "false") |
Why is worker reuse being set to false?
in reference to #21977 (comment)
    sparkAppConf
      .set("spark.kubernetes.container.image", s"${getTestImageRepo}/spark-py:${getTestImageTag}")
      .set("spark.kubernetes.pyspark.pythonVersion", "3")
      .set("spark.kubernetes.memoryOverheadFactor", s"$memOverheadConstant")
Do we expect people who configure the rlimit advanced feature to also set the memoryOverheadConstant to a different value? If so, we should call it out in the docs. (Note: I think it would make sense for folks to set this to a lower value, so I think this would be the expected behaviour and we should document it, but I'm open to suggestions.)
I can add it to the docs, sure!
This is already covered in docs/running-on-kubernetes.md.
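For the docs discussion, a worked sketch of how these settings compose into the executor pod's memory request, following the executorMemoryTotal logic quoted earlier; the concrete numbers are illustrative only:

```scala
// Illustrative values: executor memory 2g, overhead factor 0.3, pyspark memory 512m.
val executorMemoryMiB = 2048
val overheadMiB = math.max((0.3 * executorMemoryMiB).toInt, 384)               // max(614, 384) = 614
val pysparkMemoryMiB = 512                                                     // spark.executor.pyspark.memory
val podMemoryRequestMiB = executorMemoryMiB + overheadMiB + pysparkMemoryMiB   // 3174 MiB
```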
Kubernetes integration test starting

Test build #95564 has finished for PR 22298 at commit

Kubernetes integration test status success

Looks fine to me, but I'm not familiar enough with the K8S code to have much of an opinion.

Test build #95570 has finished for PR 22298 at commit

https://issues.apache.org/jira/browse/SPARK-25291 looks like a real issue with the way the tests are written. So we don't necessarily want to ignore it for this patch, but we're still thinking about it.

Kubernetes integration test starting

Kubernetes integration test status success

@felixcheung @holdenk I have moved the PySpark example files to a more appropriate location. Any other comments before merge?

Kubernetes integration test starting

Kubernetes integration test status success

Test build #95686 has finished for PR 22298 at commit

Yeah this looks ok to me, would like a +1 from @felixcheung and @holdenk.

Kubernetes integration test starting

Kubernetes integration test status success
holdenk left a comment
Two minor nits: I think we should get rid of the set-to-false, but otherwise LGTM since K8s folks have already signed off on that part.
    SparkPod(pod.pod, withDriverArgs)
  }
  override def getAdditionalPodSystemProperties(): Map[String, String] = Map.empty
  override def getAdditionalPodSystemProperties(): Map[String, String] =
filed SPARK-25373 - Support mixed language pipelines on Spark on K8s
| .set("spark.kubernetes.pyspark.pythonVersion", "3") | ||
| .set("spark.kubernetes.memoryOverheadFactor", s"$memOverheadConstant") | ||
| .set("spark.executor.pyspark.memory", s"${additionalMemory}m") | ||
| .set("spark.python.worker.reuse", "false") |
I don't believe this should be set. Worker reuse is on by default in most systems, so I'm not sure this test should depend on worker reuse being false. As per @rdblue's investigation, this shouldn't impact this code path (and if it does, we need to re-open that investigation).
Kubernetes integration test starting

LGTM pending jenkins, see previous comments for details.

Kubernetes integration test status success

Test build #95809 has finished for PR 22298 at commit

Test build #95813 has finished for PR 22298 at commit
Merged to master. It's not a bug fix, but I think we should consider this for a backport to 2.4 since it's arguably the second half of a feature that's in 2.4. It doesn't backport cleanly as is, though, so maybe another PR just for the 2.4 branch.
+1 for 2.4 |
What changes were proposed in this pull request?
Add spark.executor.pyspark.memory limit for K8S
How was this patch tested?
Unit and Integration tests
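As a usage illustration of the feature described above (the image name and values here are examples, not taken from the PR):

```scala
import org.apache.spark.SparkConf

// Example only: requesting a dedicated PySpark worker memory limit on K8s.
val conf = new SparkConf()
  .set("spark.kubernetes.container.image", "myrepo/spark-py:latest")  // hypothetical image
  .set("spark.executor.memory", "2g")
  .set("spark.executor.pyspark.memory", "512m")  // now counted into the executor pod's memory request on K8s
```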