[SPARK-48017] Add Spark application submission worker for operator #10
Conversation
gradle.properties
fabric8Version=6.12.1
# Caution: fabric8 version should be aligned with Spark dependency
fabric8Version=6.7.2
This is caused by the co-existing dependency on Spark 3.5.1. fabric8 6.7.2 and 6.12.1 are not binary compatible; it complains even if we exclude fabric8 from the Spark dependency.
A reminder has been added here to align the fabric8 version used by Spark and the operator. I'll update this when the 4.0.0 RCs become available.
The incompatibility was caught by unit tests.
dongjoon-hyun
left a comment
Thank you for making a PR.
As we observe here, I'm not sure if the AS-IS design is sufficient or robust across multiple Spark versions. If we cannot find a way to be robust, it will bite us again and again in production.
It seems that we need a shim layer or multiple modules like Iceberg.
Yes, that's one mid-term goal that we will target for operator v1.0 in order to become fully version-agnostic. This PR proposes a single submission worker based on the latest spark-kubernetes module; considering its history, we tested compatibility with Spark 3.2, 3.3, 3.4, and 3.5. We can do the same for 4.0 to ensure no breaking change is introduced. This is the pattern adopted by most operator solutions, like the Flink operator and the Google Spark operator. I'm not saying this is absolutely the right way to go for the longer term, but it could enable the first batch of evaluations on operator 0.1 while we work on the multi-submission-worker mode, which comes with its own challenges.
Given this, can we start with this PR for v0.1?

Thanks. Let me consider more.

Thank you for renaming the package and updating LICENSE.
dongjoon-hyun
left a comment
Apache Spark 4.0.0-preview will arrive next Monday. Let's resume this PR with that as the first Spark version until we have version-agnostic capability.
gradle.properties
lombokVersion=1.18.32

#Spark
scalaVersion=2.12
As you know, there is no Scala 2.12 build of Apache Spark 4.0.0-preview.
So, I guess we need to use Scala 2.13 only in the Spark Operator to handle both Scala 2.12 and 2.13 submissions.
yup! upgrading to 2.13 along with 4.0.0-preview1
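That would roughly mean the following in gradle.properties (a sketch; the exact preview artifact version string is an assumption until the release is published):

# Spark
scalaVersion=2.13
sparkVersion=4.0.0-preview1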
dependencies {
    implementation project(":spark-operator-api")

    implementation("org.apache.spark:spark-kubernetes_$scalaVersion:$sparkVersion")
Let me try to define the used classes as DeveloperApi for this work. That will protect spark-kubernetes-operator from any accidental changes in 4.0.0.
For now, Apache Spark guarantees only the following.
$ git grep '@DeveloperApi' resource-managers/kubernetes/
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala:@DeveloperApi
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkKubernetesClientFactory.scala:@DeveloperApi
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkPod.scala:@DeveloperApi
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/KubernetesDriverCustomFeatureConfigStep.scala:@DeveloperApi
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/KubernetesExecutorCustomFeatureConfigStep.scala:@DeveloperApi
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/KubernetesFeatureConfigStep.scala:@DeveloperApi
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/AbstractPodsAllocator.scala:@DeveloperApi
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSource.scala:@DeveloperApi
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala:@DeveloperApi
import org.apache.spark.deploy.k8s.submit.KubernetesClientUtils;
import org.apache.spark.deploy.k8s.submit.MainAppResource;

public class SparkAppDriverConf extends KubernetesDriverConf {
Since KubernetesDriverConf is a private class, there has always been the risk of a breaking change on KubernetesDriverConf, like the following.
(Resolved review threads on spark-submission-worker/src/main/java/org/apache/spark/k8s/operator/SparkAppDriverConf.java, SparkAppSubmissionWorker.java, and SparkAppResourceSpec.java.)
Container containerWithVolume =
    new ContainerBuilder(pod.container())
        .addNewEnv()
        .withName(org.apache.spark.deploy.k8s.Constants.ENV_SPARK_CONF_DIR())
Do you need a full package name when we have an import statement? Maybe the following?
- .withName(org.apache.spark.deploy.k8s.Constants.ENV_SPARK_CONF_DIR())
+ .withName(Constants.ENV_SPARK_CONF_DIR())

new ContainerBuilder(pod.container())
    .addNewEnv()
    .withName(org.apache.spark.deploy.k8s.Constants.ENV_SPARK_CONF_DIR())
    .withValue(org.apache.spark.deploy.k8s.Constants.SPARK_CONF_DIR_INTERNAL())
ditto.
    .withValue(org.apache.spark.deploy.k8s.Constants.SPARK_CONF_DIR_INTERNAL())
    .endEnv()
    .addNewVolumeMount()
    .withName(org.apache.spark.deploy.k8s.Constants.SPARK_CONF_VOLUME_DRIVER())
ditto.
    .endEnv()
    .addNewVolumeMount()
    .withName(org.apache.spark.deploy.k8s.Constants.SPARK_CONF_VOLUME_DRIVER())
    .withMountPath(org.apache.spark.deploy.k8s.Constants.SPARK_CONF_DIR_INTERNAL())
ditto.
Related Apache Spark commits that promote the APIs required by this PR to `DeveloperApi` for Apache Spark 4.0.0 (all authored by Dongjoon Hyun, tested by CIs, no user-facing change):
- [SPARK-48076] Promote `KubernetesVolumeUtils` to `DeveloperApi` (closes #46326). This API was added in Apache Spark 3.0.0 (#22959) and has been stable.
- [SPARK-48077] Promote `KubernetesClientUtils` to `DeveloperApi` (closes #46327).
- [SPARK-48078] Promote `org.apache.spark.deploy.k8s.Constants` to `DeveloperApi` (closes #46329).
- [SPARK-48080] Promote `*MainAppResource` and `NonJVMResource` to `DeveloperApi` (closes #46332). Since there are no changes after 3.0.0, these are defined as `Stable`.
build.gradle
repositories {
    mavenCentral()
    // This is a workaround to resolve Spark 4.0.0-preview-1
    // To be removed for official release
Apache Spark recommends using IDed TODOs. Please file a JIRA issue and reference it like the following.
- // To be removed for official release
+ // TODO(SPARK-XXXXX) Use Apache Spark 4.0.0-preview1 when it's ready
gradle.properties
commonsIOVersion=2.16.1
lombokVersion=1.18.32

#Spark
We need a space: #Spark -> # Spark.
    && applicationSpec.getExecutorSpec().getPodTemplateSpec() != null;
}

public static long getAttemptId(final SparkApplication app) {
Thank you for adding final.
    return appId();
}

public String configMapNameDriver() {
When we make a new K8s resource name, we should guarantee that it complies with the K8s naming limits.
Could you add a method description stating the range of string lengths this method's return value can have?
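For reference, a hedged sketch of what such a bound could look like: most K8s object names (including ConfigMaps) must be valid DNS subdomain names of at most 253 characters. The suffix and constant below are illustrative, not the actual operator code.

// Illustrative only: K8s object names such as ConfigMaps are DNS subdomain
// names, limited to 253 characters.
private static final int MAX_SUBDOMAIN_NAME_LENGTH = 253;

/**
 * Name of the ConfigMap holding the generated driver Spark conf files.
 * The returned string stays within the K8s name limit because the
 * resource-name prefix is itself length-limited.
 */
public String configMapNameDriver() {
  String name = resourceNamePrefix() + "-driver-conf-map"; // suffix is illustrative
  // Defensive truncation in case the prefix ever grows beyond expectations.
  return name.length() <= MAX_SUBDOMAIN_NAME_LENGTH
      ? name
      : name.substring(0, MAX_SUBDOMAIN_NAME_LENGTH);
}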
public class SparkAppSubmissionWorker {
    // Default length limit for generated app id. Generated id is used as resource-prefix when
    // user-provided id is too long for this purpose. This applied to all resources associated with
    // the Spark app (including k8s service which has different naming length limit). This we
This we?
thanks for the catch! fixed the typo.
  primaryResource = new RMainAppResource(applicationSpec.getSparkRFiles());
}
effectiveSparkConf.setIfMissing(
    "spark.master", "k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT");
To @jiangzho and @aaruna, as we know, Apache Spark's ExternalClusterManager allows a custom K8s-based external cluster manager. So, it would be great if the Spark K8s Operator had the capability from the beginning to allow a custom prefix in addition to k8s.
def canCreate(masterURL: String): Boolean
Yes - it should be possible to use a custom cluster manager by setting spark.master. When the master is not explicitly set in SparkConf, this would automatically generate the master URL based on the environment the operator is deployed in.
To make it more handy (e.g. still supporting automatic master URL generation, but with a different prefix), a new property has been introduced. Added Javadoc in getResourceSpec to clarify the behavior.
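For illustration, a hedged sketch of both paths (the prefix property name below is a placeholder, not the actual property introduced in this PR):

import org.apache.spark.SparkConf;

class MasterUrlExample {
  static SparkConf customMasterConf() {
    SparkConf conf = new SparkConf();
    // Option 1: set spark.master explicitly so a custom ExternalClusterManager
    // whose canCreate(masterURL) accepts this scheme takes over.
    conf.set("spark.master", "myk8s://https://my-apiserver:6443");
    // Option 2 (placeholder property name): keep the automatic in-cluster URL
    // generation but swap the scheme prefix, e.g.
    //   conf.set("<operator-master-url-prefix-property>", "myk8s");
    // which would yield myk8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT.
    return conf;
  }
}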
String preferredIdPrefix = preferredId.substring(0, preferredIdPrefixLength);
return generateHashBasedId(
    preferredIdPrefix,
    app.getMetadata().getNamespace(),
This line seems to have some assumption on K8s namespace length.
- Can we support the maximum K8s namespace length here?
- Please make a test case for generateSparkAppId for that maximum K8s namespace length.
The namespace name and app name are used as values to generate the hash, which has a final length limit applied. Added Javadoc and one more test case to clarify this behavior.
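A hedged sketch of such a test, assuming K8s namespaces are DNS labels of at most 63 characters and that generateHashBasedId takes a prefix plus varargs identifiers as used above (the limit constant and signature are assumptions):

import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class HashBasedIdLengthTest {
  // Namespace names are DNS labels, so 63 characters is the maximum.
  private static final String MAX_LENGTH_NAMESPACE = "n".repeat(63);

  @Test
  void generatedIdStaysWithinLimitForMaxLengthNamespace() {
    String id =
        SparkAppSubmissionWorker.generateHashBasedId(
            "spark-app", MAX_LENGTH_NAMESPACE, "my-app-name", "1");
    // The namespace only feeds the hash input, so the generated id must still
    // respect the overall length limit (63 here is illustrative).
    assertTrue(id.length() <= 63);
  }
}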
    sparkConf, appId, mock(JavaMainAppResource.class), "foo", null, Option.empty());
String resourcePrefix = sparkAppDriverConf.resourceNamePrefix();
Assertions.assertEquals(resourcePrefix, appId);
Assertions.assertTrue(sparkAppDriverConf.configMapNameDriver().contains(resourcePrefix));
Please add more test cases separately, at least for all public methods. For example, it would be great if configMapNameDriver had a new test case.
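For example, a hedged sketch of a separate configMapNameDriver test, mirroring the construction in the snippet above (the create(...) factory signature and the 253-character bound are assumptions):

import static org.junit.jupiter.api.Assertions.assertTrue;
import static org.mockito.Mockito.mock;

import org.apache.spark.SparkConf;
import org.apache.spark.deploy.k8s.submit.JavaMainAppResource;
import org.junit.jupiter.api.Test;
import scala.Option;

class ConfigMapNameDriverTest {
  @Test
  void configMapNameDriverRespectsK8sNameLimit() {
    SparkConf sparkConf = new SparkConf();
    String longAppId = "a".repeat(100); // deliberately long app id
    SparkAppDriverConf conf =
        SparkAppDriverConf.create(
            sparkConf, longAppId, mock(JavaMainAppResource.class), "foo", null, Option.empty());
    String name = conf.configMapNameDriver();
    assertTrue(name.contains(conf.resourceNamePrefix()));
    // ConfigMap names are DNS subdomain names, limited to 253 characters.
    assertTrue(name.length() <= 253);
  }
}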
SparkAppResourceSpec appResourceSpec = new SparkAppResourceSpec(mockConf, mockSpec);

Assertions.assertEquals(2, appResourceSpec.getDriverResources().size());
Shall we import Assertions.assertEquals?
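i.e., something like (sketch):

- import org.junit.jupiter.api.Assertions;
- Assertions.assertEquals(2, appResourceSpec.getDriverResources().size());
+ import static org.junit.jupiter.api.Assertions.assertEquals;
+ assertEquals(2, appResourceSpec.getDriverResources().size());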
appProps.put("spark.kubernetes.namespace", "ns2");
Map<String, String> overrides = new HashMap<>();
overrides.put("spark.executor.instances", "5");
overrides.put("spark.kubernetes.namespace", "ns3");
Please add new test coverage with a long namespace length.
    .withMountPath(Constants.SPARK_CONF_DIR_INTERNAL())
    .endVolumeMount()
    .build();
Pod podWithVolume =
It would be great to have a more specific name, like podWithConfigMapVolume.
    appId,
    primaryResource,
    applicationSpec.getMainClass(),
    applicationSpec.getDriverArgs().toArray(new String[0]),
Please define a final constant in order to reuse this new String[0].
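For example (the constant name is just illustrative):

// shared empty array instead of allocating a new one at every call site
private static final String[] EMPTY_STRING_ARRAY = new String[0];

// ...
applicationSpec.getDriverArgs().toArray(EMPTY_STRING_ARRAY),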
      prefix, DEFAULT_ENCODE_BASE, DEFAULT_HASH_BASED_IDENTIFIER_LENGTH_LIMIT, identifiers);
}

public static String generateHashBasedId(
Please add a method description because this is a public static method, especially about the guaranteed length of the returned string.
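A hedged sketch of what such a Javadoc could state, assuming the simpler overload takes a prefix plus varargs identifiers (the exact guaranteed length comes from the constants in the PR, not from this sketch):

/**
 * Builds a deterministic, hash-based identifier from the given prefix and identifiers.
 *
 * <p>The identifiers (e.g. namespace and app name) are hashed and encoded, and the
 * result is truncated so the returned string never exceeds the configured limit
 * (DEFAULT_HASH_BASED_IDENTIFIER_LENGTH_LIMIT), keeping it safe to use as a prefix
 * for K8s resource names.
 *
 * @param prefix      human-readable prefix kept at the start of the generated id
 * @param identifiers values fed into the hash
 * @return an id bounded by the configured length limit
 */
public static String generateHashBasedId(String prefix, String... identifiers)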
import org.apache.spark.k8s.operator.status.ApplicationStatus;
import org.apache.spark.k8s.operator.status.AttemptInfo;

class SparkAppSubmissionWorkerTest {
Please add more test coverage. For example, a corner case like generateHashBasedId whose input has very long identifiers.
dongjoon-hyun
left a comment
Hi, @jiangzho. I finished another round of reviews. Could you address them?
dongjoon-hyun
left a comment
+1, LGTM. Thank you for addressing all comments (with IDed TODOs).
Merged to main.
I wrote the current status summary here.
[SPARK-53874] Fix `SparkAppDriverConf` to respect `sparkVersion` of the `SparkApplication` CRD (closes #385). This is a long-standing bug from the initial implementation in this PR (#10): since the Apache Spark K8s Operator can launch various Spark versions, the `spark-version` label should come from the `SparkApplication` CRD's `sparkVersion` field, but the Spark version of the compile dependency was used for `Driver` resources (like the `Driver Pod` and `Driver Service`). This is a bug fix to use the correct version information.
What changes were proposed in this pull request?
This is a breakdown PR of #2 - adding a submission worker implementation for SparkApplication.
Why are the changes needed?
Spark Operator needs a submission worker to convert its abstraction (the SparkApplication API) into a k8s resource spec.
This is a light-weight implementation based on native k8s integration.
As of now, it's based on Spark 4.0.0-preview1, but it's expected to serve all Spark LTS versions. This is feasible because it covers only the spec generation; the Spark core jars are still brought in by the application images. E2E tests will be set up with the operator later to ensure that.
Per the SPIP doc, in future operator version(s) we may add more submission worker implementations based on different Spark versions to become 100% version-agnostic, at the cost of having multiple workers on standby. A sketch of how the operator might invoke this worker follows.
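For illustration, a hedged sketch of how a reconciler might drive this worker; the type names follow the discussion above (SparkAppSubmissionWorker, SparkAppResourceSpec, getResourceSpec, getDriverResources), but the exact signatures and packages are assumptions:

import java.util.Map;

import org.apache.spark.k8s.operator.SparkAppResourceSpec;
import org.apache.spark.k8s.operator.SparkAppSubmissionWorker;
import org.apache.spark.k8s.operator.SparkApplication;

class SubmissionExample {
  // Sketch: convert a SparkApplication custom resource into the k8s resources
  // (driver pod, config map, ...) that the operator then creates via fabric8.
  static void reconcile(SparkApplication app) {
    SparkAppSubmissionWorker worker = new SparkAppSubmissionWorker();
    // Assumed signature: the app plus per-submission config overrides.
    SparkAppResourceSpec spec = worker.getResourceSpec(app, Map.of());
    spec.getDriverResources()
        .forEach(
            resource -> {
              // hand each generated resource to the k8s client for creation
            });
  }
}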
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit test coverage.
Was this patch authored or co-authored using generative AI tooling?
no