You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[SPARK-40817][K8S][3.2] spark.files should preserve remote files
### What changes were proposed in this pull request?
Backport apache#38376 to `branch-3.2`
You can find a detailed description of the issue and an example reproduction on the Jira card: https://issues.apache.org/jira/browse/SPARK-40817
The idea for this fix is to update the logic which uploads user-specified files (via `spark.jars`, `spark.files`, etc) to `spark.kubernetes.file.upload.path`. After uploading local files, it used to overwrite the initial list of URIs passed by the user and it would thus erase all remote URIs which were specified there.
Small example of this behaviour:
1. User set the value of `spark.jars` to `s3a://some-bucket/my-application.jar,/tmp/some-local-jar.jar` when running `spark-submit` in cluster mode
2. `BasicDriverFeatureStep.getAdditionalPodSystemProperties()` gets called at one point while running `spark-submit`
3. This function would set `spark.jars` to a new value of `${SPARK_KUBERNETES_UPLOAD_PATH}/spark-upload-${RANDOM_STRING}/some-local-jar.jar`. Note that `s3a://some-bucket/my-application.jar` has been discarded.
With the logic proposed in this PR, the new value of `spark.jars` would be `s3a://some-bucket/my-application.jar,${SPARK_KUBERNETES_UPLOAD_PATH}/spark-upload-${RANDOM_STRING}/some-local-jar.jar`, so in other words we are making sure that remote URIs are no longer discarded.
### Why are the changes needed?
We encountered this issue in production when trying to launch Spark on Kubernetes jobs in cluster mode with a fix of local and remote dependencies.
### Does this PR introduce _any_ user-facing change?
Yes, see description of the new behaviour above.
### How was this patch tested?
- Added a unit test for the new behaviour
- Added an integration test for the new behaviour
- Tried this patch in our Kubernetes environment with `SparkPi`:
```
spark-submit \
--master k8s://https://$KUBERNETES_API_SERVER_URL:443 \
--deploy-mode cluster \
--name=spark-submit-test \
--class org.apache.spark.examples.SparkPi \
--conf spark.jars=/opt/my-local-jar.jar,s3a://$BUCKET_NAME/my-remote-jar.jar \
--conf spark.kubernetes.file.upload.path=s3a://$BUCKET_NAME/my-upload-path/ \
[...]
/opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar
```
Before applying the patch, `s3a://$BUCKET_NAME/my-remote-jar.jar` was discarded from the final value of `spark.jars`. After applying the patch and launching the job again, I confirmed that `s3a://$BUCKET_NAME/my-remote-jar.jar` was no longer discarded by looking at the Spark config for the running job.
Closesapache#39670 from antonipp/spark-40817-branch-3.2.
Authored-by: Anton Ippolitov <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Copy file name to clipboardExpand all lines: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala
+6-6Lines changed: 6 additions & 6 deletions
Original file line number
Diff line number
Diff line change
@@ -162,27 +162,27 @@ private[spark] class BasicDriverFeatureStep(conf: KubernetesDriverConf)
Copy file name to clipboardExpand all lines: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStepSuite.scala
Copy file name to clipboardExpand all lines: resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala
0 commit comments