[SPARK-35672][CORE][YARN] Pass user classpath entries to executors using config instead of command line. #32810
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #139432 has finished for PR 32810 at commit
|
|
Pushed up a new commit addressing test failures -- there was a difference between 2.3 and master that I didn't account for. That also made me realize that due to the changes in |
|
Test build #139435 has finished for PR 32810 at commit
|
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
|
Pushed up a new commit with a simplified approach that doesn't involve a new internal configuration. Updated the description as well. Should be ready for review now. |
|
Test build #139502 has finished for PR 32810 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
The Jenkins build failure is some sort of compilation issue within the cc @dongjoon-hyun @holdenk as well in case you have any commentary from the k8s side. I couldn't find any related code in any of the other resource manager modules, but any input would still be welcomed in case k8s suffers from a similar problem, or has already solved this in a different way. |
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
|
I realized that the existing logic in my PR, which was copied from the New code follows the strategy used by the old code in I added more tests in both
Everything works as expected with the latest diff (only the 2nd through 4th would succeed with the previous). |
|
Test build #139719 has finished for PR 32810 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Gentle ping @tgravescs if you have a chance to look at the latest diff |
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
f780111 to 335256b
|
Pushed up a new version which expands test cases, extracts some shared constants in the test, and simplifies the logic in getUserClasspathUrls and makes the assumptions more clear via an |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #139997 has finished for PR 32810 at commit
|
|
test this please |
|
@xkrogen can you retrigger your GitHub Actions run to test this? |
|
Kubernetes integration test starting |
|
Ping @mridulm as well in case you're interested in looking -- since the code has changed quite a bit since you last reviewed. |
|
@xkrogen please look at the test failure: This could very well be from this change if the class path is not being set up properly to pick up the plugin class |
…fig instead of command line. User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently, this list of JARs is crafted by the Driver, which then passes the information to the executors by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded (see the JIRA for more details/examples). Instead, we can have the YARN `Client` create the list and put it into the configs, which get written to a file and distributed via the YARN distributed cache. The executor can load this from its configs. This bypasses the command line and uses a more scalable approach for passing the list of JARs.
…isting SECONDARY_JARS and APP_JAR configs to construct the classpath using the same (now shared) logic as the ApplicationMaster uses, instead of adding an additional config to pass the info.
…s which leverage the gateway/replacement path functionality. Enhance test cases in `YarnClusterSuite` for this case, and add tests in `ClientSuite` for the logic.
…n getUserClasspathUrls and make the assumptions more clear via an assert
31502e3 to 7eac14a
|
Thanks @tgravescs , that's a good point. I was unable to reproduce the test failure locally, and the other plugin-related tests are fine. Re-kicking the test builds again to be sure. Test failure is coming from here: Looks to me like the issue is that the |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #140282 has finished for PR 32810 at commit
|
|
looks like the tests passed in the normal QA build and different tests failed this run, so I'll merge |
|
Merged to master. It wasn't a clean merge to branch-3.1; were you wanting to get it into that? If so, could you put up a separate PR? |
…ing config instead of command line Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs. User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message: > /bin/bash: Argument list too long A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue. No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before. New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success. 
Closes apache#32810 from xkrogen/xkrogen-SPARK-35672-classpath-scalable. Authored-by: Erik Krogen <[email protected]> Signed-off-by: Thomas Graves <[email protected]> (cherry picked from commit 866df69)
|
Thank you @tgravescs ! Your comments along the way were much appreciated. I put up a |
…rs using config instead of command line ### What changes were proposed in this pull request? Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs. ### Why are the changes needed? User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message: > /bin/bash: Argument list too long A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue. ### Does this PR introduce _any_ user-facing change? No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before. 
### How was this patch tested? New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success. Note that this is a backport of #32810 with minor conflicts around imports. Closes #33090 from xkrogen/xkrogen-SPARK-35672-classpath-scalable-branch-3.1. Authored-by: Erik Krogen <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…rs using config instead of command line ### What changes were proposed in this pull request? Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs. ### Why are the changes needed? User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message: > /bin/bash: Argument list too long A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue. ### Does this PR introduce _any_ user-facing change? No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before. 
### How was this patch tested? New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success. Note that this is a backport of apache#32810 with minor conflicts around imports. Closes apache#33090 from xkrogen/xkrogen-SPARK-35672-classpath-scalable-branch-3.1. Authored-by: Erik Krogen <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit b4916d4)
…ing config instead of command line ### What changes were proposed in this pull request? Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs. Within `yarn.Client`, environment variables in the configured paths are resolved before constructing the classpath. Please note that this is a re-submission of #32810, which was reverted in #34082 due to the issues described in [this comment](https://issues.apache.org/jira/browse/SPARK-35672?focusedCommentId=17419285&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17419285). This PR additionally includes the changes described in #34084 to resolve the issue, though this PR has been enhanced to properly handle escape strings, unlike #34084. ### Why are the changes needed? User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. 
This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message: > /bin/bash: Argument list too long A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue. ### Does this PR introduce _any_ user-facing change? There is one small behavioral change which is a bug fix. Previously the `spark.yarn.config.gatewayPath` and `spark.yarn.config.replacementPath` options were only applied to executors, meaning they would not work for the driver when running in cluster mode. This appears to be a bug; the [documentation for this functionality](https://spark.apache.org/docs/latest/running-on-yarn.html) does not mention any limitations that this is only for executors. This PR fixes that issue. Additionally, this fixes the main bash argument length issue, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before, and substitution of environment variables in `spark.jars` or `spark.yarn.config.replacementPath` works as expected. ### How was this patch tested? New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success. Closes #34120 from xkrogen/xkrogen-SPARK-35672-yarn-classpath-list-take2. Authored-by: Erik Krogen <[email protected]> Signed-off-by: attilapiros <[email protected]>
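The gateway/replacement path substitution and environment variable resolution mentioned in this commit message can be pictured with a small standalone sketch. This is not the code in `yarn.Client`: the object name `GatewayPathSketch`, both method names, and the `$NAME` reference syntax handled by the regex are assumptions made purely for illustration, and no Spark APIs are used.

```scala
import java.util.regex.Matcher

// Hypothetical helper, for illustration only; not part of Spark.
object GatewayPathSketch {
  // Swap the gateway-host path for the cluster-side path, but only when
  // both settings are present (assumed behaviour for this sketch).
  def toClusterPath(
      path: String,
      gatewayPath: Option[String],
      replacementPath: Option[String]): String =
    (gatewayPath, replacementPath) match {
      case (Some(gateway), Some(replacement)) => path.replace(gateway, replacement)
      case _ => path
    }

  // Matches references such as $HADOOP_HOME (syntax assumed for this sketch).
  private val EnvRef = """\$([A-Za-z_][A-Za-z0-9_]*)""".r

  // Resolve each reference against the supplied environment map.
  // Unknown variables are left untouched.
  def resolveEnv(path: String, env: Map[String, String]): String =
    EnvRef.replaceAllIn(
      path,
      m => Matcher.quoteReplacement(env.getOrElse(m.group(1), m.matched)))
}
```

With a gateway path of `/disk1/hadoop` and a replacement path of `$HADOOP_HOME`, a JAR under `/disk1/hadoop/lib` is first rewritten to reference the variable and then resolved against whatever environment the container provides.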
What changes were proposed in this pull request?
Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs.
Why are the changes needed?
User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:
> /bin/bash: Argument list too long
A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.
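To make the contrast between the two transport mechanisms concrete, here is a minimal standalone sketch. It is not Spark's implementation: the object `UserClasspathSketch`, its methods, and the key `example.user.jars` are hypothetical names chosen for illustration, and a plain `Map` stands in for the real configuration. The only library class it relies on is the JDK's `java.net.URLClassLoader`.

```scala
import java.io.File
import java.net.{URL, URLClassLoader}

// Hypothetical helper, for illustration only; not part of Spark.
object UserClasspathSketch {
  // Made-up key; not a real Spark configuration name.
  val JarsKey = "example.user.jars"

  // Old style: one "--user-class-path <jar>" pair per JAR, so the argument
  // list grows with every JAR that is added.
  def asCommandLineArgs(jars: Seq[String]): Seq[String] =
    jars.flatMap(jar => Seq("--user-class-path", jar))

  // New style: the whole list is a single config value, which can be shipped
  // in a file rather than on the command line.
  def asConfig(jars: Seq[String]): Map[String, String] =
    Map(JarsKey -> jars.mkString(","))

  // Executor side: read the list back and turn each path into a URL.
  def urlsFromConfig(conf: Map[String, String]): Seq[URL] =
    conf.get(JarsKey).toSeq
      .flatMap(_.split(","))
      .filter(_.nonEmpty)
      .map(path => new File(path).toURI.toURL)

  // Build the custom classloader out of the URLs.
  def classLoaderFor(urls: Seq[URL], parent: ClassLoader): URLClassLoader =
    new URLClassLoader(urls.toArray, parent)
}
```

The command-line form adds two arguments per JAR, while the config form always stays a single entry, which is why the second approach is not bounded by the OS argument length limit.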
Does this PR introduce any user-facing change?
No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before.
How was this patch tested?
New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success.
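The `getUserClassPath` default-and-override arrangement described in this PR can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not the actual Spark classes: `SketchConf`, `ExecutorBackendSketch`, and `YarnExecutorBackendSketch` are hypothetical stand-ins, and the real code reads the `APP_JAR` and `SECONDARY_JARS` entries from the Spark configuration rather than from a case class.

```scala
import java.io.File
import java.net.URL

// Stand-in for the two config entries named in the description (hypothetical type).
final case class SketchConf(appJar: Option[String], secondaryJars: Seq[String])

// Base backend: no user classpath unless a resource manager supplies one.
class ExecutorBackendSketch(val conf: SketchConf) {
  def getUserClassPath: Seq[URL] = Nil
}

// YARN-flavoured backend: derive the classpath from config on the executor
// side instead of receiving it through command-line arguments.
class YarnExecutorBackendSketch(c: SketchConf) extends ExecutorBackendSketch(c) {
  override def getUserClassPath: Seq[URL] =
    (conf.appJar.toSeq ++ conf.secondaryJars).map(p => new File(p).toURI.toURL)
}
```

Keeping the default as `Nil` in the base class means other resource managers are unaffected, which matches the description's note that they do not configure the user classpath.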