Skip to content

Conversation

@xkrogen
Copy link
Contributor

@xkrogen xkrogen commented Jun 7, 2021

What changes were proposed in this pull request?

Refactor the logic for constructing the user classpath from yarn.ApplicationMaster into yarn.Client so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, getUserClassPath, is added to CoarseGrainedExecutorBackend which defaults to Nil (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). YarnCoarseGrainedExecutorBackend overrides this to construct the user classpath from the existing APP_JAR and SECONDARY_JARS configs.

Why are the changes needed?

User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in ExecutorRunnable), which then passes the information to the executors (CoarseGrainedExecutorBackend) by specifying each JAR on the executor command line as --user-class-path /path/to/myjar.jar. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

/bin/bash: Argument list too long

A Google search indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

Does this PR introduce any user-facing change?

No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before.

How was this patch tested?

New unit tests were added in YarnClusterSuite. Also, we have been running a similar fix internally for 4 months with great success.

@xkrogen
Copy link
Contributor Author

xkrogen commented Jun 7, 2021

cc @mridulm @tgravescs @HeartSaVioR

@SparkQA
Copy link

SparkQA commented Jun 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43955/

@SparkQA
Copy link

SparkQA commented Jun 7, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43955/

@SparkQA
Copy link

SparkQA commented Jun 7, 2021

Test build #139432 has finished for PR 32810 at commit fdf6b8d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xkrogen
Copy link
Contributor Author

xkrogen commented Jun 7, 2021

Pushed up a new commit addressing test failures -- there was a difference between 2.3 and master that I didn't account for.

That also made me realize that due to the changes in YarnCoarseGrainedExecutorBackend from #23706, this can probably be done more cleanly without adding a new configuration. I'll investigate and probably put up a new commit tomorrow.

@SparkQA
Copy link

SparkQA commented Jun 7, 2021

Test build #139435 has finished for PR 32810 at commit 10d1764.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 8, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43958/

@xkrogen
Copy link
Contributor Author

xkrogen commented Jun 8, 2021

Pushed up a new commit with a simplified approach that doesn't involve a new internal configuration. Updated the description as well. Should be ready for review now.

@SparkQA
Copy link

SparkQA commented Jun 8, 2021

Test build #139502 has finished for PR 32810 at commit 00518ab.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44027/

@SparkQA
Copy link

SparkQA commented Jun 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44027/

@xkrogen
Copy link
Contributor Author

xkrogen commented Jun 8, 2021

The Jenkins build failure is some sort of compilation issue within the catalyst module, doesn't seem related AFAICT. Similar for the GA build, it is failing on the Python linter with some complaints in pandas module.

cc @dongjoon-hyun @holdenk as well in case you have any commentary from the k8s side. I couldn't find any related code in any of the other resource manager modules, but any input would still be welcomed in case k8s suffers from a similar problem, or has already solved this in a different way.

@xkrogen
Copy link
Contributor Author

xkrogen commented Jun 11, 2021

I realized that the existing logic in my PR, which was copied from the ApplicationMaster/driver, wouldn't properly handle local paths which used the GATEWAY_ROOT_PATH / REPLACEMENT_ROOT_PATH mechanism to use a different path on- vs. off-cluster. I guess this was previously a bug in the driver code, but which would only manifest itself with the combination of cluster mode and local URIs leveraging the gateway/replacement paths.

New code follows the strategy used by the old code in ExecutorRunnable to perform local URI replacements as necessary, and since this code is shared between the driver and executors, it fixes the bug discussed above. I also make use of Java NIO APIs for performing the relative-to-absolute and path-to-URL conversions, instead of relying on the Java File API in combination with manual string manipulation to add a file: prefix.

I added more tests in both ClientSuite and YarnClusterSuite for the various options, and also tested running a real job on a real YARN cluster which made use of:

  • local URI requiring gateway/replacement configs
  • local URI not requiring gateway/replacement configs
  • a relative file in the local working dir
  • a file on shared storage using hdfs scheme, for good measure

Everything works as expected with the latest diff (only the 2nd through 4th would succeed with the previous).

@SparkQA
Copy link

SparkQA commented Jun 12, 2021

Test build #139719 has finished for PR 32810 at commit f780111.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44244/

@SparkQA
Copy link

SparkQA commented Jun 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44244/

@xkrogen
Copy link
Contributor Author

xkrogen commented Jun 16, 2021

Gentle ping @tgravescs if you have a chance to look at the latest diff

@xkrogen xkrogen force-pushed the xkrogen-SPARK-35672-classpath-scalable branch from f780111 to 335256b Compare June 18, 2021 16:36
@xkrogen
Copy link
Contributor Author

xkrogen commented Jun 18, 2021

Pushed up a new version which expands test cases, extracts some shared constants in the test, and simplifies the logic in getUserClasspathUrls and makes the assumptions more clear via an assert.

@SparkQA
Copy link

SparkQA commented Jun 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44523/

@SparkQA
Copy link

SparkQA commented Jun 18, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44523/

@SparkQA
Copy link

SparkQA commented Jun 18, 2021

Test build #139997 has finished for PR 32810 at commit 335256b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor

test this please

@tgravescs
Copy link
Contributor

@xkrogen can you retriever your GitHub action to test this?

@SparkQA
Copy link

SparkQA commented Jun 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44608/

@xkrogen
Copy link
Contributor Author

xkrogen commented Jun 24, 2021

Ping @mridulm as well in case you're interested in looking -- since the code has changed quite a bit since you last reviewed.

@tgravescs
Copy link
Contributor

@xkrogen please look at the test failure:
[info] - plugin initialization in non-local mode with resources *** FAILED *** (11 seconds, 97 milliseconds)
14319

This could very well be from this change is the class path is not being setup properly to pick up the plugin class

xkrogen added 6 commits June 24, 2021 13:05
…fig instead of command line.

User-provided JARs are make available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently, this list of JARs is crafted by the Driver, which then passes the information to the executors by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded (see the JIRA for more details/examples). Instead, we can have the YARN `Client` create the list and put it into the configs, which get written to a file and distributed via the YARN distributed cache. The executor can load this from its configs. This bypasses the command line and uses a more scalable approach for passing the list of JARs.
…isting SECONDARY_JARS and APP_JAR configs to construct the classpath using the same (now shared) logic as the ApplicationMaster uses, instead of adding an additional config to pass the info.
…s which leverage the gateway/replacment path functionality. Enhance test cases in `YarnClusterSuite` for this case, and add tests in `ClientSuite` for the logic.
…n getUserClasspathUrls and make the assumptions more clear via an assert
@xkrogen xkrogen force-pushed the xkrogen-SPARK-35672-classpath-scalable branch from 31502e3 to 7eac14a Compare June 24, 2021 20:37
@xkrogen
Copy link
Contributor Author

xkrogen commented Jun 24, 2021

Thanks @tgravescs , that's a good point. I was unable to reproduce the test failure locally, and the other plugin-related tests are fine. Re-kicking the test builds again to be sure.

Test failure is coming from here:

      eventually(timeout(10.seconds), interval(100.millis)) {
        children = dir.listFiles()
        assert(children != null)
        // we have 2 discovery scripts and then expect 1 driver and 1 executor file
        assert(children.length >= 4)
      }
      val execFiles =
        children.filter(_.getName.startsWith(NonLocalModeSparkPlugin.executorFileStr))
      assert(execFiles.size === 1)
      val allLines = Files.readLines(execFiles(0), StandardCharsets.UTF_8)
      assert(allLines.size === 1) //    <------------------------------------------------------- fails here

Looks to me like the issue is that the eventually is waiting for all of the files to be present (children.length >= 4), but we don't wait for the content to be ready. Instead, we assume that as soon as the file exists, it's safe to read the content, which may not be the case (if the file has been created but not yet written to). I can create another JIRA to fix this if you agree.

@SparkQA
Copy link

SparkQA commented Jun 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44813/

@SparkQA
Copy link

SparkQA commented Jun 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44813/

@SparkQA
Copy link

SparkQA commented Jun 24, 2021

Test build #140282 has finished for PR 32810 at commit 7eac14a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor

looks like the tests passed in the normal QA build and different tests failed this run, so I'll merge

@asfgit asfgit closed this in 866df69 Jun 25, 2021
@tgravescs
Copy link
Contributor

merged to master, it wasn't a clean merge to branch-3.1, were you wanting to get it into that, if so could you put up separate pr?

@xkrogen xkrogen deleted the xkrogen-SPARK-35672-classpath-scalable branch June 25, 2021 15:37
xkrogen added a commit to xkrogen/spark that referenced this pull request Jun 25, 2021
…ing config instead of command line

Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs.

User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

> /bin/bash: Argument list too long

A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before.

New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success.

Closes apache#32810 from xkrogen/xkrogen-SPARK-35672-classpath-scalable.

Authored-by: Erik Krogen <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>

(cherry picked from commit 866df69)
@xkrogen
Copy link
Contributor Author

xkrogen commented Jun 25, 2021

Thank you @tgravescs ! Your comments along the way were much appreciated.

I put up a branch-3.1 backport at #33090.

HyukjinKwon pushed a commit that referenced this pull request Jun 28, 2021
…rs using config instead of command line

### What changes were proposed in this pull request?
Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs.

### Why are the changes needed?
User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

> /bin/bash: Argument list too long

A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

### Does this PR introduce _any_ user-facing change?
No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before.

### How was this patch tested?
New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success.

Note that this is a backport of #32810 with minor conflicts around imports.

Closes #33090 from xkrogen/xkrogen-SPARK-35672-classpath-scalable-branch-3.1.

Authored-by: Erik Krogen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…rs using config instead of command line

### What changes were proposed in this pull request?
Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs.

### Why are the changes needed?
User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

> /bin/bash: Argument list too long

A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

### Does this PR introduce _any_ user-facing change?
No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before.

### How was this patch tested?
New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success.

Note that this is a backport of apache#32810 with minor conflicts around imports.

Closes apache#33090 from xkrogen/xkrogen-SPARK-35672-classpath-scalable-branch-3.1.

Authored-by: Erik Krogen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit b4916d4)
xkrogen added a commit to xkrogen/spark that referenced this pull request Nov 10, 2021
…ing config instead of command line

### What changes were proposed in this pull request?
Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs.

### Why are the changes needed?
User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

> /bin/bash: Argument list too long

A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

### Does this PR introduce _any_ user-facing change?
No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before.

### How was this patch tested?
New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success.

Closes apache#32810 from xkrogen/xkrogen-SPARK-35672-classpath-scalable.

Authored-by: Erik Krogen <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
attilapiros pushed a commit that referenced this pull request Nov 16, 2021
…ing config instead of command line

### What changes were proposed in this pull request?
Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs. Within `yarn.Client`, environment variables in the configured paths are resolved before constructing the classpath.

Please note that this is a re-submission of #32810, which was reverted in #34082 due to the issues described in [this comment](https://issues.apache.org/jira/browse/SPARK-35672?focusedCommentId=17419285&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17419285). This PR additionally includes the changes described in #34084 to resolve the issue, though this PR has been enhanced to properly handle escape strings, unlike #34084.

### Why are the changes needed?
User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

> /bin/bash: Argument list too long

A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

### Does this PR introduce _any_ user-facing change?
There is one small behavioral change which is a bug fix. Previously the `spark.yarn.config.gatewayPath` and `spark.yarn.config.replacementPath` options were only applied to executors, meaning they would not work for the driver when running in cluster mode. This appears to be a bug; the [documentation for this functionality](https://spark.apache.org/docs/latest/running-on-yarn.html) does not mention any limitations that this is only for executors. This PR fixes that issue.

Additionally, this fixes the main bash argument length issue, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before, and substitution of environment variables in `spark.jars` or `spark.yarn.config.replacementPath` works as expected.

### How was this patch tested?
New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success.

Closes #34120 from xkrogen/xkrogen-SPARK-35672-yarn-classpath-list-take2.

Authored-by: Erik Krogen <[email protected]>
Signed-off-by: attilapiros <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants