[SPARK-19501][YARN] Reduce the number of HDFS RPCs during YARN deployment #16916
Conversation
Force-pushed from 984d818 to dac61b5.
ok to test
Test build #72828 has finished for PR 16916 at commit
Test build #72832 has finished for PR 16916 at commit
Is it possible to merge the stat cache and the symlink cache? It seems both are sort of doing the same thing.
@vanzin
I see. Makes sense. Pinging @tgravescs too; he wrote the original code that introduced symlink resolution, so maybe he can comment on whether it's really necessary. If it is, the patch looks good to me; otherwise we could get rid of it.
We should not remove symlink resolution.
Let me know if any further modification is needed before committing this to master. Once it is merged, I will prepare the backports to branch-2.0 and branch-2.1. Thanks.
Merging to master.
Merged to branch-2.1 and branch-2.0 too (after running yarn tests).
Cool! Thank you :)
What changes were proposed in this pull request?

As discussed in [JIRA](https://issues.apache.org/jira/browse/SPARK-19501), this patch addresses the problem where too many HDFS RPCs are made when there are many URIs specified in `spark.yarn.jars`, potentially adding hundreds of RTTs before the application launches. This becomes significant when submitting the application to a non-local YARN cluster (where the RTT may be on the order of 100 ms, for example). For each URI specified, the current implementation makes at least two HDFS RPCs, for:

- [Calling `getFileStatus()` before uploading each file to the distributed cache in `ClientDistributedCacheManager.addResource()`](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala#L71).
- [Resolving any symbolic links in each file URI](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L377-L379), which repeatedly makes HDFS RPCs until all symlinks are resolved (see [`FileContext.resolve(Path)`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileContext.java#L2189-L2195), [`FSLinkResolver.resolve(FileContext, Path)`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSLinkResolver.java#L79-L112), and [`AbstractFileSystem.resolvePath()`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/AbstractFileSystem.java#L464-L468)).

The first `getFileStatus` RPC can be removed by using the `statCache` populated with the file statuses retrieved by [the previous `globStatus` call](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L531). The second can be largely reduced by caching the symlink resolution results in a `mutable.HashMap`. This patch adds such a local variable in `yarn.Client.prepareLocalResources()` and passes it as an additional parameter to `yarn.Client.copyFileToRemote`. The symlink resolution code was added in 2013 (commit a35472e) and has not changed since. I am assuming it is still required; otherwise we could remove the `symlinkCache` and symlink resolution altogether.
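For illustration only, here is a minimal, self-contained Scala sketch of the two caching ideas; it is not the actual patch. The object and helper names (`RpcCachingSketch`, `cacheGlobStatuses`, `resolveWithCache`) are hypothetical, and only the Hadoop calls (`FileSystem.globStatus`, `FileContext.resolvePath`) correspond to the APIs discussed above.

```scala
import java.net.URI

import scala.collection.mutable

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileContext, FileStatus, FileSystem, Path}

object RpcCachingSketch {

  // Expand a glob pattern with a single globStatus() call and remember each
  // FileStatus, so that later distributed-cache registration can look the
  // status up locally instead of issuing a per-file getFileStatus() RPC.
  def cacheGlobStatuses(
      fs: FileSystem,
      pattern: Path,
      statCache: mutable.Map[URI, FileStatus]): Seq[FileStatus] = {
    // globStatus() may return null when nothing matches the pattern.
    val matched = Option(fs.globStatus(pattern)).getOrElse(Array.empty[FileStatus])
    matched.foreach(status => statCache(status.getPath.toUri) = status)
    matched.toSeq
  }

  // Resolve symlinks in a path, memoizing the result per URI so that repeated
  // uploads against the same destination pay the resolution RPCs only once.
  def resolveWithCache(
      path: Path,
      hadoopConf: Configuration,
      symlinkCache: mutable.Map[URI, Path]): Path = {
    symlinkCache.getOrElseUpdate(path.toUri,
      // resolvePath() follows symlinks and may issue several HDFS RPCs;
      // this cost is incurred only on a cache miss.
      FileContext.getFileContext(path.toUri, hadoopConf).resolvePath(path))
  }
}
```

Under this scheme, uploading many files to the same destination resolves symlinks at most once per distinct URI, and the per-file `getFileStatus()` RPC is skipped whenever the status is already in the cache.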
How was this patch tested?

This patch is based off 8e8afb3, currently the latest YARN patch on master. All tests except a few in spark-hive passed with `./dev/run-tests` on my machine, using JDK 1.8.0_112 on macOS 10.12.3. I also tested this modified version of Spark 2.2.0-SNAPSHOT myself; it performed a normal deployment and execution on a YARN cluster without errors.