[SPARK-23551][BUILD] Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce` #20704

dongjoon-hyun · 2018-03-01T18:01:37Z

What changes were proposed in this pull request?

This PR aims to prevent the orc-mapreduce dependency from confusing IDEs and Maven.

BEFORE
Please note the 2.6.4 version under Spark Project SQL.

$ mvn dependency:tree -Phadoop-2.7 -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]    \- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]       \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 ---
[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
[INFO]    \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile

AFTER

$ mvn dependency:tree -Phadoop-2.7 -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]    \- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]       \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 ---
[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]    \- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]       \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile

How was this patch tested?

1. Pass Jenkins, including dev/test-dependencies.sh, with the existing dependencies.
2. Manually run the following and observe the change.

mvn dependency:tree -Phadoop-2.7 -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
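To make the manual check less error-prone, the tree output can be scanned for the artifact's version per module. The following is a minimal sketch, not part of this patch: the function name `versions_by_module` and the two regular expressions are assumptions written against the output format shown above, and `SAMPLE` is an abbreviated copy of the BEFORE output.

```python
import re
from collections import defaultdict

# Module root line, e.g. "[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT"
ROOT = re.compile(r"^\[INFO\] [\w.\-]+:([\w.\-]+):jar:[\w.\-]+$")
# Dependency line: tree-drawing prefix, then group:artifact:jar[:classifier]:version:scope
DEP = re.compile(
    r"^\[INFO\] [ |+\\\-]+[\w.\-]+:([\w.\-]+):jar:(?:[\w.\-]+:)?([\w.\-]+):\w+$"
)


def versions_by_module(tree_output, artifact):
    """Return {module artifactId: set of versions of `artifact` seen in its tree}."""
    found = defaultdict(set)
    module = None
    for line in tree_output.splitlines():
        line = line.rstrip()
        root = ROOT.match(line)
        if root:
            module = root.group(1)  # start of a new module's tree
            continue
        dep = DEP.match(line)
        if dep and module and dep.group(1) == artifact:
            found[module].add(dep.group(2))
    return dict(found)


# Abbreviated BEFORE output from the description above.
SAMPLE = r"""
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]    \- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]       \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
[INFO]    \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile
"""

print(versions_by_module(SAMPLE, "hadoop-mapreduce-client-core"))
```

Any module whose version differs from the others (here spark-sql_2.11) is the one pulling the artifact in through a different path.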


SparkQA · 2018-03-01T21:20:14Z

Test build #87848 has finished for PR 20704 at commit dbb5ae5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-03-01T21:23:36Z

The failure is due to a flaky test case.

 org.apache.spark.sql.execution.streaming.RateSourceV2Suite.basic microbatch execution

dongjoon-hyun · 2018-03-01T21:23:43Z

Retest this please.

jerryshao · 2018-03-02T00:21:38Z

LGTM.

dongjoon-hyun · 2018-03-02T00:35:28Z

Thank you for the review, @jerryshao!

SparkQA · 2018-03-02T01:19:25Z

Test build #87854 has finished for PR 20704 at commit dbb5ae5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2018-03-02T01:25:16Z

Hmm, I guess it was just luck that this didn't trigger the deps check, since that jar is checked for a specific version (2.7.3 in the case of hadoop2.7).

LGTM, merging to master / 2.3.

Author: Dongjoon Hyun
Closes #20704 from dongjoon-hyun/SPARK-23551.
(cherry picked from commit 34811e0)
Signed-off-by: Marcelo Vanzin

dongjoon-hyun · 2018-03-02T01:41:42Z

Thank you for reviewing and merging, @vanzin.

We generated both spark-deps-hadoop-2.6 and spark-deps-hadoop-2.7 with the following.

./dev/test-dependencies.sh --replace-manifest

sbt and Maven choose the latest artifacts during the full build, so this issue doesn't affect the Apache Spark distribution.

vanzin · 2018-03-02T01:45:22Z

Yeah, I'm just wondering why that didn't happen in the dependency:tree output in your description. Anyway, not really important to figure that out.

steveloughran · 2018-03-02T15:11:55Z

It kicks in downstream depending on the order of imports; Maven resolves closest-first in the dependency graph. If you explicitly add hadoop-client at the top of your deps, then everything gets reconciled consistently.
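The closest-first rule can be illustrated with a small breadth-first walk. This is a simplified sketch of nearest-wins mediation, not Maven's actual resolver: the function name `resolve_nearest_wins`, the `app` project, and the declaration order inside each list are hypothetical, while the artifact versions are taken from the trees in the description.

```python
from collections import deque


def resolve_nearest_wins(root, graph):
    """Pick one version per artifact: the shallowest wins, ties go to the
    dependency declared first. `graph` maps (artifact, version) to its
    direct dependencies in declaration order."""
    chosen = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for artifact, version in graph.get(node, []):
            if artifact in chosen:
                continue  # a nearer or earlier-declared version already won
            chosen[artifact] = version
            queue.append((artifact, version))
    return chosen


# Hypothetical downstream project that depends only on spark-sql.
GRAPH = {
    ("app", "1.0"): [("spark-sql", "2.3.0")],
    ("spark-sql", "2.3.0"): [("orc-mapreduce", "1.4.3"), ("spark-core", "2.3.0")],
    ("orc-mapreduce", "1.4.3"): [("hadoop-mapreduce-client-core", "2.6.4")],
    ("spark-core", "2.3.0"): [("hadoop-client", "2.7.3")],
    ("hadoop-client", "2.7.3"): [("hadoop-mapreduce-client-core", "2.7.3")],
}

# Same project with hadoop-client declared explicitly at the top.
GRAPH_PINNED = dict(GRAPH)
GRAPH_PINNED[("app", "1.0")] = [("hadoop-client", "2.7.3"), ("spark-sql", "2.3.0")]

print(resolve_nearest_wins(("app", "1.0"), GRAPH)["hadoop-mapreduce-client-core"])
print(resolve_nearest_wins(("app", "1.0"), GRAPH_PINNED)["hadoop-mapreduce-client-core"])
```

In the first graph the copy reached through orc-mapreduce sits one level nearer than the one reached through spark-core and hadoop-client, so it wins; declaring hadoop-client at the top moves the 2.7.3 copy nearest.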

megaserg · 2018-04-14T02:50:53Z

Thank you @dongjoon-hyun! This was also affecting our Spark job performance!

We're using mapreduce.fileoutputcommitter.algorithm.version=2 in our Spark job config, as recommended e.g. here: http://spark.apache.org/docs/latest/cloud-integration.html. We're using user-provided Hadoop 2.9.0.

However, since this 2.6.5 JAR was in spark/jars, it was given priority in the classpath over the Hadoop-distributed 2.9.0 JAR. The 2.6.5 JAR silently ignored the mapreduce.fileoutputcommitter.algorithm.version setting and used the default, slow algorithm (I believe hadoop-mapreduce-client-core only had one, slow, algorithm until 2.7.0).

I believe this affects everyone who uses any mapreduce settings with Spark 2.3.0. Great job!

Can we double-check that this JAR is not present in the "without-hadoop" Spark distribution anymore?
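One way to run such a check against an unpacked distribution is to scan its jars directory by file name. This is only a sketch: `find_artifact_jars` is a hypothetical helper, and it assumes jars follow the usual `<artifact>-<version>.jar` naming.

```python
import re
from pathlib import Path


def find_artifact_jars(jars_dir, artifact):
    """Return (file name, version) pairs for jars named <artifact>-<version>.jar."""
    # The version must start with a digit, so "<artifact>-tests-..." is not matched.
    pattern = re.compile(re.escape(artifact) + r"-(\d[\w.\-]*)\.jar")
    hits = []
    for path in sorted(Path(jars_dir).glob("*.jar")):
        match = pattern.fullmatch(path.name)
        if match:
            hits.append((path.name, match.group(1)))
    return hits


if __name__ == "__main__":
    import sys

    # Usage (path is an example): python check_jars.py /path/to/spark/jars
    for name, version in find_artifact_jars(sys.argv[1], "hadoop-mapreduce-client-core"):
        print(f"{name} (version {version})")
```

An empty result for the "without-hadoop" build would indicate that the JAR is no longer bundled.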

steveloughran · 2018-04-16T09:25:49Z

@megaserg: if you are writing to GCS or Azure, algorithm 2 is fine. If S3 is the target, then it's only safe to use with a consistent store (Hadoop 3.0 + S3Guard, Amazon Consistent EMR), and you still take a major perf hit from that copy. The S3A committers in Hadoop 3.1 deliver high-performance commit semantics, and the Netflix committers don't (directly) need a consistent store, though you will need one to chain work together.

BTW, to verify that the v2 algorithm is actually being picked up: set the version to 3 and expect a stack trace from the version switch code. It's what I do to make sure that the FileOutputCommitter isn't actually being picked up.


[SPARK-23551][BUILD] Exclude hadoop-mapreduce-client-core dependency from `orc-mapreduce` (dbb5ae5)

asfgit closed this in 34811e0 Mar 2, 2018

dongjoon-hyun deleted the SPARK-23551 branch March 2, 2018 01:41
