@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Mar 1, 2018

What changes were proposed in this pull request?

This PR aims to prevent the orc-mapreduce dependency from confusing IDEs and Maven: it pulls in an older hadoop-mapreduce-client-core than the one the selected Hadoop profile provides.

BEFORE
Note the 2.6.4 version under Spark Project SQL, which differs from the 2.7.3 resolved for Spark Project Catalyst.

$ mvn dependency:tree -Phadoop-2.7 -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]    \- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]       \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 ---
[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
[INFO]    \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile

AFTER

$ mvn dependency:tree -Phadoop-2.7 -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]    \- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]       \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 ---
[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]    \- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]       \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile

How was this patch tested?

  1. Pass Jenkins, which runs dev/test-dependencies.sh against the existing dependency manifests.
  2. Manually run the following command and compare the output with the trees above.
mvn dependency:tree -Phadoop-2.7 -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core

@SparkQA

SparkQA commented Mar 1, 2018

Test build #87848 has finished for PR 20704 at commit dbb5ae5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Mar 1, 2018

The failure is due to a flaky test case.

 org.apache.spark.sql.execution.streaming.RateSourceV2Suite.basic microbatch execution

@dongjoon-hyun
Member Author

Retest this please.

@jerryshao
Contributor

LGTM.

@dongjoon-hyun
Member Author

Thank you for the review, @jerryshao!

@SparkQA

SparkQA commented Mar 2, 2018

Test build #87854 has finished for PR 20704 at commit dbb5ae5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Mar 2, 2018

Hmm, I guess it was just luck that this didn't trigger the deps check, since that jar is checked for a specific version (2.7.3 in the case of hadoop2.7).

LGTM, merging to master / 2.3.

asfgit pushed a commit that referenced this pull request Mar 2, 2018
…y from `orc-mapreduce`

Author: Dongjoon Hyun <[email protected]>

Closes #20704 from dongjoon-hyun/SPARK-23551.

(cherry picked from commit 34811e0)
Signed-off-by: Marcelo Vanzin <[email protected]>
@asfgit asfgit closed this in 34811e0 Mar 2, 2018
@dongjoon-hyun
Member Author

Thank you for reviewing and merging, @vanzin.

We generated both spark-deps-hadoop-2.6 and spark-deps-hadoop-2.7 with the following command.

./dev/test-dependencies.sh --replace-manifest

sbt and Maven choose the latest artifacts during the full build, so this issue doesn't affect the Apache Spark distribution.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-23551 branch March 2, 2018 01:41
@vanzin
Contributor

vanzin commented Mar 2, 2018

Yeah, I'm just wondering why that didn't happen in the dependency:tree output in your description. Anyway, not really important to figure that out.

@steveloughran
Contributor

This kicks in downstream depending on the order of imports: Maven resolves version conflicts closest-first in the dependency graph. If you explicitly declare hadoop-client at the top of your dependencies, everything gets reconciled consistently.
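
To make the closest-first rule concrete, here is a small Python model of it. It is a simplified sketch, not Maven's resolver: it ignores `dependencyManagement`, scopes, and exclusions, and the names `mediate`, `GRAPH`, `SQL_DEPS`, and `PINNED_DEPS` are hypothetical. The versions come from the dependency trees in the PR description, and the declaration order of `spark-sql`'s direct dependencies is assumed.

```python
from collections import deque


def mediate(direct_deps, graph):
    """Pick one version per artifact: the shallowest declaration wins,
    and the first declaration wins when depths tie.

    direct_deps: list of (artifact, version) in declaration order.
    graph: maps (artifact, version) to that artifact's own dependencies.
    """
    chosen = {}
    queue = deque(direct_deps)  # breadth-first, so nearer entries come first
    while queue:
        artifact, version = queue.popleft()
        if artifact in chosen:
            continue  # a nearer or earlier declaration already won
        chosen[artifact] = version
        queue.extend(graph.get((artifact, version), []))
    return chosen


GRAPH = {
    ("spark-core", "2.4.0-SNAPSHOT"): [("hadoop-client", "2.7.3")],
    ("hadoop-client", "2.7.3"): [("hadoop-mapreduce-client-core", "2.7.3")],
    ("orc-mapreduce", "1.4.3"): [("hadoop-mapreduce-client-core", "2.6.4")],
}

# Assumed direct dependencies of spark-sql before the fix.
SQL_DEPS = [("spark-core", "2.4.0-SNAPSHOT"), ("orc-mapreduce", "1.4.3")]
# The same module with hadoop-client declared explicitly at the top.
PINNED_DEPS = [("hadoop-client", "2.7.3")] + SQL_DEPS

if __name__ == "__main__":
    print(mediate(SQL_DEPS, GRAPH)["hadoop-mapreduce-client-core"])
    print(mediate(PINNED_DEPS, GRAPH)["hadoop-mapreduce-client-core"])
```

In this model the unpinned module resolves 2.6.4, because the copy under orc-mapreduce is one level nearer than the one under hadoop-client. Declaring hadoop-client first brings 2.7.3 to the same depth with an earlier declaration, so it wins. Declaring it last would still leave 2.6.4 selected, which is why the position at the top matters.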

@megaserg
Contributor

Thank you @dongjoon-hyun! This was also affecting our Spark job performance!

We're using mapreduce.fileoutputcommitter.algorithm.version=2 in our Spark job config, as recommended e.g. here: http://spark.apache.org/docs/latest/cloud-integration.html. We're using user-provided Hadoop 2.9.0.

However, since this 2.6.5 JAR was in spark/jars, it took priority on the classpath over the 2.9.0 JAR from the Hadoop distribution. The 2.6.5 JAR silently ignored the mapreduce.fileoutputcommitter.algorithm.version setting and used the default, slow algorithm (I believe hadoop-mapreduce-client-core had only that one, slow algorithm until 2.7.0).

I believe this affects everyone who uses any mapreduce settings with Spark 2.3.0. Great job!

Can we double-check that this JAR is not present in the "without-hadoop" Spark distribution anymore?
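
One way to check this locally is to scan the distribution's `jars` directory. This is a rough sketch in Python, assuming the jar follows the usual `artifact-version.jar` naming and that `SPARK_HOME` points at an unpacked distribution; the helpers `versions_from_names` and `bundled_versions` are hypothetical names, not part of Spark.

```python
import os
import re
import sys
from pathlib import Path

# Assumes the usual "<artifactId>-<version>.jar" file naming.
JAR_RE = re.compile(r"^hadoop-mapreduce-client-core-(?P<version>.+)\.jar$")


def versions_from_names(names):
    """Return the sorted versions of hadoop-mapreduce-client-core in `names`."""
    matches = (JAR_RE.match(name) for name in names)
    return sorted(m.group("version") for m in matches if m)


def bundled_versions(jars_dir):
    """Return the versions of the jar found directly inside `jars_dir`."""
    return versions_from_names(p.name for p in Path(jars_dir).iterdir())


if __name__ == "__main__":
    # Default to $SPARK_HOME/jars; a different directory may be passed instead.
    default_dir = os.path.join(os.environ.get("SPARK_HOME", "."), "jars")
    jars_dir = sys.argv[1] if len(sys.argv) > 1 else default_dir
    found = bundled_versions(jars_dir)
    print(found if found else "no hadoop-mapreduce-client-core jar found")
```

An empty result for the "without-hadoop" distribution would mean the jar is no longer bundled there.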

@steveloughran
Contributor

@megaserg: if you are writing to GCS or Azure, algorithm 2 is fine. If S3 is the target, it's only safe to use with a consistent store (Hadoop 3.0 + S3Guard, Amazon Consistent EMR), and you still take a major perf hit from that copy. The S3A committers in Hadoop 3.1 deliver high-performance commit semantics, and the Netflix committers don't (directly) need a consistent store, though to chain work together you will.

BTW, to verify that the v2 algorithm setting is actually being picked up: set the version to 3 and expect a stack trace from the version switch code. I use the same trick to make sure that the FileOutputCommitter isn't being picked up at all.
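
The logic of that probe can be shown with a toy model. This is not Hadoop code: `OlderCommitter`, `NewerCommitter`, and `probe` are hypothetical stand-ins, and the default of 1 and the error type are assumptions made only to keep the sketch runnable. The point is that an invalid value separates a committer that reads the setting from one that silently ignores it.

```python
KEY = "mapreduce.fileoutputcommitter.algorithm.version"


class OlderCommitter:
    """Stand-in for a committer that predates the version switch."""

    def __init__(self, conf):
        self.algorithm = 1  # the setting is never read


class NewerCommitter:
    """Stand-in for a committer that validates the configured version."""

    def __init__(self, conf):
        version = int(conf.get(KEY, 1))  # default of 1 is an assumption
        if version not in (1, 2):
            raise ValueError(f"unsupported algorithm version: {version}")
        self.algorithm = version


def probe(committer_cls):
    """Configure an invalid version and report whether it was noticed."""
    try:
        committer_cls({KEY: "3"})
    except ValueError:
        return "setting is read"
    return "setting is ignored"


if __name__ == "__main__":
    print("older:", probe(OlderCommitter))
    print("newer:", probe(NewerCommitter))
```

If the real job fails fast with version 3, the setting is reaching the committer. If it runs to completion, either an older JAR is on the classpath or a different committer is in use.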

peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
…y from `orc-mapreduce`
