Conversation

@dongjoon-hyun dongjoon-hyun commented Nov 25, 2019

What changes were proposed in this pull request?

This PR aims to relocate the following internal dependencies so that sql/core compiles without the -Phive-2.3 profile.

  1. Move the hive-storage-api dependency to sql/core, the module that actually uses hive-storage-api.

BEFORE (sql/core compilation)

$ ./build/mvn -DskipTests --pl sql/core --am compile
...
[ERROR] [Error] /Users/dongjoon/APACHE/spark/sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala:21: object hive is not a member of package org.apache.hadoop
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------

AFTER (sql/core compilation)

$ ./build/mvn -DskipTests --pl sql/core --am compile
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  02:04 min
[INFO] Finished at: 2019-11-25T00:20:11-08:00
[INFO] ------------------------------------------------------------------------
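Concretely, the relocation amounts to declaring the dependency directly in sql/core/pom.xml instead of inheriting it from the hive-2.3 profile. A sketch of the kind of declaration involved (the version shown matches the 2.6.0 artifact in the dependency trees below, but the exact coordinates and version handling in the PR may differ):

```xml
<!-- sql/core/pom.xml (illustrative sketch): declare hive-storage-api
     directly so sql/core compiles without the -Phive-2.3 profile. -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-storage-api</artifactId>
  <version>2.6.0</version>
</dependency>
```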
  2. Add a commons-lang:commons-lang test dependency to the spark-core module to manage the dependency explicitly. Without this, the core module fails to build its test classes.

$ ./build/mvn -DskipTests --pl core --am package -Phadoop-3.2
...
[INFO] --- scala-maven-plugin:4.3.0:testCompile (scala-test-compile-first) @ spark-core_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiler bridge file: /Users/dongjoon/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.10__52.0-1.3.1_20191012T045515.jar
[INFO] Compiling 271 Scala sources and 26 Java sources to /spark/core/target/scala-2.12/test-classes ...
[ERROR] [Error] /spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23: object lang is not a member of package org.apache.commons
[ERROR] [Error] /spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49: not found: value SerializationUtils
[ERROR] two errors found

BEFORE (commons-lang:commons-lang)

The following shows where the core module's commons-lang:commons-lang dependency previously came from.

  1. branch-2.4
$ mvn dependency:tree -Dincludes=commons-lang:commons-lang
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-core_2.11 ---
[INFO] org.apache.spark:spark-core_2.11:jar:2.4.5-SNAPSHOT
[INFO] \- org.spark-project.hive:hive-exec:jar:1.2.1.spark2:provided
[INFO]    \- commons-lang:commons-lang:jar:2.6:compile
  2. v3.0.0-preview (-Phadoop-3.2)
$ mvn dependency:tree -Dincludes=commons-lang:commons-lang -Phadoop-3.2
[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ spark-core_2.12 ---
[INFO] org.apache.spark:spark-core_2.12:jar:3.0.0-preview
[INFO] \- org.apache.hive:hive-storage-api:jar:2.6.0:compile
[INFO]    \- commons-lang:commons-lang:jar:2.6:compile
  3. v3.0.0-preview (default)
$ mvn dependency:tree -Dincludes=commons-lang:commons-lang
[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ spark-core_2.12 ---
[INFO] org.apache.spark:spark-core_2.12:jar:3.0.0-preview
[INFO] \- org.apache.hadoop:hadoop-client:jar:2.7.4:compile
[INFO]    \- org.apache.hadoop:hadoop-common:jar:2.7.4:compile
[INFO]       \- commons-lang:commons-lang:jar:2.6:compile

AFTER (commons-lang:commons-lang)

$ mvn dependency:tree -Dincludes=commons-lang:commons-lang
[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ spark-core_2.12 ---
[INFO] org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT
[INFO] \- commons-lang:commons-lang:jar:2.6:test
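The explicit test-scope declaration in core/pom.xml would look roughly like the following (an illustrative sketch; version 2.6 matches the tree output above, though Spark normally manages versions in the parent pom):

```xml
<!-- core/pom.xml (illustrative sketch): manage commons-lang explicitly
     as a test-only dependency (used by PropertiesCloneBenchmark.scala). -->
<dependency>
  <groupId>commons-lang</groupId>
  <artifactId>commons-lang</artifactId>
  <version>2.6</version>
  <scope>test</scope>
</dependency>
```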

Since we wanted to verify that this PR doesn't change the hive-1.2 profile, we merged
SPARK-30005 (Update test-dependencies.sh to check the hive-1.2/2.3 profiles) before this PR.

Why are the changes needed?

  • Apache Spark 2.4's sql/core is using Apache ORC (nohive) jars including shaded hive-storage-api to access ORC data sources.

  • Apache Spark 3.0's sql/core is using Apache Hive jars directly. Previously, the -Phadoop-3.2 profile hid this hive-storage-api dependency. Now, we use -Phive-2.3 instead. As mentioned above, this PR is required to compile the sql/core module without -Phive-2.3.

  • For sql/hive and sql/hive-thriftserver, it's natural that we need -Phive-1.2 or -Phive-2.3.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This will pass Jenkins (with the dependency check and unit tests).

We need to check manually with ./build/mvn -DskipTests --pl sql/core --am compile.

This closes #26657 .

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-30015][BUILD] Move hive-storage-api dependency from hive-2.3 to sql/core [WIP][SPARK-30015][BUILD] Move hive-storage-api dependency from hive-2.3 to sql/core Nov 25, 2019

@dongjoon-hyun dongjoon-hyun changed the title [WIP][SPARK-30015][BUILD] Move hive-storage-api dependency from hive-2.3 to sql/core [SPARK-30015][BUILD] Move hive-storage-api dependency from hive-2.3 to sql/core Nov 25, 2019
@dongjoon-hyun (Member, Author) commented:

Retest this please.

@dongjoon-hyun dongjoon-hyun removed the SQL label Nov 25, 2019

SparkQA commented Nov 25, 2019

Test build #114395 has finished for PR 26658 at commit e091ecc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) left a comment:

I'm still seeing build failures even with this PR:
[ERROR] [Error] /home/tgraves/workspace/tgravescs-spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:32: object serde is not a member of package org.apache.hadoop.hive
[ERROR] [Error] /home/tgraves/workspace/tgravescs-spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:1343: not found: value SERIALIZATION_FORMAT

build/mvn -Phadoop-2.7 -Phive -Pyarn -Pkinesis-asl -Pkubernetes -Pmesos -Phadoop-cloud -Pspark-ganglia-lgpl package -DskipTests

Maybe I still need the -Phive-2.3?

dongjoon-hyun commented Nov 25, 2019

Hi, @srowen and @tgravescs . Thank you for review.

  • The commons-lang dependency in core/pom.xml is changed to a test dependency. This is required with -Phadoop-3.2.
  • The Hive module still needs -Phive-1.2 or -Phive-2.3, because sql/hive/pom.xml adds some dependencies under the hive-2.3 profile. For the Hive module, it's natural to choose one of -Phive-1.2 or -Phive-2.3. This PR focuses only on the core and sql/core modules for now.
$ git grep hive-2.3 | grep pom.xml
pom.xml:      <id>hive-2.3</id>
sql/hive/pom.xml:      <id>hive-2.3</id>

BTW, although it's orthogonal to this PR, we had better remove the commons-lang test dependency by rewriting PropertiesCloneBenchmark.scala, since it doesn't give us much value.

@tgravescs (Contributor) commented:

So what are the rest of the JIRAs to fix the build?

It's not very user friendly that we broke the default build, and it's not clear which options actually work. I tried building four different ways before finding one that worked.

@dongjoon-hyun (Member, Author) commented:

@tgravescs . After merging this, I'd like to change the following.

sql/hive/pom.xml:      <id>hive-2.3</id>

BTW, which one do you expect when you use -Phive: -Phive-1.2 or -Phive-2.3?

so what are the rest of the jiras to fix the build?

I'm wondering about your preference between Hive 1.2.1 and Hive 2.3.6.

@dongjoon-hyun (Member, Author) commented:

Thank you for approval. Sorry for the inconvenience.
This PR already passed #26658 (comment) . After that, we only removed an empty line and changed the dependency scope to test. Since GitHub Actions already verified the compilation, I'll merge this.

@tgravescs (Contributor) commented:

I'm not sure I saw an answer on the mailing-list thread. If we are considering Hive 1.2.1 deprecated, perhaps we should make Hive 2.3 the default. Also, for the hadoop-3.2 profile, the only one that works is Hive 2.3, right?
I actually had -Phive in there accidentally, left over from building an older version; it was removed in previous 3.0 builds. So I'm OK with not having that.

@dongjoon-hyun (Member, Author) commented:

Thanks for confirming. Yes. Right. hadoop-3.2 only works with Hive 2.3.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-30015 branch November 25, 2019 19:17

SparkQA commented Nov 25, 2019

Test build #114416 has finished for PR 26658 at commit afb1af1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 25, 2019

Test build #114417 has finished for PR 26658 at commit 3ecdede.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
