[SPARK-29981][BUILD] Add hive-1.2/2.3 profiles #26619
Conversation
Hi, @srowen, @liancheng, @wangyum.
srowen
left a comment
That looks correct so far. Does the release script need to accommodate the new build combinations we've been talking about? We need at least the Hadoop 2 / Hive 1 and Hadoop 3 / Hive 2 flavors. I think I lost track of whether there was an argument for Hadoop 3 / Hive 1, if that even works.
Retest this please.

Thank you for review. Up to now, there was no request on
@srowen, @liancheng, @wangyum, @gatorsmile. Previously, it's
```diff
- "hadoop2.7": ["-Phadoop-2.7"],
- "hadoop3.2": ["-Phadoop-3.2"],
+ "hadoop2.7": ["-Phadoop-2.7", "-Phive-1.2"],
+ "hadoop3.2": ["-Phadoop-3.2", "-Phive-2.3"],
```
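For illustration, a minimal Python sketch of the mapping above. The dict mirrors the changed `dev/run-tests.py` entries; the `maven_args` helper is hypothetical and not part of Spark's scripts.

```python
# The profile mapping after this change; the helper below is illustrative only.
MODULE_PROFILES = {
    "hadoop2.7": ["-Phadoop-2.7", "-Phive-1.2"],
    "hadoop3.2": ["-Phadoop-3.2", "-Phive-2.3"],
}

def maven_args(hadoop_profile, extra=()):
    """Return the Maven CLI flags for a given Hadoop/Hive combination."""
    try:
        flags = MODULE_PROFILES[hadoop_profile]
    except KeyError:
        raise ValueError("unknown profile: %s" % hadoop_profile)
    return flags + list(extra)

print(maven_args("hadoop3.2", ["-DskipTests"]))
# → ['-Phadoop-3.2', '-Phive-2.3', '-DskipTests']
```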
Hm, @dongjoon-hyun, can you create JIRAs to follow up? Seems we will change `-Phive-2.3` to a default profile as discussed in the mailing list.

1. Verify the Hadoop 2.7* and Hive 2.3 combination in both JDK 8 and JDK 11 (WIP), as it will be the default.
2. Set up a Jenkins job for `-Phadoop-2.7 -Phive-2.3`.
3. We will need to be able to configure the Hive version and also the Hadoop version in the PR builder.
4. Change the default profile to Hive 2.3.
   - [SPARK-29981][BUILD] Add hive-1.2/2.3 profiles #26619 (comment)
5. Release script updates.
6. ... (others?)

*Hadoop 2 will be the default at this moment; it's being discussed in the mailing list.

Many things are going on, so it looks very easy to lose track.
Sure!
BTW, this thread is independent from JDK 11 or Hadoop 3. For the following item, only JDK 8 is the target of the follow-up of this PR; for JDK 11, I believe @wangyum will handle it in his ongoing work.

> Verify the Hadoop 2.7 and Hive 2.3 combination in both JDK 8 and JDK 11 (WIP), as it will be the default.

For (4), Hive 2.3 is already the default for Hadoop 3, and this PR makes Hive 2.3 the default in the pom files. So I guess what remains of (4) is equal to (2) and (3).
The follow-up issues are created and mentioned in the PR description, @HyukjinKwon .
Ah, sure, thanks for the clarification. @wangyum, can you verify the new JDK 11 combination and take follow-up action?
@HyukjinKwon I will do it later.
Test build #114192 has finished for PR 26619 at commit

Test build #114196 has finished for PR 26619 at commit

Test build #114204 has finished for PR 26619 at commit
Hi, all. I updated the PR description. This PR successfully verified the existing profiles, and it also updates the PR builder and the dependency check script in the same way as before. To update the Jenkins jobs in the new way, we need to merge this PR first because we need to use these profiles. Also, I noticed some redundancy in the GitHub Action cache; I'll handle it in another PR to avoid conflicts. The master branch already updated
Hi, @shaneknapp. We need to rename first. (SBT/Maven)
After that, we need to add the new combination. (SBT/Maven)
I'll file a JIRA for this after this PR is merged.
```xml
<hive23.version>2.3.6</hive23.version>
<!-- Version used for internal directory structure -->
<hive.version.short>1.2.1</hive.version.short>
<hive.version.short>2.3.5</hive.version.short>
```
BTW, any reason why we needed the patch version in the directory names initially? I don't think we need different shims for Hive patch versions. If that's true, I'd suggest renaming the sql/hive-thriftserver/{v1.2.1,v2.3.5} folders to just v1.2 and v2.3. Of course, this can be done in a follow-up PR.
Sure. We can use the short versions (v1.2 and v2.3). I'll do it in the next follow-up PR. It will mostly involve renaming files.
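As a toy illustration of that renaming, a hypothetical helper (not Spark code) deriving the short `major.minor` directory name from a full Hive version:

```python
def short_version(full_version):
    """Return the major.minor prefix used for shim directory names, e.g. 2.3."""
    major, minor = full_version.split(".")[:2]
    return "%s.%s" % (major, minor)

assert short_version("1.2.1") == "1.2"
assert short_version("2.3.6") == "2.3"
print("v" + short_version("2.3.5"))
# → v2.3
```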
Since we are already at Apache Hive 2.3.6, I don't expect many changes in 2.3.7+.
HyukjinKwon
left a comment
Looks fine so far to me but let me defer to other people here.
Thank you, @HyukjinKwon!
Thank you all. I'll merge this to the master and continue to work. This keeps every existing features. I'm sure it's safe until now. |
### What changes were proposed in this pull request?

This is a follow-up according to liancheng's advice.
- #26619 (comment)

### Why are the changes needed?

Previously, we carefully chose the full version. As of today, the `Apache Hive 2.3` branch seems to have become stable.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the compile combinations on GitHub Action.
1. hadoop-2.7/hive-1.2/JDK8
2. hadoop-2.7/hive-2.3/JDK8
3. hadoop-3.2/hive-2.3/JDK8
4. hadoop-3.2/hive-2.3/JDK11

Also, pass the Jenkins with `hadoop-2.7` and `hadoop-3.2` for (1) and (4). (2) and (3) are not ready in Jenkins.

Closes #26645 from dongjoon-hyun/SPARK-RENAME-HIVE-DIRECTORY.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… to `sql/core`

### What changes were proposed in this pull request?

This PR aims to relocate the following internal dependencies so that `sql/core` compiles without the `-Phive-2.3` profile.

1. Move `hive-storage-api` to `sql/core`, which is the module that really uses `hive-storage-api`.

**BEFORE (sql/core compilation)**
```
$ ./build/mvn -DskipTests --pl sql/core --am compile
...
[ERROR] [Error] /Users/dongjoon/APACHE/spark/sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala:21: object hive is not a member of package org.apache.hadoop
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
```

**AFTER (sql/core compilation)**
```
$ ./build/mvn -DskipTests --pl sql/core --am compile
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:04 min
[INFO] Finished at: 2019-11-25T00:20:11-08:00
[INFO] ------------------------------------------------------------------------
```

2. For (1), add a `commons-lang:commons-lang` test dependency to the `spark-core` module to manage the dependency explicitly. Without this, the `core` module fails to build its test classes.

```
$ ./build/mvn -DskipTests --pl core --am package -Phadoop-3.2
...
[INFO] --- scala-maven-plugin:4.3.0:testCompile (scala-test-compile-first) spark-core_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiler bridge file: /Users/dongjoon/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.10__52.0-1.3.1_20191012T045515.jar
[INFO] Compiling 271 Scala sources and 26 Java sources to /spark/core/target/scala-2.12/test-classes ...
[ERROR] [Error] /spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23: object lang is not a member of package org.apache.commons
[ERROR] [Error] /spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49: not found: value SerializationUtils
[ERROR] two errors found
```

**BEFORE (commons-lang:commons-lang)**

The following is the previous `core` module's `commons-lang:commons-lang` dependency.

1. **branch-2.4**
```
$ mvn dependency:tree -Dincludes=commons-lang:commons-lang
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) spark-core_2.11 ---
[INFO] org.apache.spark:spark-core_2.11:jar:2.4.5-SNAPSHOT
[INFO] \- org.spark-project.hive:hive-exec:jar:1.2.1.spark2:provided
[INFO]    \- commons-lang:commons-lang:jar:2.6:compile
```

2. **v3.0.0-preview (-Phadoop-3.2)**
```
$ mvn dependency:tree -Dincludes=commons-lang:commons-lang -Phadoop-3.2
[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) spark-core_2.12 ---
[INFO] org.apache.spark:spark-core_2.12:jar:3.0.0-preview
[INFO] \- org.apache.hive:hive-storage-api:jar:2.6.0:compile
[INFO]    \- commons-lang:commons-lang:jar:2.6:compile
```

3. **v3.0.0-preview (default)**
```
$ mvn dependency:tree -Dincludes=commons-lang:commons-lang
[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) spark-core_2.12 ---
[INFO] org.apache.spark:spark-core_2.12:jar:3.0.0-preview
[INFO] \- org.apache.hadoop:hadoop-client:jar:2.7.4:compile
[INFO]    \- org.apache.hadoop:hadoop-common:jar:2.7.4:compile
[INFO]       \- commons-lang:commons-lang:jar:2.6:compile
```

**AFTER (commons-lang:commons-lang)**
```
$ mvn dependency:tree -Dincludes=commons-lang:commons-lang
[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) spark-core_2.12 ---
[INFO] org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT
[INFO] \- commons-lang:commons-lang:jar:2.6:test
```

Since we wanted to verify that this PR doesn't change the `hive-1.2` profile, we merged [SPARK-30005 Update `test-dependencies.sh` to check `hive-1.2/2.3` profile](a1706e2) before this PR.

### Why are the changes needed?

- Apache Spark 2.4's `sql/core` uses `Apache ORC (nohive)` jars, including a shaded `hive-storage-api`, to access ORC data sources.
- Apache Spark 3.0's `sql/core` uses `Apache Hive` jars directly. Previously, `-Phadoop-3.2` hid this `hive-storage-api` dependency; now we are using `-Phive-2.3` instead. As I mentioned [previously](#26619 (comment)), this PR is required to compile the `sql/core` module without `-Phive-2.3`.
- For `sql/hive` and `sql/hive-thriftserver`, it's natural that we need `-Phive-1.2` or `-Phive-2.3`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This will pass the Jenkins (with the dependency check and unit tests). We need to check manually with `./build/mvn -DskipTests --pl sql/core --am compile`.

This closes #26657.

Closes #26658 from dongjoon-hyun/SPARK-30015.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
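The `dependency:tree` outputs above can be checked mechanically. A small hypothetical sketch (not part of Spark's tooling) that extracts the resolved Maven scope of a coordinate from `mvn dependency:tree` text output:

```python
import re

def find_scope(tree_output, group_artifact):
    """Return the Maven scope of group:artifact in `dependency:tree` text, or None."""
    pattern = re.compile(re.escape(group_artifact) + r":jar:[^:\s]+:(\w+)")
    for line in tree_output.splitlines():
        m = pattern.search(line)
        if m:
            return m.group(1)
    return None

after = "[INFO] \\- commons-lang:commons-lang:jar:2.6:test"
print(find_scope(after, "commons-lang:commons-lang"))
# → test
```

Run against the BEFORE and AFTER outputs above, it would report the scope changing from `compile` to `test`.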
The default of AS-IS

One additional thing is
Hey @dongjoon-hyun, was investigating some dependency related issues, and found that we are using the I'm not quite familiar with the |
|
Yes. It's correct. The removal of For |
|
Historically I was against the change because |
|
Thank you very much for the context, @dongjoon-hyun! |
|
@dongjoon-hyun i found hive storage api and hive common have the following common class files For example, https://github.com/apache/hive/blob/rel/storage-release-2.6.0/storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java (pulled in by orc 1.5.8) and https://github.com/apache/hive/blob/rel/release-2.3.6/common/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java (from hive-common 2.3.6) both are in the classpath and they are different. I am worried that users may hit issues due to classloading order. I think it will be safe to still use nohive classifier for orc, which shades storage api. I think there is no downside of using orc nohive, right? Why not just keep using that. So, we will not need to worry about having class conflicts. |
I created https://issues.apache.org/jira/browse/SPARK-30784 for switching back to orc-nohive.
Thank you, @yhuai . During weekend, I missed your comment here. |
|
BTW, @yhuai . As I mentioned before to @liancheng , the transition from
Let me you ping on the right PR. |
|
@dongjoon-hyun have we tried to change hive version to 3.x in spark in order to make it work with hive metastore 3.x. |
|
Did you try For some known issues, we are upgrading the built-in hive module to 2.3.9 here. |
|
@dongjoon-hyun yes have tried it but it fails with class not found exceptions like and I checked this LoadTableDesc$LoadFileType (enum) is added in hive 3.x |
|
@dongjoon-hyun I get this exception when i try to insert data into table. I checked that spark_home/jars contain only hive 2.3.7 jars hence we are getting this exception. query which does not work I also noticed that when i make table as acid then it works. |
What changes were proposed in this pull request?
This PR aims at the following.

- Add `hive-1.2` and `hive-2.3` (default) profiles.

For now, we assume that `hive-1.2` is explicitly used with `hadoop-2.7` and `hive-2.3` with `hadoop-3.2`. The following are beyond the scope of this PR:

- the `hive-1.2/2.3` combination
- the `hive-1.2/2.3` combination
- `hive-1.2/2.3` in the PR Builder

Why are the changes needed?
This will help us switch our dependencies in order to update the exposed dependencies.
Does this PR introduce any user-facing change?
This is a dev-only change: the build profile combinations change as follows.

- `-Phadoop-2.7` => `-Phadoop-2.7 -Phive-1.2`
- `-Phadoop-3.2` => `-Phadoop-3.2 -Phive-2.3`

How was this patch tested?
Pass the Jenkins with the dependency check and tests to make sure we don't change anything for now.
Also, from now, GitHub Action validates the following combinations.
