[SPARK-47119][BUILD] Add hive-jackson-provided profile
#45201
Conversation
Could you review this PR, @viirya?

I may not have enough context. The existing …

Thank you for the review. This PR is for the users who keep Hive and exclude …
```
<id>hive-provided</id>
<properties>
  <hive.deps.scope>provided</hive.deps.scope>
  <hive.jackson.scope>provided</hive.jackson.scope>
```
Hmm, in both the `hive-provided` and `hive-jackson-provided` profiles the config value is `provided`. What's the difference between them?
Previously, these dependencies followed `hive.deps.scope`, like the following, so we need to preserve the existing behavior.
```
- <scope>${hive.deps.scope}</scope>
+ <scope>${hive.jackson.scope}</scope>
```
Oh, `hive-jackson-provided` has Hive in `compile` scope and only Jackson in `provided` scope.
The name `hive-jackson-provided` makes me think both are provided.
Yes!
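To summarize the thread above, here is a minimal sketch of how the two profiles differ, reconstructed only from the snippets quoted in this review (the property names `hive.deps.scope` and `hive.jackson.scope` appear in the diff; the actual `pom.xml` in this PR may differ in detail):

```
<!-- Sketch reconstructed from the review snippets above; not the exact pom.xml of this PR. -->
<profile>
  <id>hive-provided</id>
  <properties>
    <!-- Hive and its CodeHaus Jackson dependencies are all provided. -->
    <hive.deps.scope>provided</hive.deps.scope>
    <hive.jackson.scope>provided</hive.jackson.scope>
  </properties>
</profile>

<profile>
  <id>hive-jackson-provided</id>
  <properties>
    <!-- Hive stays in the default compile scope; only the dependencies whose scope
         is ${hive.jackson.scope} (the CodeHaus Jackson ones) become provided. -->
    <hive.jackson.scope>provided</hive.jackson.scope>
  </properties>
</profile>
```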
To @xinrong-meng and @viirya, I updated the PR description. You can see that the Hive jars are there and only the CodeHaus Jackson jars are gone.

Thank you, @viirya!

Merged to master for Apache Spark 4.0.0.

Makes sense, thank you @dongjoon-hyun!

Thank you, @xinrong-meng.
… a new optional directory

### What changes were proposed in this pull request?
This PR aims to provide `Apache Hive`'s `CodeHaus Jackson` dependencies via a new optional directory, `hive-jackson`, instead of the standard `jars` directory of the Apache Spark binary distribution. Additionally, two internal configurations are added whose default values are `hive-jackson/*`.
- `spark.driver.defaultExtraClassPath`
- `spark.executor.defaultExtraClassPath`

For example, Apache Spark distributions have been providing the `spark-*-yarn-shuffle.jar` file under the `yarn` directory instead of `jars`.

**YARN SHUFFLE EXAMPLE**
```
$ ls -al yarn/*jar
-rw-r--r-- 1 dongjoon staff 77352048 Sep 8 19:08 yarn/spark-3.5.0-yarn-shuffle.jar
```

This PR changes `Apache Hive`'s `CodeHaus Jackson` dependencies in a similar way.

**BEFORE**
```
$ ls -al jars/*asl*
-rw-r--r-- 1 dongjoon staff 232248 Sep 8 19:08 jars/jackson-core-asl-1.9.13.jar
-rw-r--r-- 1 dongjoon staff 780664 Sep 8 19:08 jars/jackson-mapper-asl-1.9.13.jar
```

**AFTER**
```
$ ls -al jars/*asl*
zsh: no matches found: jars/*asl*

$ ls -al hive-jackson
total 1984
drwxr-xr-x  4 dongjoon staff    128 Feb 23 15:37 .
drwxr-xr-x 16 dongjoon staff    512 Feb 23 16:34 ..
-rw-r--r--  1 dongjoon staff 232248 Feb 23 15:37 jackson-core-asl-1.9.13.jar
-rw-r--r--  1 dongjoon staff 780664 Feb 23 15:37 jackson-mapper-asl-1.9.13.jar
```

### Why are the changes needed?
Since Apache Hadoop 3.3.5, only Apache Hive requires the old CodeHaus Jackson dependencies. Apache Spark 3.5.0 tried to eliminate them completely, but that was reverted due to Hive UDF support.
- #40893
- #42446

SPARK-47119 added a way to exclude the Apache Hive Jackson dependencies at the distribution building stage for Apache Spark 4.0.0.
- #45201

This PR provides a way to exclude the Apache Hive Jackson dependencies at runtime for Apache Spark 4.0.0.
- Spark Shell without the Apache Hive Jackson dependencies:
```
$ bin/spark-shell --driver-default-class-path ""
```
- Spark SQL Shell without the Apache Hive Jackson dependencies:
```
$ bin/spark-sql --driver-default-class-path ""
```
- Spark Thrift Server without the Apache Hive Jackson dependencies:
```
$ sbin/start-thriftserver.sh --driver-default-class-path ""
```

In addition, last but not least, this PR eliminates the `CodeHaus Jackson` dependencies from the following Apache Spark daemons (started via `spark-daemon.sh start`) because they don't require Hive's `CodeHaus Jackson` dependencies:
- Spark Master
- Spark Worker
- Spark History Server
```
$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
```

### Does this PR introduce _any_ user-facing change?
No. There is no user-facing change by default.
- For distributions built with the `hive-jackson-provided` profile, the `scope` of the Apache Hive Jackson dependencies is `provided` and the `hive-jackson` directory is not created at all.
- For distributions with the default setting, the `scope` of the Apache Hive Jackson dependencies is still `compile`. In addition, they are in Apache Spark's built-in class path like the following.
- The following Spark daemons don't use the `CodeHaus Jackson` dependencies:
  - Spark Master
  - Spark Worker
  - Spark History Server

### How was this patch tested?
Pass the CIs, and manually build a distribution and check the class paths in the `Environment` tab.
```
$ dev/make-distribution.sh -Phive,hive-thriftserver
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45237 from dongjoon-hyun/SPARK-47152.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
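As a complement to the `--driver-default-class-path ""` examples above, here is a hypothetical sketch of overriding the two new configurations directly. The configuration names and their `hive-jackson/*` default come from the description above; the `/opt/codehaus-jackson` directory is an invented placeholder, and since the configurations are described as internal, this override is an assumption rather than documented usage.

```
# Hypothetical sketch: point the default extra class path at a custom directory
# instead of the bundled hive-jackson/* (the path below is a placeholder, not from the PR).
$ bin/spark-shell \
    --conf spark.driver.defaultExtraClassPath="/opt/codehaus-jackson/*" \
    --conf spark.executor.defaultExtraClassPath="/opt/codehaus-jackson/*"
```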
### What changes were proposed in this pull request?
This PR aims to provide a new profile, `hive-jackson-provided`, for Apache Spark 4.0.0.

### Why are the changes needed?
Since Apache Hadoop 3.3.5, only Apache Hive requires the old CodeHaus Jackson dependencies. Apache Spark 3.5.0 tried to eliminate them completely, but that was reverted due to Hive UDF support.
- apache#40893
- apache#42446

To allow Apache Spark 4.0 users:
- to provide their own CodeHaus Jackson libraries, or
- to exclude them completely if they don't use `Hive UDF`.

### Does this PR introduce _any_ user-facing change?
No, this is a new profile.

### How was this patch tested?
Pass the CIs and a manual build.

**Without `hive-jackson-provided`**
```
$ dev/make-distribution.sh -Phive,hive-thriftserver
$ ls -al dist/jars/*asl*
-rw-r--r-- 1 dongjoon staff 232248 Feb 21 10:53 dist.org/jars/jackson-core-asl-1.9.13.jar
-rw-r--r-- 1 dongjoon staff 780664 Feb 21 10:53 dist.org/jars/jackson-mapper-asl-1.9.13.jar
```

**With `hive-jackson-provided`**
```
$ dev/make-distribution.sh -Phive,hive-thriftserver,hive-jackson-provided
$ ls -al dist/jars/*asl*
zsh: no matches found: dist/jars/*asl*

$ ls -al dist/jars/*hive*
-rw-r--r-- 1 dongjoon staff   183633 Feb 21 11:00 dist/jars/hive-beeline-2.3.9.jar
-rw-r--r-- 1 dongjoon staff    44704 Feb 21 11:00 dist/jars/hive-cli-2.3.9.jar
-rw-r--r-- 1 dongjoon staff   436169 Feb 21 11:00 dist/jars/hive-common-2.3.9.jar
-rw-r--r-- 1 dongjoon staff 10840949 Feb 21 11:00 dist/jars/hive-exec-2.3.9-core.jar
-rw-r--r-- 1 dongjoon staff   116364 Feb 21 11:00 dist/jars/hive-jdbc-2.3.9.jar
-rw-r--r-- 1 dongjoon staff   326585 Feb 21 11:00 dist/jars/hive-llap-common-2.3.9.jar
-rw-r--r-- 1 dongjoon staff  8195966 Feb 21 11:00 dist/jars/hive-metastore-2.3.9.jar
-rw-r--r-- 1 dongjoon staff   916630 Feb 21 11:00 dist/jars/hive-serde-2.3.9.jar
-rw-r--r-- 1 dongjoon staff  1679366 Feb 21 11:00 dist/jars/hive-service-rpc-3.1.3.jar
-rw-r--r-- 1 dongjoon staff    53902 Feb 21 11:00 dist/jars/hive-shims-0.23-2.3.9.jar
-rw-r--r-- 1 dongjoon staff     8786 Feb 21 11:00 dist/jars/hive-shims-2.3.9.jar
-rw-r--r-- 1 dongjoon staff   120293 Feb 21 11:00 dist/jars/hive-shims-common-2.3.9.jar
-rw-r--r-- 1 dongjoon staff    12923 Feb 21 11:00 dist/jars/hive-shims-scheduler-2.3.9.jar
-rw-r--r-- 1 dongjoon staff   258346 Feb 21 11:00 dist/jars/hive-storage-api-2.8.1.jar
-rw-r--r-- 1 dongjoon staff   581739 Feb 21 11:00 dist/jars/spark-hive-thriftserver_2.13-4.0.0-SNAPSHOT.jar
-rw-r--r-- 1 dongjoon staff   687446 Feb 21 11:00 dist/jars/spark-hive_2.13-4.0.0-SNAPSHOT.jar
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#45201 from dongjoon-hyun/SPARK-47119.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
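For the "provide their own CodeHaus Jackson libraries" option mentioned above, one plausible approach is Spark's standard `--jars` option; a hedged sketch follows (the jar paths are placeholders, not from this PR):

```
# Hypothetical sketch: a user of a hive-jackson-provided distribution who still needs
# Hive UDFs supplies the CodeHaus Jackson jars themselves (paths are placeholders).
$ bin/spark-sql \
    --jars /path/to/jackson-core-asl-1.9.13.jar,/path/to/jackson-mapper-asl-1.9.13.jar
```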
@dongjoon-hyun Jackson 1.x can be removed after SPARK-47018 (bump Hive 2.3.10); what should we do for …?

It's supposed to be here as a last resort until we release Apache Spark 4.0.0 successfully without reverting Hive 2.3.10, @pan3793.
What changes were proposed in this pull request?

This PR aims to provide a new profile, `hive-jackson-provided`, for Apache Spark 4.0.0.

Why are the changes needed?

Since Apache Hadoop 3.3.5, only Apache Hive requires the old CodeHaus Jackson dependencies. Apache Spark 3.5.0 tried to eliminate them completely, but that was reverted due to Hive UDF support.

To allow Apache Spark 4.0 users:
- to provide their own CodeHaus Jackson libraries, or
- to exclude them completely if they don't use `Hive UDF`.

Does this PR introduce any user-facing change?

No, this is a new profile.

How was this patch tested?

Pass the CIs and a manual build, both without and with the `hive-jackson-provided` profile (the full `dist/jars` listings are in the merged commit message above).

Was this patch authored or co-authored using generative AI tooling?

No.
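As a usage note, the new profile should also be selectable in a regular Maven build, not only through `dev/make-distribution.sh`; a hedged sketch using standard Spark/Maven build options (not quoted from this PR):

```
# Hypothetical sketch: activating the profile in a plain Maven build.
$ ./build/mvn -Phive,hive-thriftserver,hive-jackson-provided -DskipTests package
```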