
Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Feb 24, 2024

What changes were proposed in this pull request?

This PR aims to provide Apache Hive's CodeHaus Jackson dependencies via a new optional directory, hive-jackson, instead of the standard jars directory of the Apache Spark binary distribution. Additionally, two internal configurations are added whose default values are hive-jackson/*.

  • spark.driver.defaultExtraClassPath
  • spark.executor.defaultExtraClassPath

For example, Apache Spark distributions have been providing the spark-*-yarn-shuffle.jar file under the yarn directory instead of jars.

YARN SHUFFLE EXAMPLE

$ ls -al yarn/*jar
-rw-r--r--  1 dongjoon  staff  77352048 Sep  8 19:08 yarn/spark-3.5.0-yarn-shuffle.jar

This PR changes Apache Hive's CodeHaus Jackson dependencies in a similar way.

BEFORE

$ ls -al jars/*asl*
-rw-r--r--  1 dongjoon  staff  232248 Sep  8 19:08 jars/jackson-core-asl-1.9.13.jar
-rw-r--r--  1 dongjoon  staff  780664 Sep  8 19:08 jars/jackson-mapper-asl-1.9.13.jar

AFTER

$ ls -al jars/*asl*
zsh: no matches found: jars/*asl*

$ ls -al hive-jackson
total 1984
drwxr-xr-x   4 dongjoon  staff     128 Feb 23 15:37 .
drwxr-xr-x  16 dongjoon  staff     512 Feb 23 16:34 ..
-rw-r--r--   1 dongjoon  staff  232248 Feb 23 15:37 jackson-core-asl-1.9.13.jar
-rw-r--r--   1 dongjoon  staff  780664 Feb 23 15:37 jackson-mapper-asl-1.9.13.jar

Why are the changes needed?

Since Apache Hadoop 3.3.5, only Apache Hive requires the old CodeHaus Jackson dependencies.

Apache Spark 3.5.0 tried to eliminate them completely, but that was reverted due to Hive UDF support (apache#40893, apache#42446).

SPARK-47119 (apache#45201) added a way to exclude the Apache Hive Jackson dependencies at the distribution-building stage for Apache Spark 4.0.0.

This PR provides a way to exclude the Apache Hive Jackson dependencies at runtime for Apache Spark 4.0.0.

  • Spark Shell without Apache Hive Jackson dependencies.
$ bin/spark-shell --driver-default-class-path ""
  • Spark SQL Shell without Apache Hive Jackson dependencies.
$ bin/spark-sql --driver-default-class-path ""
  • Spark Thrift Server without Apache Hive Jackson dependencies.
$ sbin/start-thriftserver.sh --driver-default-class-path ""
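
The same exclusion can presumably also be done purely via configurations at submit time (the application class and jar below are hypothetical placeholders):

$ bin/spark-submit \
    -c spark.driver.defaultExtraClassPath="" \
    -c spark.executor.defaultExtraClassPath="" \
    --class org.example.MyApp my-app.jar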

In addition, last but not least, this PR eliminates the CodeHaus Jackson dependencies from the following Apache Spark daemons (started via spark-daemon.sh start) because they don't require Hive's CodeHaus Jackson dependencies:

  • Spark Master
  • Spark Worker
  • Spark History Server
$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \

Does this PR introduce any user-facing change?

No. There is no user-facing change by default.

  • For distributions built with the hive-jackson-provided profile, the scope of the Apache Hive Jackson dependencies is provided and the hive-jackson directory is not created at all.
  • For distributions with the default settings, the scope of the Apache Hive Jackson dependencies is still compile. In addition, they are on Apache Spark's built-in class path, as shown in the following screenshot.

Screenshot 2024-02-23 at 16 48 08 (the built-in class path, including hive-jackson/*, in the Environment tab)

  • The following Spark daemons don't use the CodeHaus Jackson dependencies.
    • Spark Master
    • Spark Worker
    • Spark History Server

How was this patch tested?

Pass the CIs, manually build a distribution, and check the class paths in the Environment tab.

$ dev/make-distribution.sh -Phive,hive-thriftserver
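
A distribution built with the hive-jackson-provided profile could presumably be checked the same way (the dist output path and the ls output below are illustrative assumptions):

$ dev/make-distribution.sh -Phive,hive-thriftserver,hive-jackson-provided
$ ls dist/hive-jackson
ls: dist/hive-jackson: No such file or directory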

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun
Member Author

cc @viirya

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-47152][SQL][BUILD] Provide Apache Hive Jackson dependency via a new optional directory [SPARK-47152][SQL][BUILD] Provide CodeHaus Jackson dependencies via a new optional directory Feb 24, 2024
Comment on lines +57 to +60
/** Configuration key for the driver default extra class path. */
public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH =
    "spark.driver.defaultExtraClassPath";
public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE = "hive-jackson/*";
Member

So even users who use non-Hive distributions will also have this default class path value in the extra class path, right?

Member Author

Yes, the configuration simply points to a non-existent directory. As you can see in make-distribution.sh and the PR description, the hive-jackson directory is created only when jackson-*-asl-*.jar files exist.
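
A quick hedged sanity check on such a distribution (the output below is illustrative): the directory is simply absent, and a class-path entry that points at a non-existent directory is silently ignored by the JVM.

$ ls hive-jackson
ls: hive-jackson: No such file or directory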

Comment on lines +192 to +196
# Only create the hive-jackson directory if they exist.
for f in "$DISTDIR"/jars/jackson-*-asl-*.jar; do
mkdir -p "$DISTDIR"/hive-jackson
mv $f "$DISTDIR"/hive-jackson/
done
Member

Btw, what's the benefit of having a separate class path for the hive-jackson jars?

Member Author
@dongjoon-hyun dongjoon-hyun Feb 24, 2024

There are 5 main benefits, similar to the yarn directory, @viirya.

  1. The following Apache Spark daemons (which use sbin/spark-daemon.sh start) will ignore the hive-jackson directory.
    • Spark Master
    • Spark Worker
    • Spark History Server
$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
  2. Recoverability: As-is, Spark 3 users can achieve the same goal by manually deleting those two files from Spark's jars directory. However, it's difficult to recover the deleted files when a production job fails due to a Hive UDF. This PR provides a more robust and safer way via a configuration.

  3. Communication: We (and the security team) can easily communicate that hive-jackson is not used, like the yarn directory, because it's physically separated in the distribution. They can also delete the directory easily (if they need to) without knowing the details of the dependencies inside it.

  4. Robustness: If Apache Spark has everything in jars, it's difficult to prevent those jars from loading. Of course, we could choose a tricky way to filter them out of the class file lists via a naming pattern, but that is still less robust from a long-term perspective.

  5. Compatibility with hive-jackson-provided: Together with the existing hive-jackson-provided profile, this PR provides a cleaner injection point for the provided dependencies. For example, custom-built Jackson dependencies can be placed in hive-jackson (after users create it) instead of jars. We are very reluctant to have someone put custom jar files directly into Apache Spark's jars directory; the hive-jackson directory is a more recommendable destination than copying into Spark's jars directory, as sketched below.
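
For example, a minimal sketch of that workflow on a hive-jackson-provided build (the jar paths and names are hypothetical):

$ mkdir -p "$SPARK_HOME"/hive-jackson
$ cp /path/to/custom/jackson-core-asl.jar "$SPARK_HOME"/hive-jackson/
$ cp /path/to/custom/jackson-mapper-asl.jar "$SPARK_HOME"/hive-jackson/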

Comment on lines +507 to +508
case DRIVER_DEFAULT_CLASS_PATH ->
    conf.put(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH, value);
Member

Hmm, why can spark-submit only specify the driver default extra class path? Can't it also specify the executor default extra class path?

Member Author

No~ This follows --driver-class-path, which is a special case because the driver JVM has already started.

We already have the more general spark.driver.extraClassPath and spark.executor.extraClassPath for the driver and executors. This PR extends them with the following:

spark.driver.defaultExtraClassPath
spark.executor.defaultExtraClassPath


Member Author

BTW, is your concern that the following is insufficient?

  private[spark] val EXECUTOR_CLASS_PATH =
    ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_CLASSPATH)
      .withPrepended(EXECUTOR_DEFAULT_EXTRA_CLASS_PATH.key, File.pathSeparator)
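
As a rough illustration of the intended prepend behavior (assuming the default is prepended with the path separator; the jar path and application below are hypothetical):

$ bin/spark-submit -c spark.executor.extraClassPath=/opt/extra.jar --class org.example.MyApp my-app.jar
# effective executor extra class path (illustrative): hive-jackson/*:/opt/extra.jar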

Member Author

Let me double-check it.

Member Author

To @viirya ,

  • In general, bin/spark-submit ... -c spark.driver.defaultExtraClassPath="" -c spark.executor.defaultExtraClassPath="" is the way to launch without the hive-jackson dependencies, because hive-jackson/* is provided by those configurations. This always works for drivers in cluster-mode submission and for executors in both client and cluster submission modes.
  • Apache Spark already provides --driver-class-path, which is used only for the cases where we cannot use spark.driver.extraClassPath. In those cases, spark.driver.defaultExtraClassPath is also not applicable. So, this PR adds --driver-default-class-path "" in the same way.

Member Author

This is identical to how the existing spark.*.extraClassPath configurations work in Apache Spark.

Member

Yea, -c spark.driver.defaultExtraClassPath="" should be the same as --driver-default-class-path "", so I wonder why we especially need it. Also, what's the case where we cannot specify it through -c spark.driver.defaultExtraClassPath?

It's probably okay since we already have --driver-class-path and this addition follows it. Just a question in case you know the exact reason.

Member Author

--driver-class-path is required because Spark's driver JVM is started before Spark's property files or config files are loaded.

Member Author

We cannot create the SparkConf class if we cannot load Spark's jar files. This is a chicken-and-egg situation. To solve it, we blindly load jars/*. So, if we have other jars, we need to use --driver-class-path.
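
For example (a hypothetical illustration with a made-up jar path), such extra driver-side jars are supplied on the command line so they are visible before any property file is read:

$ bin/spark-shell --driver-class-path /opt/extra/my-custom.jar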

Comment on lines +274 to +275
effectiveConfig.putIfAbsent(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH,
    SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE);
Member

Classes not triggered via SparkLauncher (e.g., the Spark daemons you mentioned) are not affected, right?

Member

And, same question, why is only the driver default extra class path specified here?

Member Author

Yes, the following spark-daemon.sh start usages are not affected, @viirya.

$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \

And, same question, why is only the driver default extra class path specified here?

This is a command on the driver side. The executors are supposed to use spark.executor.extraClassPath and spark.executor.defaultExtraClassPath.

Member

Hmm, why do we need to deal with SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH specially here? I think Spark configs should be handled together above:

p.stringPropertyNames().forEach(key ->
         effectiveConfig.computeIfAbsent(key, p::getProperty));

Member Author
@dongjoon-hyun dongjoon-hyun Feb 24, 2024

This is required because, when the configuration is loaded from the file, p.stringPropertyNames() doesn't contain SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH, so .forEach is not invoked at all.

Properties p = loadPropertiesFile();
      p.stringPropertyNames().

Member Author

Yes, here.

When getEffectiveConfig is used like this in SparkSubmitCommandBuilder.java, config is a Java Map.
So, String defaultExtraClassPath = config.get(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH); becomes null if the key is absent.

    Map<String, String> config = getEffectiveConfig();
    boolean isClientMode = isClientMode(config);
    String extraClassPath = isClientMode ? config.get(SparkLauncher.DRIVER_EXTRA_CLASSPATH) : null;
    String defaultExtraClassPath = config.get(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH);
    if (extraClassPath == null || extraClassPath.trim().isEmpty()) {

Member

I see. Though this is only used by SparkSubmitCommandBuilder, not SparkClassCommandBuilder, so I wonder if we should put it directly in SparkSubmitCommandBuilder.

Member Author
@dongjoon-hyun dongjoon-hyun Feb 24, 2024

getEffectiveConfig is shared by other classes, not only SparkSubmitCommandBuilder. That's why I decided to fix it in getEffectiveConfig.

$ git grep getEffectiveConfig
launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java:  Map<String, String> getEffectiveConfig() throws IOException {
launcher/src/main/java/org/apache/spark/launcher/InProcessLauncher.java:    if (builder.isClientMode(builder.getEffectiveConfig())) {
launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java:    return builder.getEffectiveConfig().get(CHILD_PROCESS_LOGGER_NAME);
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java:    Map<String, String> config = getEffectiveConfig();
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java:      getEffectiveConfig().get(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH));

Member

Hmm, but I saw you only changed SparkSubmitCommandBuilder to explicitly pick it up? If SparkLauncher.DRIVER_EXTRA_CLASSPATH is not set at all, these other classes won't use SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH either.

Member Author

Let me double-check InProcessLauncher and SparkLauncher once more. Last time I checked, it was okay, but your concern rings a bell for me.

@pan3793
Member

pan3793 commented Feb 24, 2024

FYI, I noticed that apache/hive#4564 already cut the Jackson 1.x deps out, pending 2.3.10 release ...
cc @sunchao @wangyum

@dongjoon-hyun
Member Author

dongjoon-hyun commented Feb 24, 2024

Actually, we are the ones who asked for that, @pan3793. We are aware of it. And @sunchao and I are on the same team at Apple. 😄

@dongjoon-hyun
Member Author

dongjoon-hyun commented Feb 24, 2024

To @pan3793, for the record, we've been waiting for that, although there is no ETA for now. In addition, every dependency update carries a risk of being reverted, as you can see in the history. No matter what happens with 2.3.10 in the Hive and Spark communities, we will delete this dependency in Apache Spark 4.0.0.

[SPARK-44197][BUILD] Upgrade Hadoop to 3.3.6
[SPARK-44678][BUILD][3.5] Downgrade Hadoop to 3.3.4

@dongjoon-hyun
Member Author

Thank you so much, @viirya!
Merged to master for Apache Spark 4.0.0.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-47152 branch February 24, 2024 19:14
@bjornjorgensen
Contributor

Should we change mv $f "$DISTDIR"/hive-jackson/ to mv "$f" "$DISTDIR"/hive-jackson/ ?

@dongjoon-hyun
Member Author

Should we change mv $f "$DISTDIR"/hive-jackson/ to mv "$f" "$DISTDIR"/hive-jackson/ ?

Feel free to make a follow-up, @bjornjorgensen.

ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024