[SPARK-47152][SQL][BUILD] Provide CodeHaus Jackson dependencies via a new optional directory
#45237
Conversation
…a new optional directory
cc @viirya
```
/** Configuration key for the driver default extra class path. */
public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH =
  "spark.driver.defaultExtraClassPath";
public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE = "hive-jackson/*";
```
So even users who don't use any Hive distribution will also have this default class path value in their extra class path, right?
Yes, in that case the configuration simply points to a non-existent directory. As you can see in `make-distribution.sh` and the PR description, the `hive-jackson` directory is created only when `jackson-*-asl-*.jar` files exist.
```
# Only create the hive-jackson directory if they exist.
for f in "$DISTDIR"/jars/jackson-*-asl-*.jar; do
  mkdir -p "$DISTDIR"/hive-jackson
  mv $f "$DISTDIR"/hive-jackson/
done
```
Btw, what's the benefit of having a separate class path for the hive-jackson jars?
There are 5 main benefits, similar to the `yarn` directory, @viirya.
- The following Apache Spark daemons (which use `spark-daemon.sh start`) will ignore the `hive-jackson` directory:
  - Spark Master
  - Spark Worker
  - Spark History Server

  ```
  $ grep 'spark-daemon.sh start' *
  start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
  start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
  start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
  ```
- Recoverability: The AS-IS Spark 3 users can achieve the same goal if they delete those two files from Spark's `jars` directory manually. However, it's difficult to recover the deleted files when a production job fails due to a Hive UDF. This PR provides a more robust and safe way with a configuration.
- Communication: We (and the security team) can easily communicate that `hive-jackson` is not used, like the `yarn` directory, because it's physically split from the rest of the distribution. Also, they can delete the directory easily (if they need to) without knowing the details of the dependency list inside that directory.
- Robustness: If Apache Spark has everything in `jars`, it's difficult to prevent those jars from loading. Of course, we may choose a tricky way to filter them out of the class file list via a naming pattern, but that is still less robust from a long-term perspective.
- Compatibility with `hive-jackson-provided`: With the existing `hive-jackson-provided` profile, this PR provides a cleaner injection point for the provided dependencies. For example, custom-built Jackson dependencies can be placed in `hive-jackson` (after the user creates it) instead of `jars`. We are very reluctant to have someone put their custom jar files directly into Apache Spark's `jars` directory; the `hive-jackson` directory is a more recommendable place than copying into Spark's `jars` directory.
```
case DRIVER_DEFAULT_CLASS_PATH ->
  conf.put(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH, value);
```
Hmm, why can spark-submit only specify the driver default extra class path? Can't it specify the executor default extra class path as well?
No~ This is `--driver-class-path`, which is used as a special case because the driver JVM has already been started.
We already have the more general `spark.driver.extraClassPath` and `spark.executor.extraClassPath` for the driver and executors. This PR extends those with the following:
- `spark.driver.defaultExtraClassPath`
- `spark.executor.defaultExtraClassPath`
Please see here for the details.
BTW, is your concern that the following is insufficient?
```
private[spark] val EXECUTOR_CLASS_PATH =
  ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_CLASSPATH)
    .withPrepended(EXECUTOR_DEFAULT_EXTRA_CLASS_PATH.key, File.pathSeparator)
```
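For reference, `withPrepended` places the default config's value in front of any user-supplied value, joined with the separator. The sketch below only illustrates that composition under that assumption; it is not the actual ConfigBuilder code, and the sample values are made up.

```
import java.io.File;

public class PrependSketch {
  // Rough shape of the effective executor class path: the default
  // ("hive-jackson/*") is prepended to whatever the user set in
  // spark.executor.extraClassPath, separated by File.pathSeparator.
  static String effectiveClassPath(String defaultExtra, String userExtra) {
    if (defaultExtra == null || defaultExtra.isEmpty()) return userExtra;
    if (userExtra == null || userExtra.isEmpty()) return defaultExtra;
    return defaultExtra + File.pathSeparator + userExtra;
  }

  public static void main(String[] args) {
    // Prints "hive-jackson/*:/opt/extra/my.jar" on Unix-like systems.
    System.out.println(effectiveClassPath("hive-jackson/*", "/opt/extra/my.jar"));
  }
}
```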
Let me double-check it.
To @viirya,
- In general, `bin/spark-submit ... -c spark.driver.defaultExtraClassPath="" -c spark.executor.defaultExtraClassPath=""` is the way to launch without the hive-jackson dependency, because `hive-jackson/*` is provided by those configurations. This always works for drivers in `cluster` mode submissions and for executors in both `client` and `cluster` submission modes.
- Apache Spark already provides `--driver-class-path`, which is used only for the cases where we cannot use `spark.driver.extraClassPath`. In those cases, `spark.driver.defaultExtraClassPath` is also not applicable. So, this PR adds `--driver-default-class-path ""` in the same way.
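For programmatic launches through the launcher API, the same defaults can be cleared with `setConf`. This is only a minimal sketch: the app resource, main class, and master below are placeholders, and the executor-side key is written as a literal string since only the driver-side constant appears in this PR's diff.

```
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LaunchWithoutHiveJackson {
  public static void main(String[] args) throws Exception {
    SparkAppHandle handle = new SparkLauncher()
        .setAppResource("/path/to/app.jar")   // placeholder
        .setMainClass("com.example.MyApp")    // placeholder
        .setMaster("local[*]")
        // Clear the defaults so hive-jackson/* is not added to either class path.
        .setConf(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH, "")
        .setConf("spark.executor.defaultExtraClassPath", "")
        .startApplication();
    System.out.println("State: " + handle.getState());
  }
}
```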
This is identical to how the existing Apache Spark `spark.*.extraClassPath` configurations work.
Yea, `-c spark.driver.defaultExtraClassPath=""` should be the same as `--driver-default-class-path ""`, so I wonder why we especially need the latter? And what's the case where we cannot specify it through `-c spark.driver.defaultExtraClassPath`?
It's probably okay, as we already have `--driver-class-path` and this addition follows it. Just a question, in case you know the exact reason.
`--driver-class-path` is required because Spark's driver JVM is started before Spark's property files or config files are loaded.
We cannot create the SparkConf class if we cannot load Spark's jar files. This is a chicken-and-egg situation. To solve it, we blindly load `jars/*`. So, if we have other jars, we need to use `--driver-class-path`.
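To illustrate the ordering problem: the launcher has to assemble the JVM command line, including `-cp`, before any Spark class such as `SparkConf` can be loaded, so every class path entry must be known up front. The sketch below is a simplification under that assumption, not the actual AbstractCommandBuilder code; all paths are placeholders.

```
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class DriverCommandSketch {
  public static void main(String[] args) {
    String sparkHome = System.getenv().getOrDefault("SPARK_HOME", "/opt/spark"); // placeholder fallback
    List<String> cp = new ArrayList<>();
    cp.add(sparkHome + "/jars/*");   // always loaded "blindly" before SparkConf exists
    cp.add("hive-jackson/*");        // spark.driver.defaultExtraClassPath default from this PR
    String driverClassPath = "";     // would come from --driver-class-path, if given
    if (!driverClassPath.isEmpty()) {
      cp.add(driverClassPath);
    }

    // The java command is built from these entries before any Spark code runs.
    List<String> cmd = new ArrayList<>();
    cmd.add("java");
    cmd.add("-cp");
    cmd.add(String.join(File.pathSeparator, cp));
    cmd.add("org.apache.spark.deploy.SparkSubmit");
    System.out.println(String.join(" ", cmd));
  }
}
```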
```
effectiveConfig.putIfAbsent(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH,
  SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE);
```
Classes not triggered through SparkLauncher (e.g., the Spark daemons you mentioned) are not affected, right?
And, same question, why is only the driver default extra class path specified here?
Yes, the following daemons started via `spark-daemon.sh start` are not affected, @viirya.
```
$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
```
> And, same question, why is only the driver default extra class path specified here?

This is a command on the driver side. The executors are supposed to use `spark.executor.extraClassPath` and `spark.executor.defaultExtraClassPath`.
Hmm, why do we need to deal with SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH specially here? I think Spark configs should be handled together above:

```
p.stringPropertyNames().forEach(key ->
  effectiveConfig.computeIfAbsent(key, p::getProperty));
```
This is required because, when the configuration is loaded from the properties file, `p.stringPropertyNames()` doesn't contain SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH, so the `.forEach` is never invoked for that key.

```
Properties p = loadPropertiesFile();
p.stringPropertyNames().forEach(key ->
  effectiveConfig.computeIfAbsent(key, p::getProperty));
```
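A small, self-contained illustration of that behavior (a sketch only: the keys are the real config names, but the properties content is made up and only plain JDK APIs are used):

```
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class PutIfAbsentSketch {
  public static void main(String[] args) {
    // Simulate a spark-defaults.conf that does not mention the new config.
    Properties p = new Properties();
    p.setProperty("spark.driver.extraClassPath", "/opt/extra/my.jar");

    Map<String, String> effectiveConfig = new HashMap<>();
    // Copies only the keys that are actually present in the file...
    p.stringPropertyNames().forEach(key ->
        effectiveConfig.computeIfAbsent(key, p::getProperty));

    // ...so the default has to be injected explicitly afterwards.
    effectiveConfig.putIfAbsent("spark.driver.defaultExtraClassPath", "hive-jackson/*");

    // Prints both entries, e.g.
    // {spark.driver.defaultExtraClassPath=hive-jackson/*, spark.driver.extraClassPath=/opt/extra/my.jar}
    System.out.println(effectiveConfig);
  }
}
```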
Yes, here.
When `getEffectiveConfig` is used like this in SparkSubmitCommandBuilder.java, `config` is a Java Map. So, `String defaultExtraClassPath = config.get(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH);` becomes null if we don't put the default there.

```
Map<String, String> config = getEffectiveConfig();
boolean isClientMode = isClientMode(config);
String extraClassPath = isClientMode ? config.get(SparkLauncher.DRIVER_EXTRA_CLASSPATH) : null;
String defaultExtraClassPath = config.get(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH);
if (extraClassPath == null || extraClassPath.trim().isEmpty()) {
```
I see. Although this is only used by SparkSubmitCommandBuilder, not SparkClassCommandBuilder, so I wonder if we should put it directly in SparkSubmitCommandBuilder.
`getEffectiveConfig` is shared by other classes, not only SparkSubmitCommandBuilder. That's why I decided to fix it in `getEffectiveConfig`.

```
$ git grep getEffectiveConfig
launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java:  Map<String, String> getEffectiveConfig() throws IOException {
launcher/src/main/java/org/apache/spark/launcher/InProcessLauncher.java:    if (builder.isClientMode(builder.getEffectiveConfig())) {
launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java:      return builder.getEffectiveConfig().get(CHILD_PROCESS_LOGGER_NAME);
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java:    Map<String, String> config = getEffectiveConfig();
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java:      getEffectiveConfig().get(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH));
```
Hmm, but I saw that you only changed SparkSubmitCommandBuilder to explicitly pick it up? If SparkLauncher.DRIVER_EXTRA_CLASSPATH is not set at all, those other classes won't use SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH either.
Let me double-check InProcessLauncher and SparkLauncher once more. Last time I checked it was okay, but your concern rings a bell for me.
FYI, I noticed that apache/hive#4564 already cut the Jackson 1.x deps out, pending the 2.3.10 release ...

To @pan3793, for the record, we've been waiting for that, although there is no ETA for now. In addition, every dependency update has a risk of being reverted, as you can see in the history. No matter what happens with …

Thank you so much, @viirya !

Should we change …

Feel free to make a followup, @bjornjorgensen .
Closes apache#45237 from dongjoon-hyun/SPARK-47152.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR aims to provide `Apache Hive`'s `CodeHaus Jackson` dependencies via a new optional directory, `hive-jackson`, instead of the standard `jars` directory of the Apache Spark binary distribution. Additionally, two internal configurations are added whose default values are `hive-jackson/*`.

- `spark.driver.defaultExtraClassPath`
- `spark.executor.defaultExtraClassPath`

For example, Apache Spark distributions have been providing the `spark-*-yarn-shuffle.jar` file under the `yarn` directory instead of `jars`.

**YARN SHUFFLE EXAMPLE**
```
$ ls -al yarn/*jar
-rw-r--r--  1 dongjoon  staff  77352048 Sep  8 19:08 yarn/spark-3.5.0-yarn-shuffle.jar
```

This PR changes `Apache Hive`'s `CodeHaus Jackson` dependencies in a similar way.

**BEFORE**
```
$ ls -al jars/*asl*
-rw-r--r--  1 dongjoon  staff  232248 Sep  8 19:08 jars/jackson-core-asl-1.9.13.jar
-rw-r--r--  1 dongjoon  staff  780664 Sep  8 19:08 jars/jackson-mapper-asl-1.9.13.jar
```

**AFTER**
```
$ ls -al jars/*asl*
zsh: no matches found: jars/*asl*

$ ls -al hive-jackson
total 1984
drwxr-xr-x   4 dongjoon  staff     128 Feb 23 15:37 .
drwxr-xr-x  16 dongjoon  staff     512 Feb 23 16:34 ..
-rw-r--r--   1 dongjoon  staff  232248 Feb 23 15:37 jackson-core-asl-1.9.13.jar
-rw-r--r--   1 dongjoon  staff  780664 Feb 23 15:37 jackson-mapper-asl-1.9.13.jar
```

### Why are the changes needed?

Since Apache Hadoop 3.3.5, only Apache Hive requires the old CodeHaus Jackson dependencies.

Apache Spark 3.5.0 tried to eliminate them completely, but the change was reverted due to Hive UDF support.

- #40893
- #42446

SPARK-47119 added a way to exclude the Apache Hive Jackson dependencies at the distribution-building stage for Apache Spark 4.0.0.

- `hive-jackson-provided` profile: #45201

This PR provides a way to exclude the Apache Hive Jackson dependencies at runtime for Apache Spark 4.0.0.

- Spark Shell without Apache Hive Jackson dependencies:
  ```
  $ bin/spark-shell --driver-default-class-path ""
  ```
- Spark SQL Shell without Apache Hive Jackson dependencies:
  ```
  $ bin/spark-sql --driver-default-class-path ""
  ```
- Spark Thrift Server without Apache Hive Jackson dependencies:
  ```
  $ sbin/start-thriftserver.sh --driver-default-class-path ""
  ```

In addition, last but not least, this PR eliminates the `CodeHaus Jackson` dependencies from the following Apache Spark daemons (started via `spark-daemon.sh start`) because they don't require Hive's `CodeHaus Jackson` dependencies.

- Spark Master
- Spark Worker
- Spark History Server

```
$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
```

### Does this PR introduce _any_ user-facing change?

No. There is no user-facing change by default.

- For the distributions with the `hive-jackson-provided` profile, the `scope` of the Apache Hive Jackson dependencies is `provided` and the `hive-jackson` directory is not created at all.
- For the distributions with the default setting, the `scope` of the Apache Hive Jackson dependencies is still `compile`. In addition, they remain on Apache Spark's built-in class path.
- The following Spark daemons don't use the `CodeHaus Jackson` dependencies:
  - Spark Master
  - Spark Worker
  - Spark History Server

### How was this patch tested?

Pass the CIs, and manually build a distribution and check the class paths in the `Environment` tab.

```
$ dev/make-distribution.sh -Phive,hive-thriftserver
```

### Was this patch authored or co-authored using generative AI tooling?

No.