
Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Feb 24, 2024

What changes were proposed in this pull request?

This PR aims to provide Apache Hive's CodeHaus Jackson dependencies via a new optional directory, hive-jackson, instead of the standard jars directory of the Apache Spark binary distribution. Additionally, two internal configurations are added whose default values are hive-jackson/*.

  • spark.driver.defaultExtraClassPath
  • spark.executor.defaultExtraClassPath

For example, Apache Spark distributions have been providing the spark-*-yarn-shuffle.jar file under the yarn directory instead of jars.

YARN SHUFFLE EXAMPLE

$ ls -al yarn/*jar
-rw-r--r--  1 dongjoon  staff  77352048 Sep  8 19:08 yarn/spark-3.5.0-yarn-shuffle.jar

This PR changes Apache Hive's CodeHaus Jackson dependencies in a similar way.

BEFORE

$ ls -al jars/*asl*
-rw-r--r--  1 dongjoon  staff  232248 Sep  8 19:08 jars/jackson-core-asl-1.9.13.jar
-rw-r--r--  1 dongjoon  staff  780664 Sep  8 19:08 jars/jackson-mapper-asl-1.9.13.jar

AFTER

$ ls -al jars/*asl*
zsh: no matches found: jars/*asl*

$ ls -al hive-jackson
total 1984
drwxr-xr-x   4 dongjoon  staff     128 Feb 23 15:37 .
drwxr-xr-x  16 dongjoon  staff     512 Feb 23 16:34 ..
-rw-r--r--   1 dongjoon  staff  232248 Feb 23 15:37 jackson-core-asl-1.9.13.jar
-rw-r--r--   1 dongjoon  staff  780664 Feb 23 15:37 jackson-mapper-asl-1.9.13.jar

Why are the changes needed?

Since Apache Hadoop 3.3.5, only Apache Hive requires the old CodeHaus Jackson dependencies.

Apache Spark 3.5.0 tried to eliminate them completely, but that was reverted due to Hive UDF support (apache#40893, apache#42446).

SPARK-47119 (apache#45201) added a way to exclude the Apache Hive Jackson dependencies at the distribution-building stage for Apache Spark 4.0.0.

This PR provides a way to exclude the Apache Hive Jackson dependencies at runtime for Apache Spark 4.0.0.

  • Spark Shell without Apache Hive Jackson dependencies.
$ bin/spark-shell --driver-default-class-path ""
  • Spark SQL Shell without Apache Hive Jackson dependencies.
$ bin/spark-sql --driver-default-class-path ""
  • Spark Thrift Server without Apache Hive Jackson dependencies.
$ sbin/start-thriftserver.sh --driver-default-class-path ""
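
The same exclusion can presumably also be done purely via configurations at submit time (the application class and jar below are hypothetical placeholders):

$ bin/spark-submit \
    -c spark.driver.defaultExtraClassPath="" \
    -c spark.executor.defaultExtraClassPath="" \
    --class org.example.MyApp my-app.jar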

In addition, last but not least, this PR eliminates the CodeHaus Jackson dependencies from the following Apache Spark daemons (started via spark-daemon.sh start) because they don't require Hive's CodeHaus Jackson dependencies:

  • Spark Master
  • Spark Worker
  • Spark History Server
$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \

Does this PR introduce any user-facing change?

No. There is no user-facing change by default.

  • For distributions built with the hive-jackson-provided profile, the scope of the Apache Hive Jackson dependencies is provided and the hive-jackson directory is not created at all.
  • For distributions with the default settings, the scope of the Apache Hive Jackson dependencies is still compile. In addition, they are on Apache Spark's built-in class path, as shown in the following screenshot.

Screenshot 2024-02-23 at 16 48 08 (the built-in class path, including hive-jackson/*, in the Environment tab)

  • The following Spark daemons don't use the CodeHaus Jackson dependencies.
    • Spark Master
    • Spark Worker
    • Spark History Server

How was this patch tested?

Pass the CIs, manually build a distribution, and check the class paths in the Environment tab.

$ dev/make-distribution.sh -Phive,hive-thriftserver
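
A distribution built with the hive-jackson-provided profile could presumably be checked the same way (the dist output path and the ls output below are illustrative assumptions):

$ dev/make-distribution.sh -Phive,hive-thriftserver,hive-jackson-provided
$ ls dist/hive-jackson
ls: dist/hive-jackson: No such file or directory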

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun
Member Author

cc @viirya

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-47152][SQL][BUILD] Provide Apache Hive Jackson dependency via a new optional directory [SPARK-47152][SQL][BUILD] Provide CodeHaus Jackson dependencies via a new optional directory Feb 24, 2024
Comment on lines +57 to +60
/** Configuration key for the driver default extra class path. */
public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH =
    "spark.driver.defaultExtraClassPath";
public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE = "hive-jackson/*";
Member

So even users who use non-Hive distributions will also have this default class path value in the extra class path, right?

Member Author

Yes, the configuration simply points to a non-existent directory. As you can see in make-distribution.sh and the PR description, the hive-jackson directory is created only when jackson-*-asl-*.jar files exist.
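
A quick hedged sanity check on such a distribution (the output below is illustrative): the directory is simply absent, and a class-path entry that points at a non-existent directory is silently ignored by the JVM.

$ ls hive-jackson
ls: hive-jackson: No such file or directory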

Comment on lines +192 to +196
# Only create the hive-jackson directory if they exist.
for f in "$DISTDIR"/jars/jackson-*-asl-*.jar; do
mkdir -p "$DISTDIR"/hive-jackson
mv $f "$DISTDIR"/hive-jackson/
done
Member

Btw, what's the benefit of having a separate class path for the hive-jackson jars?

Member Author
@dongjoon-hyun dongjoon-hyun Feb 24, 2024

There are 5 main benefits, similar to the yarn directory, @viirya.

  1. The following Apache Spark daemons (which use sbin/spark-daemon.sh start) will ignore the hive-jackson directory.
    • Spark Master
    • Spark Worker
    • Spark History Server
$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
  2. Recoverability: As-is, Spark 3 users can achieve the same goal by manually deleting those two files from Spark's jars directory. However, it's difficult to recover the deleted files when a production job fails due to a Hive UDF. This PR provides a more robust and safer way via a configuration.

  3. Communication: We (and the security team) can easily communicate that hive-jackson is not used, like the yarn directory, because it's physically separated in the distribution. They can also delete the directory easily (if they need to) without knowing the details of the dependencies inside it.

  4. Robustness: If Apache Spark has everything in jars, it's difficult to prevent those jars from loading. Of course, we could choose a tricky way to filter them out of the class file lists via a naming pattern, but that is still less robust from a long-term perspective.

  5. Compatibility with hive-jackson-provided: Together with the existing hive-jackson-provided profile, this PR provides a cleaner injection point for the provided dependencies. For example, custom-built Jackson dependencies can be placed in hive-jackson (after users create it) instead of jars. We are very reluctant to have someone put custom jar files directly into Apache Spark's jars directory; the hive-jackson directory is a more recommendable destination than copying into Spark's jars directory, as sketched below.
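
For example, a minimal sketch of that workflow on a hive-jackson-provided build (the jar paths and names are hypothetical):

$ mkdir -p "$SPARK_HOME"/hive-jackson
$ cp /path/to/custom/jackson-core-asl.jar "$SPARK_HOME"/hive-jackson/
$ cp /path/to/custom/jackson-mapper-asl.jar "$SPARK_HOME"/hive-jackson/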

Comment on lines +507 to +508
case DRIVER_DEFAULT_CLASS_PATH ->
    conf.put(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH, value);
Member

Hmm, why can spark-submit only specify the driver default extra class path? Can't it also specify the executor default extra class path?

Member Author

No~ This follows --driver-class-path, which is a special case because the driver JVM has already started.

We already have the more general spark.driver.extraClassPath and spark.executor.extraClassPath for the driver and executors. This PR extends them with the following:

spark.driver.defaultExtraClassPath
spark.executor.defaultExtraClassPath


Member Author

BTW, is your concern that the following is insufficient?

  private[spark] val EXECUTOR_CLASS_PATH =
    ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_CLASSPATH)
      .withPrepended(EXECUTOR_DEFAULT_EXTRA_CLASS_PATH.key, File.pathSeparator)
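
As a rough illustration of the intended prepend behavior (assuming the default is prepended with the path separator; the jar path and application below are hypothetical):

$ bin/spark-submit -c spark.executor.extraClassPath=/opt/extra.jar --class org.example.MyApp my-app.jar
# effective executor extra class path (illustrative): hive-jackson/*:/opt/extra.jar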

Member Author

Let me double-check it.

Member Author

To @viirya ,

  • In general, bin/spark-submit ... -c spark.driver.defaultExtraClassPath="" -c spark.executor.defaultExtraClassPath="" is the way to launch without the hive-jackson dependencies, because hive-jackson/* is provided by those configurations. This always works for drivers in cluster-mode submission and for executors in both client and cluster submission modes.
  • Apache Spark already provides --driver-class-path, which is used only for the cases where we cannot use spark.driver.extraClassPath. In those cases, spark.driver.defaultExtraClassPath is also not applicable. So, this PR adds --driver-default-class-path "" in the same way.

Member Author

This is identical to how the existing spark.*.extraClassPath configurations work in Apache Spark.

Member

Yea, -c spark.driver.defaultExtraClassPath="" should be the same as --driver-default-class-path "", so I wonder why we especially need it. Also, what's the case where we cannot specify it through -c spark.driver.defaultExtraClassPath?

It's probably okay since we already have --driver-class-path and this addition follows it. Just a question in case you know the exact reason.

Member Author

--driver-class-path is required because Spark's driver JVM is started before Spark's property files or config files are loaded.

Member Author

We cannot create the SparkConf class if we cannot load Spark's jar files. This is a chicken-and-egg situation. To solve it, we blindly load jars/*. So, if we have other jars, we need to use --driver-class-path.
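
For example (a hypothetical illustration with a made-up jar path), such extra driver-side jars are supplied on the command line so they are visible before any property file is read:

$ bin/spark-shell --driver-class-path /opt/extra/my-custom.jar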

Comment on lines +274 to +275
effectiveConfig.putIfAbsent(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH,
    SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE);
Member

Classes not triggered via SparkLauncher (e.g., the Spark daemons you mentioned) are not affected, right?

Member

And, same question, why is only the driver default extra class path specified here?

Member Author

Yes, the following spark-daemon.sh start usages are not affected, @viirya.

$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \

And, same question, why is only the driver default extra class path specified here?

This is a command on the driver side. The executors are supposed to use spark.executor.extraClassPath and spark.executor.defaultExtraClassPath.

Member

Hmm, why do we need to deal with SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH specially here? I think Spark configs should be handled together above:

p.stringPropertyNames().forEach(key ->
         effectiveConfig.computeIfAbsent(key, p::getProperty));

Member Author
@dongjoon-hyun dongjoon-hyun Feb 24, 2024

This is required because, when the configuration is loaded from the file, p.stringPropertyNames() doesn't contain SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH, so .forEach is not invoked at all.

Properties p = loadPropertiesFile();
      p.stringPropertyNames().

Member Author

Yes, here.

When getEffectiveConfig is used like this in SparkSubmitCommandBuilder.java, config is a Java Map.
So, String defaultExtraClassPath = config.get(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH); becomes null if the key is absent.

    Map<String, String> config = getEffectiveConfig();
    boolean isClientMode = isClientMode(config);
    String extraClassPath = isClientMode ? config.get(SparkLauncher.DRIVER_EXTRA_CLASSPATH) : null;
    String defaultExtraClassPath = config.get(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH);
    if (extraClassPath == null || extraClassPath.trim().isEmpty()) {

Member

I see. Though this is only used by SparkSubmitCommandBuilder, not SparkClassCommandBuilder, so I wonder if we should put it directly in SparkSubmitCommandBuilder.

Member Author
@dongjoon-hyun dongjoon-hyun Feb 24, 2024

getEffectiveConfig is shared by other classes, not only SparkSubmitCommandBuilder. That's why I decided to fix it in getEffectiveConfig.

$ git grep getEffectiveConfig
launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java:  Map<String, String> getEffectiveConfig() throws IOException {
launcher/src/main/java/org/apache/spark/launcher/InProcessLauncher.java:    if (builder.isClientMode(builder.getEffectiveConfig())) {
launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java:    return builder.getEffectiveConfig().get(CHILD_PROCESS_LOGGER_NAME);
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java:    Map<String, String> config = getEffectiveConfig();
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java:      getEffectiveConfig().get(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH));

Member

Hmm, but I saw you only changed SparkSubmitCommandBuilder to explicitly pick it up? If SparkLauncher.DRIVER_EXTRA_CLASSPATH is not set at all, these other classes won't use SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH either.

Member Author

Let me double-check InProcessLauncher and SparkLauncher once more. Last time I checked, it was okay, but your concern rings a bell for me.

@pan3793
Member

pan3793 commented Feb 24, 2024

FYI, I noticed that apache/hive#4564 already cut the Jackson 1.x deps out, pending 2.3.10 release ...
cc @sunchao @wangyum

@dongjoon-hyun
Member Author

dongjoon-hyun commented Feb 24, 2024

Actually, we are the ones who asked for that, @pan3793. We are aware of it. And @sunchao and I are on the same team at Apple. 😄

@dongjoon-hyun
Member Author

dongjoon-hyun commented Feb 24, 2024

To @pan3793, for the record, we've been waiting for that, although there is no ETA for now. In addition, every dependency update carries a risk of being reverted, as you can see in the history. No matter what happens with 2.3.10 in the Hive and Spark communities, we will delete this dependency in Apache Spark 4.0.0.

[SPARK-44197][BUILD] Upgrade Hadoop to 3.3.6
[SPARK-44678][BUILD][3.5] Downgrade Hadoop to 3.3.4

@dongjoon-hyun
Member Author

Thank you so much, @viirya!
Merged to master for Apache Spark 4.0.0.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-47152 branch February 24, 2024 19:14
@bjornjorgensen
Contributor

Should we change mv $f "$DISTDIR"/hive-jackson/ to mv "$f" "$DISTDIR"/hive-jackson/ ?

@dongjoon-hyun
Member Author

Should we change mv $f "$DISTDIR"/hive-jackson/ to mv "$f" "$DISTDIR"/hive-jackson/ ?

Feel free to make a follow-up, @bjornjorgensen.

ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024