@dongjoon-hyun dongjoon-hyun commented Feb 21, 2024

What changes were proposed in this pull request?

This PR aims to provide a new profile, hive-jackson-provided, for Apache Spark 4.0.0.

Why are the changes needed?

Since Apache Hadoop 3.3.5, only Apache Hive requires old CodeHaus Jackson dependencies.

Apache Spark 3.5.0 tried to eliminate them completely, but the change was reverted because of Hive UDF support.

This profile allows Apache Spark 4.0 users:

  • To provide their own CodeHaus Jackson libraries
  • To exclude them completely if they don't use Hive UDFs.

Does this PR introduce any user-facing change?

No, this is a new profile.

How was this patch tested?

Pass the CIs and manual build.

Without hive-jackson-provided

$ dev/make-distribution.sh -Phive,hive-thriftserver
$ ls -al dist/jars/*asl*
-rw-r--r--  1 dongjoon  staff  232248 Feb 21 10:53 dist.org/jars/jackson-core-asl-1.9.13.jar
-rw-r--r--  1 dongjoon  staff  780664 Feb 21 10:53 dist.org/jars/jackson-mapper-asl-1.9.13.jar

With hive-jackson-provided

$ dev/make-distribution.sh -Phive,hive-thriftserver,hive-jackson-provided
$ ls -al dist/jars/*asl*
zsh: no matches found: dist/jars/*asl*

$ ls -al dist/jars/*hive*
-rw-r--r--  1 dongjoon  staff    183633 Feb 21 11:00 dist/jars/hive-beeline-2.3.9.jar
-rw-r--r--  1 dongjoon  staff     44704 Feb 21 11:00 dist/jars/hive-cli-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    436169 Feb 21 11:00 dist/jars/hive-common-2.3.9.jar
-rw-r--r--  1 dongjoon  staff  10840949 Feb 21 11:00 dist/jars/hive-exec-2.3.9-core.jar
-rw-r--r--  1 dongjoon  staff    116364 Feb 21 11:00 dist/jars/hive-jdbc-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    326585 Feb 21 11:00 dist/jars/hive-llap-common-2.3.9.jar
-rw-r--r--  1 dongjoon  staff   8195966 Feb 21 11:00 dist/jars/hive-metastore-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    916630 Feb 21 11:00 dist/jars/hive-serde-2.3.9.jar
-rw-r--r--  1 dongjoon  staff   1679366 Feb 21 11:00 dist/jars/hive-service-rpc-3.1.3.jar
-rw-r--r--  1 dongjoon  staff     53902 Feb 21 11:00 dist/jars/hive-shims-0.23-2.3.9.jar
-rw-r--r--  1 dongjoon  staff      8786 Feb 21 11:00 dist/jars/hive-shims-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    120293 Feb 21 11:00 dist/jars/hive-shims-common-2.3.9.jar
-rw-r--r--  1 dongjoon  staff     12923 Feb 21 11:00 dist/jars/hive-shims-scheduler-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    258346 Feb 21 11:00 dist/jars/hive-storage-api-2.8.1.jar
-rw-r--r--  1 dongjoon  staff    581739 Feb 21 11:00 dist/jars/spark-hive-thriftserver_2.13-4.0.0-SNAPSHOT.jar
-rw-r--r--  1 dongjoon  staff    687446 Feb 21 11:00 dist/jars/spark-hive_2.13-4.0.0-SNAPSHOT.jar
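
The manual check above can also be scripted. The following is a minimal sketch, assuming a distribution laid out as `dist/jars`; the function names `list_codehaus_jackson` and `check_distribution` are hypothetical and not part of Spark. It uses `find` rather than a shell glob, so it prints nothing when no jar is present instead of failing with `no matches found` as zsh does above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark): list CodeHaus Jackson 1.x jars
# (jackson-*-asl-*.jar) directly under the given directory, sorted by name.
list_codehaus_jackson() {
  find "$1" -maxdepth 1 -type f -name 'jackson-*-asl-*.jar' | sort
}

# Report whether the given jars directory still bundles those jars.
check_distribution() {
  found=$(list_codehaus_jackson "$1")
  if [ -n "$found" ]; then
    echo "CodeHaus Jackson jars present:"
    echo "$found"
  else
    echo "No CodeHaus Jackson jars found in $1"
  fi
}

# Usage (assumes a distribution was built into ./dist):
#   check_distribution dist/jars
```

With the default profiles the helper would be expected to list the two `asl` jars; with `hive-jackson-provided` it would be expected to report that none were found.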

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the BUILD label Feb 21, 2024
@dongjoon-hyun
Member Author

Could you review this PR, @viirya ?

@xinrong-meng
Member

I may not have enough context. The existing hive-provided seems to be changed to expect Jackson dependencies to be present at runtime. Is that expected?

@dongjoon-hyun
Member Author

Thank you for the review. This PR is for users who keep Hive and exclude only CodeHaus Jackson, for example, a user who runs Spark Thrift Server without Hive UDFs, @xinrong-meng .

I may not have enough context. The existing hive-provided seems to be changed to expect Jackson dependencies to be present at runtime. Is that expected?

<id>hive-provided</id>
<properties>
<hive.deps.scope>provided</hive.deps.scope>
<hive.jackson.scope>provided</hive.jackson.scope>
Member

Hmm, both in hive-provided and hive-jackson-provided profiles, the config value is provided. What's the difference between them?

Member Author

Previously, these dependencies followed hive.deps.scope, as shown below, so we need to preserve the existing behavior.

- <scope>${hive.deps.scope}</scope>
+ <scope>${hive.jackson.scope}</scope>

Member

Oh, hive-jackson-provided has hive in compile scope and only jackson in provided scope.

Member

The name hive-jackson-provided makes me think both are provided.

Member Author

Yes!
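
To summarize the thread, here is a toy model of how the two scope properties resolve for a given `-P` profile list. It is only a sketch of the behavior discussed above, not Maven's actual property resolution; the function name `resolve_scopes` is hypothetical, and the defaults assume both properties start as `compile`.

```shell
#!/bin/sh
# Toy model (not Maven itself): resolve the two scope properties from a
# comma-separated profile list, following the discussion in this thread.
resolve_scopes() {
  hive_deps_scope=compile      # assumed default scope of the Hive dependencies
  hive_jackson_scope=compile   # assumed default scope of CodeHaus Jackson
  old_ifs=$IFS
  IFS=,
  for profile in $1; do
    case "$profile" in
      hive-provided)
        # Hive and its Jackson dependencies are both expected at runtime.
        hive_deps_scope=provided
        hive_jackson_scope=provided
        ;;
      hive-jackson-provided)
        # Hive stays bundled; only CodeHaus Jackson becomes provided.
        hive_jackson_scope=provided
        ;;
    esac
  done
  IFS=$old_ifs
  echo "hive.deps.scope=$hive_deps_scope hive.jackson.scope=$hive_jackson_scope"
}

# Usage:
#   resolve_scopes "hive,hive-thriftserver,hive-jackson-provided"
```

Keeping a separate `hive.jackson.scope` property is what lets `hive-provided` preserve its previous behavior while the new profile changes only the Jackson scope.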

@dongjoon-hyun
Member Author

To @xinrong-meng and @viirya , I updated the PR description. You can see that the Hive jars are there and only the CodeHaus Jackson jars are gone.

@dongjoon-hyun
Member Author

Thank you, @viirya !

@dongjoon-hyun
Member Author

Merged to master for Apache Spark 4.0.0.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-47119 branch February 21, 2024 20:36
@xinrong-meng
Member

Makes sense, thank you @dongjoon-hyun !

@dongjoon-hyun
Member Author

Thank you, @xinrong-meng .

dongjoon-hyun added a commit that referenced this pull request Feb 24, 2024
… a new optional directory

### What changes were proposed in this pull request?

This PR aims to provide `Apache Hive`'s `CodeHaus Jackson` dependencies via a new optional directory, `hive-jackson`, instead of the standard `jars` directory of Apache Spark binary distribution. Additionally, two internal configurations are added whose default values are `hive-jackson/*`.

  - `spark.driver.defaultExtraClassPath`
  - `spark.executor.defaultExtraClassPath`

As a precedent, Apache Spark distributions already ship the `spark-*-yarn-shuffle.jar` file under the `yarn` directory instead of `jars`.

**YARN SHUFFLE EXAMPLE**
```
$ ls -al yarn/*jar
-rw-r--r--  1 dongjoon  staff  77352048 Sep  8 19:08 yarn/spark-3.5.0-yarn-shuffle.jar
```

This PR changes `Apache Hive`'s `CodeHaus Jackson` dependencies in a similar way.

**BEFORE**
```
$ ls -al jars/*asl*
-rw-r--r--  1 dongjoon  staff  232248 Sep  8 19:08 jars/jackson-core-asl-1.9.13.jar
-rw-r--r--  1 dongjoon  staff  780664 Sep  8 19:08 jars/jackson-mapper-asl-1.9.13.jar
```

**AFTER**
```
$ ls -al jars/*asl*
zsh: no matches found: jars/*asl*

$ ls -al hive-jackson
total 1984
drwxr-xr-x   4 dongjoon  staff     128 Feb 23 15:37 .
drwxr-xr-x  16 dongjoon  staff     512 Feb 23 16:34 ..
-rw-r--r--   1 dongjoon  staff  232248 Feb 23 15:37 jackson-core-asl-1.9.13.jar
-rw-r--r--   1 dongjoon  staff  780664 Feb 23 15:37 jackson-mapper-asl-1.9.13.jar
```

### Why are the changes needed?

Since Apache Hadoop 3.3.5, only Apache Hive requires old CodeHaus Jackson dependencies.

Apache Spark 3.5.0 tried to eliminate them completely, but the change was reverted because of Hive UDF support.

  - #40893
  - #42446

SPARK-47119 added a way to exclude Apache Hive Jackson dependencies at the distribution building stage for Apache Spark 4.0.0.

  - #45201

This PR provides a way to exclude Apache Hive Jackson dependencies at runtime for Apache Spark 4.0.0.

- Spark Shell without Apache Hive Jackson dependencies.
```
$ bin/spark-shell --driver-default-class-path ""
```

- Spark SQL Shell without Apache Hive Jackson dependencies.
```
$ bin/spark-sql --driver-default-class-path ""
```

- Spark Thrift Server without Apache Hive Jackson dependencies.
```
$ sbin/start-thriftserver.sh --driver-default-class-path ""
```
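
The effect of the default extra class path can be illustrated with a small model. This is a sketch under stated assumptions, not Spark's launcher logic: the function name `build_class_path` is hypothetical, and both the ordering of entries and the resolution of `hive-jackson/*` against the distribution root are assumptions made for illustration.

```shell
#!/bin/sh
# Illustrative only: this is NOT Spark's launcher code. It models a built-in
# class path plus an optional default extra entry.
#   $1: distribution root (assumption: entries are resolved against it)
#   $2: default extra class path; an empty value means "no extra entry"
build_class_path() {
  class_path="$1/jars/*"
  if [ -n "$2" ]; then
    class_path="$class_path:$1/$2"
  fi
  # Quoted so the shell does not expand the wildcard entries.
  echo "$class_path"
}

# Usage:
#   build_class_path /opt/spark 'hive-jackson/*'   # default value
#   build_class_path /opt/spark ''                 # models an empty override
```

In this model, passing an empty value plays the role of `--driver-default-class-path ""` in the commands above: the `hive-jackson` entry simply never reaches the class path.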

In addition, last but not least, this PR eliminates `CodeHaus Jackson` dependencies from the following Apache Spark daemons (started via `spark-daemon.sh start`) because they don't require Hive's `CodeHaus Jackson` dependencies:

- Spark Master
- Spark Worker
- Spark History Server

```
$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
```

### Does this PR introduce _any_ user-facing change?

No. There is no user-facing change by default.

- For distributions built with the `hive-jackson-provided` profile, the `scope` of the Apache Hive Jackson dependencies is `provided` and the `hive-jackson` directory is not created at all.
- For distributions built with the default settings, the `scope` of the Apache Hive Jackson dependencies is still `compile`. In addition, they are on Apache Spark's built-in class path, as shown below.

![Screenshot 2024-02-23 at 16 48 08](https://github.com/apache/spark/assets/9700541/99ed0f02-2792-4666-ae19-ce4f4b7b8ff9)

- The following Spark daemons don't use `CodeHaus Jackson` dependencies.
  - Spark Master
  - Spark Worker
  - Spark History Server

### How was this patch tested?

Pass the CIs, then manually build a distribution and check the class paths in the `Environment` tab.

```
$ dev/make-distribution.sh -Phive,hive-thriftserver
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45237 from dongjoon-hyun/SPARK-47152.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024
@pan3793
Member

pan3793 commented May 10, 2024

@dongjoon-hyun Jackson 1.x can be removed after SPARK-47018 (bump Hive 2.3.10), what should we do for hive-jackson-provided?

@dongjoon-hyun
Member Author

It's supposed to be here as a last resort until we release Apache Spark 4.0.0 successfully without reverting Hive 2.3.10, @pan3793 .

szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Sep 24, 2024