
Conversation


@pan3793 pan3793 commented May 10, 2024

What changes were proposed in this pull request?

The CodeHaus Jackson dependencies were pulled in from Hive; in apache/hive#4564 (Hive 2.3.10), Hive migrated to Jackson 2.x, so we can now remove them from Spark.

Why are the changes needed?

Remove unused and vulnerable dependencies.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass GA.

Was this patch authored or co-authored using generative AI tooling?

No.


pan3793 commented May 10, 2024

cc @dongjoon-hyun and @wangyum

<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-core-asl</artifactId>
<version>${codehaus.jackson.version}</version>
<scope>${hive.jackson.scope}</scope>

@wangyum wangyum May 10, 2024


Can we also remove <hive.jackson.scope>compile</hive.jackson.scope>?

spark/pom.xml, line 270 in 44f00cc:

<hive.jackson.scope>compile</hive.jackson.scope>

<hive.jackson.scope>provided</hive.jackson.scope>

https://github.com/apache/spark/blob/master/assembly/pom.xml#L272-L277


@pan3793 pan3793 May 10, 2024


If we identify issues with Hive 2.3.10 before the 4.0.0 release, we may need to revert this patch and fall back to the SPARK-47119 approach to mitigate the CodeHaus Jackson dependency vulnerabilities; see comments at
#45201 (comment)

Member


ok


@dongjoon-hyun dongjoon-hyun May 10, 2024


Ya, sorry for making things difficult, @pan3793 and @wangyum .

If we are sure, we can definitely clean things up more easily later.

Member Author


removed

dongjoon-hyun previously approved these changes May 10, 2024

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM (Pending CIs)


dongjoon-hyun commented May 10, 2024

I have one separate question (or request), @wangyum and @pan3793 .

When we use the isolated class loader to run an old Hive client against an old-version HMS server, together with Hive UDFs (which require Jackson), will everything work without any issue? Do you think we can have test coverage for that?


pan3793 commented May 10, 2024

I think this case is already covered by CI.

When the IsolatedClassLoader is enabled, Hive.get should pull and load the CodeHaus Jackson classes through the IsolatedClassLoader.

  /**
   * Initialize Hive through Configuration.
   * First try to use getWithoutRegisterFns to initialize to avoid loading all functions,
   * if there is no such method, fallback to Hive.get.
   */
  def getHive(conf: Configuration): Hive = {
    val hiveConf = conf match {
      case hiveConf: HiveConf =>
        hiveConf
      case _ =>
        new HiveConf(conf, classOf[HiveConf])
    }
    try {
      Hive.getWithoutRegisterFns(hiveConf)
    } catch {
      // SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.9 has it but
      // 2.3.8 doesn't), therefore here we fall back when encountering the exception.
      case _: NoSuchMethodError =>
        Hive.get(hiveConf)
    }
  }

@dongjoon-hyun

Ya, I know that part, but do we have an end-to-end Hive UDF registration and invocation test case?


pan3793 commented May 10, 2024

@dongjoon-hyun AFAIK, "Hive UDF execution" always uses the built-in Hive jars without the IsolatedClassLoader, while "Hive UDF registration" happens during Hive.get(hiveConf) with the IsolatedClassLoader when constructing the HMS client.

@dongjoon-hyun

@dongjoon-hyun AFAIK, "Hive UDF execution" always uses the built-in Hive jars without the IsolatedClassLoader, while "Hive UDF registration" happens during Hive.get(hiveConf) with the IsolatedClassLoader when constructing the HMS client.

It sounds like we could have a corner case. That's exactly why we need an actual test case to cover it, isn't it?

@dongjoon-hyun dongjoon-hyun dismissed their stale review May 10, 2024 08:12

Stale review.

@dongjoon-hyun

For this one PR, I believe we need verification with different HMS versions to make sure.


pan3793 commented May 10, 2024

Hmm, let me clarify my view.

In short, I think the current CI is sufficient.

Spark uses Hive in two cases:

  1. As an HMS client. To support different HMS versions, Spark allows using the IsolatedClassLoader to load a different Hive version. It calls Hive.getWithoutRegisterFns(hiveConf) or Hive.get(hiveConf) to create the Hive instance, which may trigger the Hive built-in UDF registration; for older Hive versions, e.g. 2.1.1, some built-in Hive UDFs may trigger loading of CodeHaus Jackson classes (a simplified class-loading sketch follows after this list).

  2. As an execution library. Spark always uses the built-in Hive jars to read/write Hive tables and execute Hive UDFs.

For case 1, the CI already covers that (any older HMS client initialization triggers built-in UDF registration). For case 2, there is no chance of invoking CodeHaus Jackson classes, since Hive 2.3.10 removed them entirely from its codebase.
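
To illustrate case 1, the snippet below is a minimal, hypothetical sketch of the parent-last (child-first) class-loading idea that such an isolated loader relies on: classes for the selected Hive/metastore version (including, for older Hive, the CodeHaus Jackson classes) are resolved from a separate set of jars before falling back to Spark's own classpath. This is not Spark's actual IsolatedClientLoader (which additionally shares a set of base classes with the parent loader); the class name and structure are illustrative only.

import java.net.{URL, URLClassLoader}

// Hypothetical child-first class loader: looks in the isolated Hive/metastore jars
// before delegating to the parent (Spark's built-in classpath).
class ChildFirstClassLoader(hiveJars: Array[URL], parent: ClassLoader)
    extends URLClassLoader(hiveJars, parent) {

  override def loadClass(name: String, resolve: Boolean): Class[_] = synchronized {
    val alreadyLoaded = findLoadedClass(name)
    if (alreadyLoaded != null) {
      if (resolve) resolveClass(alreadyLoaded)
      alreadyLoaded
    } else {
      try {
        // Try the isolated Hive jars first; this is where e.g. org.codehaus.jackson
        // classes come from when an older metastore client is selected.
        val c = findClass(name)
        if (resolve) resolveClass(c)
        c
      } catch {
        // Fall back to Spark's built-in classpath only if the class is not found there.
        case _: ClassNotFoundException => super.loadClass(name, resolve)
      }
    }
  }
}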


pan3793 commented May 10, 2024

also cc @wangyum @yaooqinn @AngersZhuuuu @cloud-fan


pan3793 commented May 10, 2024

For this one PR, I believe we need verification with different HMS versions to make sure.

That's a valid concern. Since the Spark CI only covers the embedded HMS client case, let me test it with a real setup.

@dongjoon-hyun

Thank you. Please attach the test results to the PR description.


@dongjoon-hyun dongjoon-hyun left a comment


Please hold off on all Hive-related dependency changes until we recover the Maven CIs.

#46468 (review)

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Aug 21, 2024
@github-actions github-actions bot closed this Aug 22, 2024

pan3793 commented Mar 7, 2025

For this one PR, I believe we need verification with different HMS versions to make sure.

@dongjoon-hyun I managed to set up an environment to test the IsolatedClassLoader; it works as expected.

The basic test steps:

  • Hadoop Cluster with HDFS and YARN (v3.3.6)
  • HMS 3.1.2
  • Spark 4.0.0 RC2, with hive-llap-common-2.3.10.jar added and the hive-jacksons folder deleted

Verify that the built-in Hive 2.3.10 works well without the CodeHaus Jackson jars

$ bin/spark-shell
scala> spark.sql("show databases").show()
+---------+
|namespace|
+---------+
|  default|
+---------+

scala> spark.sql("create temporary function hive_uuid as 'org.apache.hadoop.hive.ql.udf.UDFUUID'")
val res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select hive_uuid()").show
+--------------------+
|         hive_uuid()|
+--------------------+
|3ad7e110-2ad9-4f0...|
+--------------------+

Verify that the Hive 3.1.3 metastore jars also work well without the CodeHaus Jackson jars

$ bin/spark-shell --conf spark.sql.hive.metastore.version=3.1.3 --conf spark.sql.hive.metastore.jars=maven
scala> spark.sql("show databases").show()
... (triggers jar downloading, including the CodeHaus Jackson jars)
downloading https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar ...
	[SUCCESSFUL ] org.codehaus.jackson#jackson-core-asl;1.9.13!jackson-core-asl.jar (1178ms)
downloading https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar ...
	[SUCCESSFUL ] org.codehaus.jackson#jackson-mapper-asl;1.9.13!jackson-mapper-asl.jar (3245ms)
downloading https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar ...
	[SUCCESSFUL ] org.codehaus.jackson#jackson-jaxrs;1.9.13!jackson-jaxrs.jar (429ms)
downloading https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-xc/1.9.13/jackson-xc-1.9.13.jar ...
	[SUCCESSFUL ] org.codehaus.jackson#jackson-xc;1.9.13!jackson-xc.jar (454ms)
...
downloading https://repo1.maven.org/maven2/org/apache/hive/hive-llap-client/3.1.3/hive-llap-client-3.1.3.jar ...
	[SUCCESSFUL ] org.apache.hive#hive-llap-client;3.1.3!hive-llap-client.jar (484ms)
downloading https://repo1.maven.org/maven2/org/apache/hive/hive-llap-common/3.1.3/hive-llap-common-3.1.3.jar ...
	[SUCCESSFUL ] org.apache.hive#hive-llap-common;3.1.3!hive-llap-common.jar (851ms)
...
:: retrieving :: org.apache.spark#spark-submit-parent-a3e58de0-7045-4d08-ba86-3d7b1fc03a46
	confs: [default]
	209 artifacts copied, 0 already retrieved (236164kB/372ms)
2025-03-07T16:51:48.371441489Z main INFO Starting configuration XmlConfiguration[location=/etc/spark/conf/log4j2.xml, lastModified=2025-03-07T15:59:46.435Z]...
2025-03-07T16:51:48.371577157Z main INFO Start watching for changes to /etc/spark/conf/log4j2.xml every 0 seconds
2025-03-07T16:51:48.371673867Z main INFO Configuration XmlConfiguration[location=/etc/spark/conf/log4j2.xml, lastModified=2025-03-07T15:59:46.435Z] started.
2025-03-07T16:51:48.371843828Z main INFO Stopping configuration org.apache.logging.log4j.core.config.DefaultConfiguration@4926e32c...
2025-03-07T16:51:48.371982996Z main INFO Configuration org.apache.logging.log4j.core.config.DefaultConfiguration@4926e32c stopped.
2025-03-07 16:51:48 INFO HiveConf: Found configuration file file:/etc/hive/conf/hive-site.xml
Hive Session ID = 1bd666c3-8064-43e5-9b8d-5b680ccc5e6a
2025-03-07 16:51:48 INFO SessionState: Hive Session ID = 1bd666c3-8064-43e5-9b8d-5b680ccc5e6a
2025-03-07 16:51:48 INFO HiveMetaStoreClient: Trying to connect to metastore with URI thrift://hadoop-master1.orb.local:9083
2025-03-07 16:51:48 INFO HiveMetaStoreClient: Opened a connection to metastore, current connections: 1
2025-03-07 16:51:48 INFO HiveMetaStoreClient: Connected to metastore.
2025-03-07 16:51:48 INFO RetryingMetaStoreClient: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=root (auth:SIMPLE) retries=1 delay=1 lifetime=0
+---------+
|namespace|
+---------+
+---------+

scala> spark.sql("create temporary function hive_uuid as 'org.apache.hadoop.hive.ql.udf.UDFUUID'")
val res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select hive_uuid()").show
+--------------------+
|         hive_uuid()|
+--------------------+
|8ec74979-7fba-47e...|
+--------------------+

@dongjoon-hyun

Thank you for checking, @pan3793 .

Are you assuming that all Hive UDF jars would be rebuilt here? I'm wondering if you are presenting the test results with UDF jars built against old Hive.

@dongjoon-hyun

BTW, thank you for taking a look at removing this.
I support your direction and I hope we can revisit this with you in the Apache Spark 4.1.0 timeframe, @pan3793.

@dongjoon-hyun

I added this to a subtask of SPARK-48231 .


pan3793 commented Mar 7, 2025

Are you assuming that all Hive UDF jars would be rebuilt here?

@dongjoon-hyun I never made such an assumption. Most existing UDFs should work without any change, except those that explicitly import and use classes we removed from new Spark releases. This is not limited to CodeHaus Jackson; the risk arises each time we update dev/deps/spark-deps-hadoop-3-hive-2.3.

For example, say a CustomUDF built with Hive 2.3.9 uses OkHttp classes. It works well in Spark 3.5 because Spark ships the OkHttp jar via K8s client 6, but Spark 4.0 removed the OkHttp jars during the K8s client 7 upgrade, so the CustomUDF would fail with an OkHttp class-not-found error. To fix it, the user can either shade the dependencies or add them via --packages (in this case, no rebuild is required because the Hive UDF interface is binary compatible). So it's the user's responsibility to handle a UDF's transitive dependencies.

What matters is that we must NOT break the Hive built-in UDF dependencies; otherwise it blocks o.a.h.hive.ql.exec.FunctionRegistry initialization and breaks the whole Hive UDF feature. That's why I argue that SPARK-51029 should be reverted.

SPARK-51029 (GitHub PR [1]) removes hive-llap-common from the Spark binary distributions, which technically
breaks the feature "Spark SQL supports integration of Hive UDFs, UDAFs and UDTFs" [2]; more precisely, it changes
Hive UDF support from batteries-included to not.

In detail, when a user runs a query like CREATE TEMPORARY FUNCTION hello AS 'my.HelloUDF', it triggers
o.a.h.hive.ql.exec.FunctionRegistry initialization, which also initializes the Hive built-in UDFs, UDAFs and
UDTFs [3]; then a NoClassDefFoundError occurs because some built-in UDTFs depend on classes in hive-llap-common.

org.apache.spark.sql.execution.QueryExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable
	at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
	at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373)
	at java.base/java.lang.Class.getConstructor0(Class.java:3578)
	at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2754)
	at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79)
	at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208)
	at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:500)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:160)
	at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
	at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
	at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
	at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
	at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:197)
	at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:177)
	at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187)
	at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeExpression(HiveSessionStateBuilder.scala:171)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1689)
    ...

Currently (v4.0.0-rc2), the user must add the hive-llap-common jar explicitly, e.g. by using
--packages org.apache.hive:hive-llap-common:2.3.10, to fix the NoClassDefFoundError, even though my.HelloUDF
does not depend on any class in hive-llap-common. This is quite confusing.

[1] #49725
[2] https://spark.apache.org/docs/3.5.5/sql-ref-functions-udf-hive.html
[3] https://github.com/apache/hive/blob/rel/release-2.3.10/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java#L208
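
For context, the my.HelloUDF mentioned above can be as trivial as the hypothetical sketch below (the class name comes from the quoted example; the body is illustrative). It depends on nothing beyond the Hive UDF interface, yet registering it with CREATE TEMPORARY FUNCTION still triggers o.a.h.hive.ql.exec.FunctionRegistry initialization, which is where the NoClassDefFoundError above surfaces when the built-in UDFs' dependencies (such as hive-llap-common) are missing.

package my

import org.apache.hadoop.hive.ql.exec.UDF

// Hypothetical minimal Hive UDF matching the my.HelloUDF example above.
// The failure described above is caused by FunctionRegistry's built-in UDF/UDTF
// registration, not by anything this class itself depends on.
class HelloUDF extends UDF {
  def evaluate(name: String): String = s"Hello, $name"
}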


pan3793 commented Mar 7, 2025

In short, my conclusion is that we should and must keep all jars required by the Hive built-in UDFs to allow o.a.h.hive.ql.exec.FunctionRegistry initialization. For other jars like commons-lang, CodeHaus Jackson, and jodd, just remove them and let users add them explicitly if required. (In my company's production cases, this is rare.)

@dongjoon-hyun

In short, my conclusion is that we should and must keep all jars required by the Hive built-in UDFs to allow o.a.h.hive.ql.exec.FunctionRegistry initialization. For other jars like commons-lang, CodeHaus Jackson, and jodd, just remove them and let users add them explicitly if required. (In my company's production cases, this is rare.)

I fully understand your background, reasoning, and this conclusion. May I ask why you initiated that discussion on this PR, [SPARK-48231][BUILD] Remove unused CodeHaus Jackson dependencies, @pan3793?


pan3793 commented Mar 7, 2025

For this one PR, I believe we need verification with different HMS versions to make sure.

That's a valid concern. Since the Spark CI only covers the embedded HMS client case, let me test it with a real setup.

I just realized I had forgotten about this, and I think this PR could be reopened so the comments remain visible to future reviewers.

I hope we can revisit this with you in the Apache Spark 4.1.0 timeframe

Given that this should be done in 4.1, let's focus on SPARK-51029 and move the discussion to #49725 for now?

@dongjoon-hyun

No, what I meant here was that your concern is legitimate. So you can raise your concerns to a broader audience, @pan3793, for example on dev@spark instead of this PR, which is completely opposite to your intention.

If you want, you can block Apache Spark 4.0.0 RCs by vetoing and initiating a discussion thread to add them all back.

In short, my conclusion is that we should and must keep all jars required by the Hive built-in UDFs

@dongjoon-hyun

The RC is supposed to gather this kind of feedback and these difficulties. There is no Apache Spark 4.0.0 until we have a community-blessed one.


pan3793 commented Mar 7, 2025

I get your point; let me respond in the voting email.

@dongjoon-hyun

Thank you, @pan3793! And sorry for the inconvenience.


pan3793 commented Jun 23, 2025

Now that Spark 4.0.0 has been successfully released with Hive 2.3.10, I think we can continue the process of removing CodeHaus Jackson dependencies.
@wangyum @dongjoon-hyun can you please help re-open this PR?

@wangyum wangyum reopened this Jun 23, 2025
@wangyum wangyum removed the Stale label Jun 23, 2025

@dongjoon-hyun dongjoon-hyun left a comment


Thank you, @pan3793. I cleared my previous review comments from this PR. For the rest of the process, I'm going to follow the community decision while staying away from this PR, because I want to remain neutral on this specific topic.


pan3793 commented Jun 24, 2025

Thanks @dongjoon-hyun.

cc @wangyum @LuciferYang, this is ready for review. Also cc @Madhukar525722, who asked about this before.


@yaooqinn yaooqinn left a comment


Looks fine from my side

@LuciferYang

Please give me two hours to carry out some verification work


@wangyum wangyum left a comment


LGTM.


@LuciferYang LuciferYang left a comment


LGTM

@LuciferYang

Merged into master for Apache Spark 4.1.0. Thanks @pan3793 @yaooqinn @wangyum and @dongjoon-hyun

haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jul 22, 2025

Closes apache#46521 from pan3793/SPARK-48231.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: yangjie01 <[email protected]>