Conversation
@pan3793 pan3793 commented Mar 10, 2025

What changes were proposed in this pull request?

This PR restores hive-llap-common from provided to compile scope, partially reverting #49725 and #50146.
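For context, the scope change described above would look roughly like the following in the relevant Maven module (a sketch only; the exact module, version property, and surrounding pom.xml structure are assumptions, not the actual diff):

```xml
<!-- Before (as of #49725): jar is excluded from the binary distribution -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-llap-common</artifactId>
  <version>${hive.llap.scope.version}</version>
  <scope>provided</scope>
</dependency>

<!-- After this PR: jar ships with the distribution again -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-llap-common</artifactId>
  <version>${hive.llap.scope.version}</version>
  <scope>compile</scope>
</dependency>
```

With compile scope, the jar lands in the distribution's jars/ directory at packaging time, so Hive's FunctionRegistry can resolve the LLAP classes at runtime without any user action.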

Why are the changes needed?

SPARK-51029 (#49725) removed hive-llap-common from the Spark binary distributions, which technically breaks the feature "Spark SQL supports integration of Hive UDFs, UDAFs and UDTFs"; more precisely, it changes Hive UDF support from batteries-included to bring-your-own-jar.

In detail, when a user runs a query like CREATE TEMPORARY FUNCTION hello AS 'my.HelloUDF', it triggers initialization of o.a.h.hive.ql.exec.FunctionRegistry, which also initializes the Hive built-in UDFs, UDAFs and UDTFs. A NoClassDefFoundError then occurs because some of those built-in UDTFs depend on classes in hive-llap-common.

org.apache.spark.sql.execution.QueryExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable
    at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373)
    at java.base/java.lang.Class.getConstructor0(Class.java:3578)
    at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2754)
    at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79)
    at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208)
    at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201)
    at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:500)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:160)
    at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
    at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
    at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
    at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
    at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:197)
    at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:177)
    at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187)
    at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeExpression(HiveSessionStateBuilder.scala:171)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1689)
    ...

Currently (v4.0.0-rc2), the user must add the hive-llap-common jar explicitly, e.g. via
--packages org.apache.hive:hive-llap-common:2.3.10, to fix the NoClassDefFoundError, even though my.HelloUDF
does not depend on any class in hive-llap-common. This is quite confusing.
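The RC2-era workaround looks roughly like the following (a sketch: my.HelloUDF and the jar path are the hypothetical names from above, and the exact Hive version to pin may differ):

```shell
# Workaround on Spark 4.0.0-rc2: pull hive-llap-common onto the classpath
# explicitly, alongside the jar that actually contains the user's UDF.
spark-sql \
  --packages org.apache.hive:hive-llap-common:2.3.10 \
  --jars /path/to/hello-udf.jar

# Then, in the SQL session, registration no longer hits NoClassDefFoundError:
#   CREATE TEMPORARY FUNCTION hello AS 'my.HelloUDF';
#   SELECT hello('world');
```

The confusing part is that the extra --packages coordinate is needed purely because Hive's FunctionRegistry class-initializes every built-in function on first use, not because the user's UDF references LLAP.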

Does this PR introduce any user-facing change?

Yes. It restores "Spark SQL supports integration of Hive UDFs, UDAFs and UDTFs" to batteries-included, as in earlier releases such as Spark 3.5.

How was this patch tested?

Manually verified: the NoClassDefFoundError is gone after hive-llap-common is restored to the classpath when calling a Hive UDF.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the BUILD label Mar 10, 2025
@github-actions github-actions bot added the DOCS label Mar 10, 2025
@pan3793 pan3793 (Member, Author) commented Mar 10, 2025

I personally would treat this as a blocker for the 4.0.0 release.
cc @cloud-fan @dongjoon-hyun @LuciferYang @wangyum @yaooqinn

@wangyum wangyum (Member) commented Mar 10, 2025

+1 for restoring it.

@dongjoon-hyun dongjoon-hyun (Member) left a comment

Thank you for making a PR, @pan3793 .

To the reviewers: I'm not against this PR, because this is a legitimate request from the community members.

I just want to add a context for the record,

  • Apache Spark 4.0.0 RC2 intentionally made this dependency optional due to CVE-2024-23953. In RC2, the vulnerability only affects production environments where users opt in by installing the package intentionally.
  • This PR propagates the Apache Hive LLAP vulnerability back into the Apache Spark binary distribution, although this is not a regression from Apache Spark 3.
  • After this PR, it is highly recommended that production environments handle it internally, by patching an internal fork of Spark or Hive based on their own situations.

I must admit that the AS-IS Apache Spark 4.0.0 RC2 was a band-aid until Apache Spark upgrades its Hive dependency to Apache Hive 4.x. Every path (including this one) has its own rationale. So, thank you again.

In our production environment, we will still opt out.
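For operators who, like the comment above, prefer to keep RC2's behavior after this change, opting out could be as simple as deleting the jar from the distribution, or excluding it in a Maven build that depends on spark-hive. A sketch, with assumed coordinates and versions:

```xml
<!-- Illustrative opt-out: exclude hive-llap-common when pulling in spark-hive.
     Artifact IDs and versions are assumptions; check your actual build. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.13</artifactId>
  <version>4.0.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-llap-common</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

For a binary distribution rather than a build, removing jars/hive-llap-common-*.jar from the unpacked tarball has the same effect, at the cost of reintroducing the NoClassDefFoundError described in this PR when Hive UDFs are used.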

@LuciferYang LuciferYang (Contributor) commented Mar 11, 2025

> (quoting @dongjoon-hyun's comment above)

After restoring this dependency, I think it would be best to document this known issue and provide recommendations to users, for example on the security.html page of the Spark website or in the 4.0 release notes.

@pan3793 pan3793 (Member, Author) commented Mar 13, 2025

Closed in favor of SPARK-51466 (#50232).

@pan3793 pan3793 closed this Mar 13, 2025