Conversation

@pan3793 pan3793 commented Mar 11, 2025

### What changes were proposed in this pull request?

Fork a few methods from Hive to eliminate calls to `org.apache.hadoop.hive.ql.exec.FunctionRegistry`, thereby avoiding initialization of the Hive built-in UDFs.

### Why are the changes needed?

Currently, when a user runs a query that contains a Hive UDF, it triggers `o.a.h.hive.ql.exec.FunctionRegistry` initialization, which also initializes the Hive built-in UDFs, UDAFs, and UDTFs.

Since SPARK-51029 (#49725) removed `hive-llap-common` from the Spark binary distributions, a `NoClassDefFoundError` occurs:

```
org.apache.spark.sql.execution.QueryExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable
    at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373)
    at java.base/java.lang.Class.getConstructor0(Class.java:3578)
    at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2754)
    at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79)
    at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208)
    at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201)
    at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:500)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:160)
    at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
    at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
    at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
    at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
    at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:197)
    at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:177)
    at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187)
    at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeExpression(HiveSessionStateBuilder.scala:171)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1689)
    ...
```

Actually, Spark does not use those Hive built-in functions, but it still needs to pull in their transitive deps to keep Hive happy. By eliminating the Hive built-in UDFs initialization, Spark can get rid of those transitive deps and gain a small performance improvement on the first call of a Hive UDF.
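The failure mode above is ordinary JVM class-initialization semantics. A minimal, self-contained sketch (not Spark/Hive code; all names here are invented) of why a throwing static initializer first surfaces as `ExceptionInInitializerError` and then as `NoClassDefFoundError` on every later use of the class:

```java
import java.util.ArrayList;
import java.util.List;

public class ClinitDemo {
    // Stand-in for FunctionRegistry: its <clinit> fails, e.g. because a
    // built-in UDF's class (think hive-llap-common) is missing.
    static class HeavyRegistry {
        static {
            // if (true) keeps javac happy: a static initializer may not
            // unconditionally throw.
            if (true) throw new RuntimeException("built-in UDF registration failed");
        }
        static String lookup(String name) { return name; }
    }

    // Touch the class twice and record which error each attempt raises.
    static List<String> probe() {
        List<String> seen = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            try {
                HeavyRegistry.lookup("uuid"); // first active use runs <clinit>
            } catch (Throwable t) {
                seen.add(t.getClass().getSimpleName());
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // First use wraps the <clinit> failure; the class is then marked
        // erroneous, so later uses fail with NoClassDefFoundError.
        System.out.println(probe()); // [ExceptionInInitializerError, NoClassDefFoundError]
    }
}
```

This is why avoiding the first reference to `FunctionRegistry` avoids the whole problem: the class is simply never initialized.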

### Does this PR introduce _any_ user-facing change?

No, except for a small perf improvement on the first call of a Hive UDF.

### How was this patch tested?

Excluded `hive-llap-*` deps from the STS module and passed all SQL tests (previously, some tests failed without the `hive-llap-*` deps; see SPARK-51041).

Manually tested that calling a Hive UDF, UDAF, or UDTF does not trigger `org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>`:

```
$ bin/spark-sql
// UDF
spark-sql (default)> create temporary function hive_uuid as 'org.apache.hadoop.hive.ql.udf.UDFUUID';
Time taken: 0.878 seconds
spark-sql (default)> select hive_uuid();
840356e5-ce2a-4d6c-9383-294d620ec32b
Time taken: 2.264 seconds, Fetched 1 row(s)

// GenericUDF
spark-sql (default)> create temporary function hive_sha2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFSha2';
Time taken: 0.023 seconds
spark-sql (default)> select hive_sha2('ABC', 256);
b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78
Time taken: 0.157 seconds, Fetched 1 row(s)

// UDAF
spark-sql (default)> create temporary function hive_percentile as 'org.apache.hadoop.hive.ql.udf.UDAFPercentile';
Time taken: 0.032 seconds
spark-sql (default)> select hive_percentile(id, 0.5) from range(100);
49.5
Time taken: 0.474 seconds, Fetched 1 row(s)

// GenericUDAF
spark-sql (default)> create temporary function hive_sum as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum';
Time taken: 0.017 seconds
spark-sql (default)> select hive_sum(*) from range(100);
4950
Time taken: 1.25 seconds, Fetched 1 row(s)

// GenericUDTF
spark-sql (default)> create temporary function hive_replicate_rows as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFReplicateRows';
Time taken: 0.012 seconds
spark-sql (default)> select hive_replicate_rows(3L, id) from range(3);
3	0
3	0
3	0
3	1
3	1
3	1
3	2
3	2
3	2
Time taken: 0.19 seconds, Fetched 9 row(s)
```

### Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Mar 11, 2025
@pan3793 pan3793 changed the title [WIP] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation [SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation Mar 11, 2025
@pan3793 pan3793 force-pushed the eliminate-hive-udf-init branch from 22fdec8 to ec30d94 Compare March 11, 2025 08:27
@pan3793 pan3793 marked this pull request as ready for review March 11, 2025 08:27

pan3793 commented Mar 11, 2025

cc @cloud-fan @dongjoon-hyun @LuciferYang @wangyum @yaooqinn this is an alternative to SPARK-51449 (#50222); the advantages of this one are:

  • applicable to both Hive 2.3.9 and 2.3.10, so we don't need to worry about the risk of reverting the Hive 2.3.10 upgrade
  • addresses a broader scope of issues around Hive built-in UDF transitive deps
  • a small perf improvement on the first call of a Hive UDF

return getMethodInternal(udfClass, "evaluate", false, argClasses);
}

// Below methods are copied from Hive 2.3.10 o.a.h.hive.ql.exec.FunctionRegistry
Member Author
above is wrapper code, below is copied code.
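A hedged sketch of this wrapper/copied-code split (all class and method names below are invented, not the actual Spark or Hive identifiers): copying a pure static helper out of a class whose static initializer performs heavy registration lets callers reuse the logic without ever triggering that initialization.

```java
public class ForkDemo {
    // Flag flipped by the heavy class's static initializer, so we can
    // observe whether <clinit> ever ran.
    static boolean heavyInitRan = false;

    // Stand-in for Hive's FunctionRegistry: it holds a useful static helper,
    // but its class initializer eagerly registers many built-in functions.
    static class HeavyFunctionRegistry {
        static { heavyInitRan = true; } // expensive registration side effect
        static String normalize(String name) { return name.toLowerCase(); }
    }

    // Forked copy of the helper: identical logic, but no reference to the
    // heavy class, so calling it never runs HeavyFunctionRegistry.<clinit>.
    static String normalizeForked(String name) { return name.toLowerCase(); }

    public static void main(String[] args) {
        String n = normalizeForked("MyUDF");
        System.out.println(n + " heavyInitRan=" + heavyInitRan); // myudf heavyInitRan=false
    }
}
```

At a high level this mirrors the PR: Spark forks the few `FunctionRegistry` helper methods it actually needs, so resolving a user's Hive UDF never references `FunctionRegistry` and never runs its static registration of built-ins.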

Contributor

As discussed offline, if we proceed with the current PR, we need to refer to the description in #49736 to ensure that the hive-thriftserver tests can still be executed successfully after removing the hive-llap dependency.

Member Author

Yes, I wasn't aware of the STS issue previously; this approach requires further investigation.

Member Author

I believe the STS issue is addressed now. @LuciferYang could you please take another look?

```
$ build/mvn -Phive-thriftserver install -DskipTests
$ build/mvn -pl sql/hive-thriftserver -Phive-thriftserver install -fae
...
Run completed in 17 minutes, 21 seconds.
Total number of tests run: 639
Suites: completed 20, aborted 0
Tests: succeeded 639, failed 0, canceled 0, ignored 25, pending 0
All tests passed.
```

@pan3793 pan3793 marked this pull request as draft March 11, 2025 12:11
@github-actions github-actions bot added the BUILD label Mar 11, 2025
@pan3793 pan3793 marked this pull request as ready for review March 11, 2025 13:19
logger.thriftserver.level = off

logger.dagscheduler.name = org.apache.spark.scheduler.DAGScheduler
logger.dagscheduler.level = error
Member Author

to suppress noisy logs like

20:44:53.029 WARN org.apache.spark.scheduler.DAGScheduler: Failed to cancel job group 2f794b16-abee-4bbe-9caa-8be3416c500b. Cannot find active jobs for it.

@dongjoon-hyun dongjoon-hyun (Member) left a comment

Thank you so much for suggesting alternatives for the community, @pan3793 and @LuciferYang .

<artifactId>byte-buddy-agent</artifactId>
<scope>test</scope>
</dependency>
<dependency>
Contributor

Is it possible for us to add some configuration in SparkBuild.scala to ensure that hive-llap is not included in the classpath when testing the thriftserver module with sbt as well?

Member Author

Done by updating SparkBuild.scala; it can be verified by

```
build/sbt -Phive-thriftserver hive-thriftserver/Test/dependencyTree | grep hive-llap
```

before

```
[info]   +-org.apache.hive:hive-llap-client:2.3.10
[info]   +-org.apache.hive:hive-llap-common:2.3.10
[info]   | +-org.apache.hive:hive-llap-client:2.3.10
[info]   | +-org.apache.hive:hive-llap-common:2.3.10
```

now the result is empty.

@transient
lazy val returnInspector = {
function.initializeAndFoldConstants(argumentInspectors.toArray)
// Inline o.a.h.hive.ql.udf.generic.GenericUDF#initializeAndFoldConstants, but
Contributor

also cc @panbingkun

@LuciferYang LuciferYang (Contributor) left a comment

+1, LGTM
Thank you @pan3793

@dongjoon-hyun dongjoon-hyun (Member) left a comment

+1, LGTM.

@dongjoon-hyun (Member)

Thank you, @pan3793 and @LuciferYang . Merged to master.

Could you make a backporting PR to branch-4.0 to pass CI there once more, @pan3793 ?

pan3793 added a commit to pan3793/spark that referenced this pull request Mar 13, 2025
…on Hive UDF evaluation

Closes apache#50232 from pan3793/eliminate-hive-udf-init.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

pan3793 commented Mar 13, 2025

@dongjoon-hyun I opened #50264 for the 4.0 backport

@yaooqinn (Member)

Hey, why aren't the manual tests listed in the PR desc included in UTs?


pan3793 commented Mar 13, 2025

Hey, why aren't the manual tests listed in the PR desc included in UTs?

Given that we have excluded hive-llap-* deps from STS modules, the existing STS SQL tests should cover all my manual test cases.

LuciferYang pushed a commit that referenced this pull request Mar 13, 2025
…tion on Hive UDF evaluation

Backport #50232 to branch-4.0


Closes #50264 from pan3793/SPARK-51466-4.0.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
@cloud-fan (Contributor)

late LGTM

anoopj pushed a commit to anoopj/spark that referenced this pull request Mar 15, 2025
…on Hive UDF evaluation


Closes apache#50232 from pan3793/eliminate-hive-udf-init.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@pan3793 pan3793 deleted the eliminate-hive-udf-init branch July 22, 2025 13:29