Conversation
@pan3793 pan3793 commented Mar 13, 2025

Backport #50232 to branch-4.0

### What changes were proposed in this pull request?

Fork a few methods from Hive to eliminate calls to `org.apache.hadoop.hive.ql.exec.FunctionRegistry`, thereby avoiding initialization of the Hive built-in UDFs.
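
The forked logic can be as small as reading the `@UDFType` annotation directly instead of going through `FunctionRegistry`. Below is a minimal sketch of that idea; the object name and method set are illustrative assumptions, not the actual helpers added by this PR.

```scala
// Illustrative sketch only -- the helper name and placement are hypothetical, not the
// code added by this PR. The idea: re-implement the small annotation-based checks that
// Spark needs, so org.apache.hadoop.hive.ql.exec.FunctionRegistry (whose static
// initializer registers every Hive built-in UDF/UDAF/UDTF) is never class-loaded.
import org.apache.hadoop.hive.ql.udf.UDFType
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF

object HiveUDFTypeChecks {
  // Simplified fork of the FunctionRegistry.isStateful check: read @UDFType directly.
  def isStateful(udf: GenericUDF): Boolean = {
    val t = udf.getClass.getAnnotation(classOf[UDFType])
    t != null && t.stateful()
  }

  // Simplified fork of the FunctionRegistry.isDeterministic check: stateful UDFs are
  // never deterministic; otherwise fall back to the annotation (absent means deterministic).
  def isDeterministic(udf: GenericUDF): Boolean = {
    val t = udf.getClass.getAnnotation(classOf[UDFType])
    !isStateful(udf) && (t == null || t.deterministic())
  }
}
```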

### Why are the changes needed?

Currently, when the user runs a query that contains a Hive UDF, it triggers `o.a.h.hive.ql.exec.FunctionRegistry` initialization, which also registers all Hive built-in UDFs, UDAFs and UDTFs.

Since SPARK-51029 (#49725) removed `hive-llap-common` from the Spark binary distributions, a `NoClassDefFoundError` occurs:

```
org.apache.spark.sql.execution.QueryExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable
    at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373)
    at java.base/java.lang.Class.getConstructor0(Class.java:3578)
    at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2754)
    at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79)
    at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208)
    at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201)
    at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:500)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:160)
    at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
    at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
    at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
    at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
    at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:197)
    at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:177)
    at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187)
    at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeExpression(HiveSessionStateBuilder.scala:171)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1689)
    ...
```

Spark does not actually use those Hive built-in functions, but it still has to ship their transitive deps just to keep Hive's registration code happy. By eliminating the Hive built-in UDF initialization, Spark can drop those transitive deps and gains a small performance improvement on the first call of a Hive UDF.

### Does this PR introduce _any_ user-facing change?

No, except for a small perf improvement on the first call of a Hive UDF.

### How was this patch tested?

Excluded the `hive-llap-*` deps from the STS module and passed all SQL tests (previously some tests failed without the `hive-llap-*` deps, see SPARK-51041).
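
For reference, a hedged sbt-style sketch of such an exclusion (Spark's real build change lives in the Maven poms; the artifact and version shown here are only illustrative):

```scala
// Illustrative only: the actual Spark build uses Maven, and the exact dependency that
// pulls in the hive-llap-* artifacts may differ. Shown as sbt-style Scala for brevity.
libraryDependencies += ("org.apache.hive" % "hive-service" % "2.3.10")
  .exclude("org.apache.hive", "hive-llap-common")
  .exclude("org.apache.hive", "hive-llap-client")
```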

Manually tested that calling a Hive UDF, UDAF, and UDTF does not trigger `org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>`:

```
$ bin/spark-sql
// UDF
spark-sql (default)> create temporary function hive_uuid as 'org.apache.hadoop.hive.ql.udf.UDFUUID';
Time taken: 0.878 seconds
spark-sql (default)> select hive_uuid();
840356e5-ce2a-4d6c-9383-294d620ec32b
Time taken: 2.264 seconds, Fetched 1 row(s)

// GenericUDF
spark-sql (default)> create temporary function hive_sha2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFSha2';
Time taken: 0.023 seconds
spark-sql (default)> select hive_sha2('ABC', 256);
b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78
Time taken: 0.157 seconds, Fetched 1 row(s)

// UDAF
spark-sql (default)> create temporary function hive_percentile as 'org.apache.hadoop.hive.ql.udf.UDAFPercentile';
Time taken: 0.032 seconds
spark-sql (default)> select hive_percentile(id, 0.5) from range(100);
49.5
Time taken: 0.474 seconds, Fetched 1 row(s)

// GenericUDAF
spark-sql (default)> create temporary function hive_sum as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum';
Time taken: 0.017 seconds
spark-sql (default)> select hive_sum(*) from range(100);
4950
Time taken: 1.25 seconds, Fetched 1 row(s)

// GenericUDTF
spark-sql (default)> create temporary function hive_replicate_rows as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFReplicateRows';
Time taken: 0.012 seconds
spark-sql (default)> select hive_replicate_rows(3L, id) from range(3);
3	0
3	0
3	0
3	1
3	1
3	1
3	2
3	2
3	2
Time taken: 0.19 seconds, Fetched 9 row(s)
```

### Was this patch authored or co-authored using generative AI tooling?

No.

@LuciferYang LuciferYang left a comment

+1, LGTM

LuciferYang pushed a commit that referenced this pull request Mar 13, 2025
Closes #50264 from pan3793/SPARK-51466-4.0.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
@LuciferYang

Merged into branch-4.0. Thanks @pan3793
