[SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation #50232

pan3793 · 2025-03-11T03:44:00Z

What changes were proposed in this pull request?

Fork a few methods from Hive to eliminate calls of org.apache.hadoop.hive.ql.exec.FunctionRegistry to avoid initializing Hive built-in UDFs

Why are the changes needed?

Currently, when the user runs a query that contains Hive UDF, it triggers o.a.h.hive.ql.exec.FunctionRegistry initialization, which also initializes the Hive built-in UDFs, UDAFs and UDTFs.

Since SPARK-51029 (#49725) removes hive-llap-common from the Spark binary distributions, NoClassDefFoundError occurs.

org.apache.spark.sql.execution.QueryExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable
    at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373)
    at java.base/java.lang.Class.getConstructor0(Class.java:3578)
    at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2754)
    at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79)
    at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208)
    at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201)
    at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:500)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:160)
    at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
    at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
    at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
    at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
    at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:197)
    at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:177)
    at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187)
    at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeExpression(HiveSessionStateBuilder.scala:171)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1689)
    ...

Actually, Spark does not use those Hive built-in functions, but still needs to pull those transitive deps to make Hive happy. By eliminating Hive built-in UDFs initialization, Spark can get rid of those transitive deps, and gain a small performance improvement on the first call Hive UDF.

Does this PR introduce any user-facing change?

No, except for a small perf improvement on the first call Hive UDF.

How was this patch tested?

Exclude hive-llap-* deps from the STS module and pass all SQL tests (previously some tests fail without hive-llap-* deps, see SPARK-51041)

Manually tested that call Hive UDF, UDAF and UDTF won't trigger org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>

$ bin/spark-sql
// UDF
spark-sql (default)> create temporary function hive_uuid as 'org.apache.hadoop.hive.ql.udf.UDFUUID';
Time taken: 0.878 seconds
spark-sql (default)> select hive_uuid();
840356e5-ce2a-4d6c-9383-294d620ec32b
Time taken: 2.264 seconds, Fetched 1 row(s)

// GenericUDF
spark-sql (default)> create temporary function hive_sha2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFSha2';
Time taken: 0.023 seconds
spark-sql (default)> select hive_sha2('ABC', 256);
b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78
Time taken: 0.157 seconds, Fetched 1 row(s)

// UDAF
spark-sql (default)> create temporary function hive_percentile as 'org.apache.hadoop.hive.ql.udf.UDAFPercentile';
Time taken: 0.032 seconds
spark-sql (default)> select hive_percentile(id, 0.5) from range(100);
49.5
Time taken: 0.474 seconds, Fetched 1 row(s)

// GenericUDAF
spark-sql (default)> create temporary function hive_sum as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum';
Time taken: 0.017 seconds
spark-sql (default)> select hive_sum(*) from range(100);
4950
Time taken: 1.25 seconds, Fetched 1 row(s)

// GenericUDTF
spark-sql (default)> create temporary function hive_replicate_rows as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFReplicateRows';
Time taken: 0.012 seconds
spark-sql (default)> select hive_replicate_rows(3L, id) from range(3);
3	0
3	0
3	0
3	1
3	1
3	1
3	2
3	2
3	2
Time taken: 0.19 seconds, Fetched 9 row(s)

Was this patch authored or co-authored using generative AI tooling?

No.

…on Hive UDF evaluation

pan3793 · 2025-03-11T08:33:51Z

cc @cloud-fan @dongjoon-hyun @LuciferYang @wangyum @yaooqinn this is an alternative of SPARK-51449 (#50222), the advantages of this one are:

applicable for both Hive 2.3.9 and 2.3.10, so we don't need to worry about the risk of reverting Hive 2.3.10 upgrading
address a broader scope of issues about Hive built-in UDF transitive deps
a small perf improvement on the first call Hive UDF

pan3793 · 2025-03-11T08:35:33Z

sql/hive/src/main/java/org/apache/hadoop/hive/ql/exec/SparkDefaultUDFMethodResolver.java

+    return getMethodInternal(udfClass, "evaluate", false, argClasses);
+  }
+
+  // Below methods are copied from Hive 2.3.10 o.a.h.hive.ql.exec.FunctionRegistry


above is wrapper code, below is copied code.

As discussed offline, if we proceed with the current pr, we need to refer to the description in #49736 to ensure that the hive-thriftserver tests can be successfully executed after removing the hive-llap dependency.

Yes, I wasn't aware of the STS issue previously, this approach requires further investigation.

I believe the STS issue is addressed now. @LuciferYang could you please take another look?

$ build/mvn -Phive-thriftserver install -DskipTests $ build/mvn -pl sql/hive-thriftserver -Phive-thriftserver install -fae ... Run completed in 17 minutes, 21 seconds. Total number of tests run: 639 Suites: completed 20, aborted 0 Tests: succeeded 639, failed 0, canceled 0, ignored 25, pending 0 All tests passed.

pan3793 · 2025-03-11T13:20:46Z

sql/hive-thriftserver/src/test/resources/log4j2.properties

 logger.thriftserver.level = off
+
+logger.dagscheduler.name = org.apache.spark.scheduler.DAGScheduler
+logger.dagscheduler.level = error


to suppress noisy logs like

20:44:53.029 WARN org.apache.spark.scheduler.DAGScheduler: Failed to cancel job group 2f794b16-abee-4bbe-9caa-8be3416c500b. Cannot find active jobs for it.

dongjoon-hyun

Thank you so much for suggesting alternatives for the community, @pan3793 and @LuciferYang .

LuciferYang · 2025-03-12T05:59:51Z

sql/hive-thriftserver/pom.xml

      <artifactId>byte-buddy-agent</artifactId>
      <scope>test</scope>
    </dependency>
-    <dependency>


Is it possible for us to add some configuration in SparkBuild.scala to ensure that hive-llap is not included in the classpath too when testing the thriftserver module with sbt?

Done by updating the SparkBuild.scala, it could be verified by

build/sbt -Phive-thriftserver hive-thriftserver/Test/dependencyTree | grep hive-llap

before

[info] +-org.apache.hive:hive-llap-client:2.3.10 [info] +-org.apache.hive:hive-llap-common:2.3.10 [info] | +-org.apache.hive:hive-llap-client:2.3.10 [info] | +-org.apache.hive:hive-llap-common:2.3.10

now result is empty.

LuciferYang · 2025-03-12T06:41:44Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFEvaluators.scala

  @transient
  lazy val returnInspector = {
-    function.initializeAndFoldConstants(argumentInspectors.toArray)
+    // Inline o.a.h.hive.ql.udf.generic.GenericUDF#initializeAndFoldConstants, but


also cc @panbingkun

sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFEvaluators.scala

sql/hive/src/main/java/org/apache/hadoop/hive/ql/exec/HiveFunctionRegistryUtils.java

LuciferYang

+1, LGTM
Thank you @pan3793

dongjoon-hyun

+1, LGTM.

dongjoon-hyun · 2025-03-12T18:40:59Z

Thank you, @pan3793 and @LuciferYang . Merged to master.

Could you make a backporting PR to branch-4.0 to pass CI there once more, @pan3793 ?

…on Hive UDF evaluation ### What changes were proposed in this pull request? Fork a few methods from Hive to eliminate calls of `org.apache.hadoop.hive.ql.exec.FunctionRegistry` to avoid initializing Hive built-in UDFs ### Why are the changes needed? Currently, when the user runs a query that contains Hive UDF, it triggers `o.a.h.hive.ql.exec.FunctionRegistry` initialization, which also initializes the [Hive built-in UDFs, UDAFs and UDTFs](https://github.com/apache/hive/blob/rel/release-2.3.10/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java#L500). Since [SPARK-51029](https://issues.apache.org/jira/browse/SPARK-51029) (apache#49725) removes hive-llap-common from the Spark binary distributions, `NoClassDefFoundError` occurs. ``` org.apache.spark.sql.execution.QueryExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable at java.base/java.lang.Class.getDeclaredConstructors0(Native Method) at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373) at java.base/java.lang.Class.getConstructor0(Class.java:3578) at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2754) at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79) at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208) at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:500) at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:160) at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118) at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117) at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132) at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132) at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:197) at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:177) at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187) at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeExpression(HiveSessionStateBuilder.scala:171) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1689) ... ``` Actually, Spark does not use those Hive built-in functions, but still needs to pull those transitive deps to make Hive happy. By eliminating Hive built-in UDFs initialization, Spark can get rid of those transitive deps, and gain a small performance improvement on the first call Hive UDF. ### Does this PR introduce _any_ user-facing change? No, except for a small perf improvement on the first call Hive UDF. ### How was this patch tested? Pass GHA to ensure the porting code is correct. Manually tested that call Hive UDF, UDAF and UDTF won't trigger `org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>` ``` $ bin/spark-sql // UDF spark-sql (default)> create temporary function hive_uuid as 'org.apache.hadoop.hive.ql.udf.UDFUUID'; Time taken: 0.878 seconds spark-sql (default)> select hive_uuid(); 840356e5-ce2a-4d6c-9383-294d620ec32b Time taken: 2.264 seconds, Fetched 1 row(s) // GenericUDF spark-sql (default)> create temporary function hive_sha2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFSha2'; Time taken: 0.023 seconds spark-sql (default)> select hive_sha2('ABC', 256); b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78 Time taken: 0.157 seconds, Fetched 1 row(s) // UDAF spark-sql (default)> create temporary function hive_percentile as 'org.apache.hadoop.hive.ql.udf.UDAFPercentile'; Time taken: 0.032 seconds spark-sql (default)> select hive_percentile(id, 0.5) from range(100); 49.5 Time taken: 0.474 seconds, Fetched 1 row(s) // GenericUDAF spark-sql (default)> create temporary function hive_sum as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum'; Time taken: 0.017 seconds spark-sql (default)> select hive_sum(*) from range(100); 4950 Time taken: 1.25 seconds, Fetched 1 row(s) // GenericUDTF spark-sql (default)> create temporary function hive_replicate_rows as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFReplicateRows'; Time taken: 0.012 seconds spark-sql (default)> select hive_replicate_rows(3L, id) from range(3); 3 0 3 0 3 0 3 1 3 1 3 1 3 2 3 2 3 2 Time taken: 0.19 seconds, Fetched 9 row(s) ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#50232 from pan3793/eliminate-hive-udf-init. Authored-by: Cheng Pan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

pan3793 · 2025-03-13T02:18:06Z

@dongjoon-hyun I opend #50264 for 4.0 backport

yaooqinn · 2025-03-13T07:16:17Z

Hey, why are any manual tests listed in the PR desc not included in UTs?

pan3793 · 2025-03-13T07:23:20Z

Hey, why are any manual tests listed in the PR desc not included in UTs?

Given that we have excluded hive-llap-* deps from STS modules, the existing STS SQL tests should cover all my manual test cases.

…tion on Hive UDF evaluation Backport #50232 to branch-4.0 ### What changes were proposed in this pull request? Fork a few methods from Hive to eliminate calls of `org.apache.hadoop.hive.ql.exec.FunctionRegistry` to avoid initializing Hive built-in UDFs ### Why are the changes needed? Currently, when the user runs a query that contains Hive UDF, it triggers `o.a.h.hive.ql.exec.FunctionRegistry` initialization, which also initializes the [Hive built-in UDFs, UDAFs and UDTFs](https://github.com/apache/hive/blob/rel/release-2.3.10/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java#L500). Since [SPARK-51029](https://issues.apache.org/jira/browse/SPARK-51029) (#49725) removes hive-llap-common from the Spark binary distributions, `NoClassDefFoundError` occurs. ``` org.apache.spark.sql.execution.QueryExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable at java.base/java.lang.Class.getDeclaredConstructors0(Native Method) at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373) at java.base/java.lang.Class.getConstructor0(Class.java:3578) at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2754) at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79) at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208) at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:500) at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:160) at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118) at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117) at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132) at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132) at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:197) at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:177) at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187) at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeExpression(HiveSessionStateBuilder.scala:171) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1689) ... ``` Actually, Spark does not use those Hive built-in functions, but still needs to pull those transitive deps to make Hive happy. By eliminating Hive built-in UDFs initialization, Spark can get rid of those transitive deps, and gain a small performance improvement on the first call Hive UDF. ### Does this PR introduce _any_ user-facing change? No, except for a small perf improvement on the first call Hive UDF. ### How was this patch tested? Exclude `hive-llap-*` deps from the STS module and pass all SQL tests (previously some tests fail without `hive-llap-*` deps, see SPARK-51041) Manually tested that call Hive UDF, UDAF and UDTF won't trigger `org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>` ``` $ bin/spark-sql // UDF spark-sql (default)> create temporary function hive_uuid as 'org.apache.hadoop.hive.ql.udf.UDFUUID'; Time taken: 0.878 seconds spark-sql (default)> select hive_uuid(); 840356e5-ce2a-4d6c-9383-294d620ec32b Time taken: 2.264 seconds, Fetched 1 row(s) // GenericUDF spark-sql (default)> create temporary function hive_sha2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFSha2'; Time taken: 0.023 seconds spark-sql (default)> select hive_sha2('ABC', 256); b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78 Time taken: 0.157 seconds, Fetched 1 row(s) // UDAF spark-sql (default)> create temporary function hive_percentile as 'org.apache.hadoop.hive.ql.udf.UDAFPercentile'; Time taken: 0.032 seconds spark-sql (default)> select hive_percentile(id, 0.5) from range(100); 49.5 Time taken: 0.474 seconds, Fetched 1 row(s) // GenericUDAF spark-sql (default)> create temporary function hive_sum as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum'; Time taken: 0.017 seconds spark-sql (default)> select hive_sum(*) from range(100); 4950 Time taken: 1.25 seconds, Fetched 1 row(s) // GenericUDTF spark-sql (default)> create temporary function hive_replicate_rows as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFReplicateRows'; Time taken: 0.012 seconds spark-sql (default)> select hive_replicate_rows(3L, id) from range(3); 3 0 3 0 3 0 3 1 3 1 3 1 3 2 3 2 3 2 Time taken: 0.19 seconds, Fetched 9 row(s) ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #50264 from pan3793/SPARK-51466-4.0. Authored-by: Cheng Pan <[email protected]> Signed-off-by: yangjie01 <[email protected]>

cloud-fan · 2025-03-13T14:58:11Z

late LGTM

…on Hive UDF evaluation ### What changes were proposed in this pull request? Fork a few methods from Hive to eliminate calls of `org.apache.hadoop.hive.ql.exec.FunctionRegistry` to avoid initializing Hive built-in UDFs ### Why are the changes needed? Currently, when the user runs a query that contains Hive UDF, it triggers `o.a.h.hive.ql.exec.FunctionRegistry` initialization, which also initializes the [Hive built-in UDFs, UDAFs and UDTFs](https://github.com/apache/hive/blob/rel/release-2.3.10/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java#L500). Since [SPARK-51029](https://issues.apache.org/jira/browse/SPARK-51029) (apache#49725) removes hive-llap-common from the Spark binary distributions, `NoClassDefFoundError` occurs. ``` org.apache.spark.sql.execution.QueryExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable at java.base/java.lang.Class.getDeclaredConstructors0(Native Method) at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373) at java.base/java.lang.Class.getConstructor0(Class.java:3578) at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2754) at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79) at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208) at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:500) at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:160) at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118) at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117) at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132) at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132) at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:197) at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:177) at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187) at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeExpression(HiveSessionStateBuilder.scala:171) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1689) ... ``` Actually, Spark does not use those Hive built-in functions, but still needs to pull those transitive deps to make Hive happy. By eliminating Hive built-in UDFs initialization, Spark can get rid of those transitive deps, and gain a small performance improvement on the first call Hive UDF. ### Does this PR introduce _any_ user-facing change? No, except for a small perf improvement on the first call Hive UDF. ### How was this patch tested? Pass GHA to ensure the porting code is correct. Manually tested that call Hive UDF, UDAF and UDTF won't trigger `org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>` ``` $ bin/spark-sql // UDF spark-sql (default)> create temporary function hive_uuid as 'org.apache.hadoop.hive.ql.udf.UDFUUID'; Time taken: 0.878 seconds spark-sql (default)> select hive_uuid(); 840356e5-ce2a-4d6c-9383-294d620ec32b Time taken: 2.264 seconds, Fetched 1 row(s) // GenericUDF spark-sql (default)> create temporary function hive_sha2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFSha2'; Time taken: 0.023 seconds spark-sql (default)> select hive_sha2('ABC', 256); b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78 Time taken: 0.157 seconds, Fetched 1 row(s) // UDAF spark-sql (default)> create temporary function hive_percentile as 'org.apache.hadoop.hive.ql.udf.UDAFPercentile'; Time taken: 0.032 seconds spark-sql (default)> select hive_percentile(id, 0.5) from range(100); 49.5 Time taken: 0.474 seconds, Fetched 1 row(s) // GenericUDAF spark-sql (default)> create temporary function hive_sum as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum'; Time taken: 0.017 seconds spark-sql (default)> select hive_sum(*) from range(100); 4950 Time taken: 1.25 seconds, Fetched 1 row(s) // GenericUDTF spark-sql (default)> create temporary function hive_replicate_rows as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFReplicateRows'; Time taken: 0.012 seconds spark-sql (default)> select hive_replicate_rows(3L, id) from range(3); 3 0 3 0 3 0 3 1 3 1 3 1 3 2 3 2 3 2 Time taken: 0.19 seconds, Fetched 9 row(s) ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#50232 from pan3793/eliminate-hive-udf-init. Authored-by: Cheng Pan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

github-actions bot added the SQL label Mar 11, 2025

pan3793 changed the title ~~[WIP] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation~~ [SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation Mar 11, 2025

[SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization …

ec30d94

…on Hive UDF evaluation

pan3793 force-pushed the eliminate-hive-udf-init branch from 22fdec8 to ec30d94 Compare March 11, 2025 08:27

pan3793 marked this pull request as ready for review March 11, 2025 08:27

pan3793 commented Mar 11, 2025

View reviewed changes

pan3793 marked this pull request as draft March 11, 2025 12:11

Fix STS

30950d6

github-actions bot added the BUILD label Mar 11, 2025

nit

68c5718

pan3793 marked this pull request as ready for review March 11, 2025 13:19

pan3793 commented Mar 11, 2025

View reviewed changes

dongjoon-hyun reviewed Mar 11, 2025

View reviewed changes

LuciferYang reviewed Mar 12, 2025

View reviewed changes

sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFEvaluators.scala Outdated Show resolved Hide resolved

pan3793 added 2 commits March 12, 2025 14:52

simplify SparkGenericUDAFBridge

9506f99

typo

55feae1

LuciferYang reviewed Mar 12, 2025

View reviewed changes

sql/hive/src/main/java/org/apache/hadoop/hive/ql/exec/HiveFunctionRegistryUtils.java Outdated Show resolved Hide resolved

nit

dd80f9c

LuciferYang approved these changes Mar 12, 2025

View reviewed changes

dongjoon-hyun approved these changes Mar 12, 2025

View reviewed changes

dongjoon-hyun closed this in 4b9b246 Mar 12, 2025

pan3793 mentioned this pull request Mar 13, 2025

[SPARK-51466][SQL][HIVE][4.0] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation #50264

Closed

pan3793 mentioned this pull request Mar 13, 2025

[SPARK-51449][BUILD] Restore hive-llap-common to compile scope #50222

Closed

pan3793 mentioned this pull request Mar 20, 2025

[SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution #40893

Closed

pan3793 deleted the eliminate-hive-udf-init branch July 22, 2025 13:29

[SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation #50232

[SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation #50232

Uh oh!

Conversation

pan3793 commented Mar 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

pan3793 commented Mar 11, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

LuciferYang left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun commented Mar 12, 2025

Uh oh!

pan3793 commented Mar 13, 2025

Uh oh!

yaooqinn commented Mar 13, 2025

Uh oh!

pan3793 commented Mar 13, 2025

Uh oh!

cloud-fan commented Mar 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

pan3793 commented Mar 11, 2025 •

edited

Loading

LuciferYang left a comment •

edited

Loading