Skip to content

Conversation

@HeartSaVioR
Copy link
Contributor

What changes were proposed in this pull request?

This patch proposes to cache Class instance for the UDF instance in HiveFunctionWrapper to fix the case where Hive simple UDF is somehow transformed (expression is copied) and evaluated later with another classloader (for the case current thread context classloader is somehow changed). In this case, Spark throws CNFE as of now.

It's only occurred for Hive simple UDF, as HiveFunctionWrapper caches the UDF instance whereas it doesn't do for UDF type. The comment says Spark has to create instance every time for UDF, so we cannot simply do the same. This patch caches Class instance instead, and switch current thread context classloader to which loads the Class instance.

This patch extends the test boundary as well. We only tested with GenericUDTF for SPARK-26560, and this patch actually requires only UDF. But to avoid regression for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) into the boundary of tests.

Credit to @cloud-fan as he discovered the problem and proposed the solution.

Why are the changes needed?

Above section describes why it's a bug and how it's fixed.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UTs added.

}
}

test("SPARK-26560 Spark should be able to run Hive UDF using jar regardless of " +
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is moved to HiveUDFDynamicLoadSuite - now it's being tested with 5 available Hive UDF types.

clazz = Utils.getContextOrSparkClassLoader.loadClass(functionClassName)
.asInstanceOf[Class[_ <: AnyRef]]
}
val func = clazz.getConstructor().newInstance().asInstanceOf[UDFType]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add clazz = null below this line, the new UT (SPARK-31312) fails with UDF type (only one of 5 fails, because other cases this cases instance instead).

@HeartSaVioR
Copy link
Contributor Author

cc. @cloud-fan @maropu

val jarUrl = getHiveUDFTestJarUrl
test("SPARK-26560 Spark should be able to run Hive UDF using jar regardless of " +
s"current thread context classloader (${udfInfo.identifier}") {
testHiveUDFUsingJarWithChangingClassloader(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if the method is only called once, can we inline it?


test("SPARK-31312 Transformed Hive UDF using jar expression should not be failed to run " +
s"regardless of current thread context classloader (${udfInfo.identifier})") {
testHiveUDFUsingJarWithChangingClassloaderWithCopyUDFExpression(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

assert(Thread.currentThread().getContextClassLoader eq
spark.sqlContext.sharedState.jarClassLoader)

val udfExpr = fnCreateHiveUDFExpression()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test should start here. The above test is the same as testHiveUDFUsingJarWithChangingClassloader.

@HeartSaVioR
Copy link
Contributor Author

Thanks for the review comments. I've just consolidated two tests into one, and inlined. Please take a look again.

@SparkQA
Copy link

SparkQA commented Mar 31, 2020

Test build #120634 has finished for PR 28079 at commit 5d8df3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

)

udfTestInfos.foreach { udfInfo =>
val jarUrl = getHiveUDFTestJarUrl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can inline it as well

@SparkQA
Copy link

SparkQA commented Mar 31, 2020

Test build #120641 has finished for PR 28079 at commit 7d34816.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 31, 2020

Test build #120642 has finished for PR 28079 at commit 5159423.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in 2a6aa8e Mar 31, 2020
cloud-fan pushed a commit that referenced this pull request Mar 31, 2020
…unctionWrapper

### What changes were proposed in this pull request?

This patch proposes to cache Class instance for the UDF instance in HiveFunctionWrapper to fix the case where Hive simple UDF is somehow transformed (expression is copied) and evaluated later with another classloader (for the case current thread context classloader is somehow changed). In this case, Spark throws CNFE as of now.

It's only occurred for Hive simple UDF, as HiveFunctionWrapper caches the UDF instance whereas it doesn't do for `UDF` type. The comment says Spark has to create instance every time for UDF, so we cannot simply do the same. This patch caches Class instance instead, and switch current thread context classloader to which loads the Class instance.

This patch extends the test boundary as well. We only tested with GenericUDTF for SPARK-26560, and this patch actually requires only UDF. But to avoid regression for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) into the boundary of tests.

Credit to cloud-fan as he discovered the problem and proposed the solution.

### Why are the changes needed?

Above section describes why it's a bug and how it's fixed.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UTs added.

Closes #28079 from HeartSaVioR/SPARK-31312.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2a6aa8e)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Copy link
Contributor

thanks, merging to master/3.0!

@cloud-fan
Copy link
Contributor

@HeartSaVioR can you send a backport PR for 2.4? thanks!

HeartSaVioR added a commit to HeartSaVioR/spark that referenced this pull request Mar 31, 2020
…unctionWrapper

This patch proposes to cache Class instance for the UDF instance in HiveFunctionWrapper to fix the case where Hive simple UDF is somehow transformed (expression is copied) and evaluated later with another classloader (for the case current thread context classloader is somehow changed). In this case, Spark throws CNFE as of now.

It's only occurred for Hive simple UDF, as HiveFunctionWrapper caches the UDF instance whereas it doesn't do for `UDF` type. The comment says Spark has to create instance every time for UDF, so we cannot simply do the same. This patch caches Class instance instead, and switch current thread context classloader to which loads the Class instance.

This patch extends the test boundary as well. We only tested with GenericUDTF for SPARK-26560, and this patch actually requires only UDF. But to avoid regression for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) into the boundary of tests.

Credit to cloud-fan as he discovered the problem and proposed the solution.

Above section describes why it's a bug and how it's fixed.

No.

New UTs added.

Closes apache#28079 from HeartSaVioR/SPARK-31312.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@HeartSaVioR
Copy link
Contributor Author

Thanks for the quick review and merge! #28086 is for branch-2.4.

@HeartSaVioR HeartSaVioR deleted the SPARK-31312 branch March 31, 2020 23:41
@maropu
Copy link
Member

maropu commented Apr 1, 2020

late LGTM, thanks for the work, @HeartSaVioR !

@HyukjinKwon
Copy link
Member

(very late) LGTM!

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…unctionWrapper

### What changes were proposed in this pull request?

This patch proposes to cache Class instance for the UDF instance in HiveFunctionWrapper to fix the case where Hive simple UDF is somehow transformed (expression is copied) and evaluated later with another classloader (for the case current thread context classloader is somehow changed). In this case, Spark throws CNFE as of now.

It's only occurred for Hive simple UDF, as HiveFunctionWrapper caches the UDF instance whereas it doesn't do for `UDF` type. The comment says Spark has to create instance every time for UDF, so we cannot simply do the same. This patch caches Class instance instead, and switch current thread context classloader to which loads the Class instance.

This patch extends the test boundary as well. We only tested with GenericUDTF for SPARK-26560, and this patch actually requires only UDF. But to avoid regression for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) into the boundary of tests.

Credit to cloud-fan as he discovered the problem and proposed the solution.

### Why are the changes needed?

Above section describes why it's a bug and how it's fixed.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UTs added.

Closes apache#28079 from HeartSaVioR/SPARK-31312.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants