Skip to content

Conversation

@HeartSaVioR
Copy link
Contributor

What changes were proposed in this pull request?

This patch proposes to cache Class instance for the UDF instance in HiveFunctionWrapper to fix the case where Hive simple UDF is somehow transformed (expression is copied) and evaluated later with another classloader (for the case current thread context classloader is somehow changed). In this case, Spark throws CNFE as of now.

It's only occurred for Hive simple UDF, as HiveFunctionWrapper caches the UDF instance whereas it doesn't do for UDF type. The comment says Spark has to create instance every time for UDF, so we cannot simply do the same. This patch caches Class instance instead, and switch current thread context classloader to which loads the Class instance.

This patch extends the test boundary as well. We only tested with GenericUDTF for SPARK-26560, and this patch actually requires only UDF. But to avoid regression for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) into the boundary of tests.

Credit to @cloud-fan as he discovered the problem and proposed the solution.

Why are the changes needed?

Above section describes why it's a bug and how it's fixed.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UTs added.

…unctionWrapper

This patch proposes to cache Class instance for the UDF instance in HiveFunctionWrapper to fix the case where Hive simple UDF is somehow transformed (expression is copied) and evaluated later with another classloader (for the case current thread context classloader is somehow changed). In this case, Spark throws CNFE as of now.

It's only occurred for Hive simple UDF, as HiveFunctionWrapper caches the UDF instance whereas it doesn't do for `UDF` type. The comment says Spark has to create instance every time for UDF, so we cannot simply do the same. This patch caches Class instance instead, and switch current thread context classloader to which loads the Class instance.

This patch extends the test boundary as well. We only tested with GenericUDTF for SPARK-26560, and this patch actually requires only UDF. But to avoid regression for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) into the boundary of tests.

Credit to cloud-fan as he discovered the problem and proposed the solution.

Above section describes why it's a bug and how it's fixed.

No.

New UTs added.

Closes apache#28079 from HeartSaVioR/SPARK-31312.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@SparkQA
Copy link

SparkQA commented Mar 31, 2020

Test build #120655 has finished for PR 28086 at commit b515c69.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor Author

HeartSaVioR commented Apr 1, 2020

To reduce the code divergence I just enabled language feature (as compiler guided on error message). There seems to be the way we don't enable it (simply use AnyRef instead of wildcard) so that's up to our preference.

@SparkQA
Copy link

SparkQA commented Apr 1, 2020

Test build #120656 has finished for PR 28086 at commit eb42ff1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan pushed a commit that referenced this pull request Apr 1, 2020
…HiveFunctionWrapper

### What changes were proposed in this pull request?

This patch proposes to cache Class instance for the UDF instance in HiveFunctionWrapper to fix the case where Hive simple UDF is somehow transformed (expression is copied) and evaluated later with another classloader (for the case current thread context classloader is somehow changed). In this case, Spark throws CNFE as of now.

It's only occurred for Hive simple UDF, as HiveFunctionWrapper caches the UDF instance whereas it doesn't do for `UDF` type. The comment says Spark has to create instance every time for UDF, so we cannot simply do the same. This patch caches Class instance instead, and switch current thread context classloader to which loads the Class instance.

This patch extends the test boundary as well. We only tested with GenericUDTF for SPARK-26560, and this patch actually requires only UDF. But to avoid regression for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) into the boundary of tests.

Credit to cloud-fan as he discovered the problem and proposed the solution.

### Why are the changes needed?

Above section describes why it's a bug and how it's fixed.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UTs added.

Closes #28086 from HeartSaVioR/SPARK-31312-branch-2.4.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Copy link
Contributor

thanks, merging to 2.4!

@cloud-fan cloud-fan closed this Apr 1, 2020
@HeartSaVioR
Copy link
Contributor Author

Thanks for the quick review and merge!

@HeartSaVioR HeartSaVioR deleted the SPARK-31312-branch-2.4 branch April 1, 2020 03:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants