[SPARK-16904] [SQL] Removal of Hive Built-in Hash Functions and TestHiveFunctionRegistry #14498
|
Test build #63240 has finished for PR 14498 at commit
|
|
retest this please |
|
Test build #63247 has finished for PR 14498 at commit
|
|
This seems fair. Shouldn't we do something about all the disabled test cases though? |
|
Yeah, we should check all the disabled test cases and see if we can move them. I am waiting for #14472. : ) |
|
#14472 has been merged. Will work on it soon. |
|
Is Spark's hashing function semantically equivalent to Hive's? AFAIK, it's not. I think it would be better to have a mode that allows using Hive's hash method. An example of when this would be needed: users running a query in Hive want to switch to Spark. As this happens, you want to verify whether the data produced is the same. Also, for a brief time the pipeline would run in both engines. Upstream consumers of the generated data should not see differences caused by running in different engines. |
|
@tejasapatil Thanks for the input! I can understand your use cases. They look very reasonable. Actually, we already removed it in Spark 2.0. You are unable to load or invoke Hive's hash function. This PR is just to clean the usage of |
|
@gatorsmile can we take this to the finish line? This actually does not impact hash at all since it's only here for testing. |
|
@rxin Sure, will do it this weekend. Thanks! |
private[hive] class TestHiveFunctionRegistry extends SimpleFunctionRegistry {
We can still remove this class if we manually add back the removed Spark built-in hash function, right?
Yeah, I think so
|
If this is no longer WIP, please update the title. Thanks. |
|
Based on the previous discussions in the other PRs, it sounds like these Hive-specific test cases are not very useful. Do we still need to rewrite them and add them to the other test suites? BTW, I went over these disabled test cases. I think their value is low. |
|
I'm ok with not having them. |
|
Test build #68268 has finished for PR 14498 at commit
|
|
Merging in master/branch-2.1. |
…veFunctionRegistry

### What changes were proposed in this pull request?

Currently, the Hive built-in `hash` function is not being used in Spark since Spark 2.0. The public interface does not allow users to unregister the Spark built-in functions. Thus, users will never use Hive's built-in `hash` function.

The only exception here is `TestHiveFunctionRegistry`, which allows users to unregister the built-in functions. Thus, we can load Hive's hash function in the test cases. If we disable it, 10+ test cases will fail because the results are different from the Hive golden answer files.

This PR is to remove `hash` from the list of `hiveFunctions` in `HiveSessionCatalog`. It will also remove `TestHiveFunctionRegistry`. This removal makes us easier to remove `TestHiveSessionState` in the future.

### How was this patch tested?

N/A

Author: gatorsmile <[email protected]>

Closes #14498 from gatorsmile/removeHash.

(cherry picked from commit 57626a5)
Signed-off-by: Reynold Xin <[email protected]>
What changes were proposed in this pull request?

The Hive built-in `hash` function has not been used in Spark since Spark 2.0. The public interface does not allow users to unregister the Spark built-in functions. Thus, users will never use Hive's built-in `hash` function.

The only exception here is `TestHiveFunctionRegistry`, which allows users to unregister the built-in functions. Thus, we can load Hive's hash function in the test cases. If we disable it, 10+ test cases will fail because the results are different from the Hive golden answer files.

This PR is to remove `hash` from the list of `hiveFunctions` in `HiveSessionCatalog`. It will also remove `TestHiveFunctionRegistry`. This removal makes it easier for us to remove `TestHiveSessionState` in the future.

How was this patch tested?

N/A
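
For readers unfamiliar with the registry behavior described above, here is a minimal, self-contained sketch of the idea: a base registry that only lets callers register and look up functions, and a test-only subclass that can also unregister a built-in and restore it later. This is not Spark's code. The names `RegistrySketch`, `SimpleRegistry`, `TestRegistry`, and the `Builder` type are hypothetical stand-ins chosen for illustration, and the real `SimpleFunctionRegistry` API may differ.

```scala
import scala.collection.mutable

object RegistrySketch {
  // Hypothetical stand-in for a function builder: argument values in, result out.
  type Builder = Seq[Any] => Any

  // Base registry: functions can be registered and looked up, never removed.
  class SimpleRegistry {
    protected val builders = mutable.Map.empty[String, Builder]

    def register(name: String, builder: Builder): Unit = synchronized {
      builders(name.toLowerCase) = builder
    }

    def lookup(name: String): Option[Builder] = synchronized {
      builders.get(name.toLowerCase)
    }
  }

  // Test-only registry: additionally allows removing a function and
  // putting every removed function back afterwards.
  class TestRegistry extends SimpleRegistry {
    private val removed = mutable.ArrayBuffer.empty[(String, Builder)]

    def unregister(name: String): Unit = synchronized {
      val key = name.toLowerCase
      // Remember the removed builder so restore() can re-register it.
      builders.remove(key).foreach(b => removed.append((key, b)))
    }

    def restore(): Unit = synchronized {
      removed.foreach { case (key, b) => register(key, b) }
      removed.clear()
    }
  }
}
```

With a sketch like this, a test could call `unregister("hash")` so that a lookup for `hash` no longer finds the built-in, and later call `restore()` to put it back. Removing the test-only subclass, as this PR does, leaves only the base behavior, where built-ins cannot be unregistered.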