@gatorsmile
Member

gatorsmile commented Aug 4, 2016

What changes were proposed in this pull request?

Since Spark 2.0, Spark has not used Hive's built-in hash function. The public interface does not allow users to unregister Spark's built-in functions, so users can never reach Hive's built-in hash function.

The only exception is TestHiveFunctionRegistry, which allows users to unregister the built-in functions, so the test cases can load Hive's hash function. If we disable it, 10+ test cases will fail because their results differ from the Hive golden answer files.

This PR removes hash from the list of hiveFunctions in HiveSessionCatalog. It also removes TestHiveFunctionRegistry, which makes it easier to remove TestHiveSessionState in the future.
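
The shadowing described above can be illustrated with a toy model. This is only a sketch in Python, not Spark's Scala code: the class names, `resolve`, and the stand-in hash functions are hypothetical, and the lookup order (Spark built-ins first, the Hive list only on a miss) is an assumption inferred from the description.

```python
class SimpleRegistry:
    """Toy name -> function registry (hypothetical, not Spark's API)."""

    def __init__(self):
        self._functions = {}

    def register(self, name, fn):
        self._functions[name.lower()] = fn

    def lookup(self, name):
        # Returns None when the name is not registered.
        return self._functions.get(name.lower())


class UnregisterableRegistry(SimpleRegistry):
    """Test-only variant that can also drop a registered function."""

    def unregister(self, name):
        self._functions.pop(name.lower(), None)


def resolve(name, builtin_registry, hive_functions):
    """Assumed order: built-ins win; the Hive list is a fallback."""
    fn = builtin_registry.lookup(name)
    if fn is not None:
        return fn
    return hive_functions.get(name.lower())


# Stand-ins that only report which engine's function was chosen.
def spark_hash(*args):
    return "spark"


def hive_hash(*args):
    return "hive"


registry = UnregisterableRegistry()
registry.register("hash", spark_hash)
hive_fallback = {"hash": hive_hash}

chosen_before = resolve("hash", registry, hive_fallback)
registry.unregister("hash")  # only possible with the test registry
chosen_after = resolve("hash", registry, hive_fallback)
```

In this model the Hive entry is reachable only after `unregister`, which mirrors why the `hash` entry in the Hive list is dead code once the test-only registry is gone.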

How was this patch tested?

N/A

@SparkQA

SparkQA commented Aug 4, 2016

Test build #63240 has finished for PR 14498 at commit 3b3f3c8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Aug 5, 2016

Test build #63247 has finished for PR 14498 at commit 3b3f3c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

This seems fair. Shouldn't we do something about all the disabled test cases though?

@gatorsmile
Member Author

gatorsmile commented Aug 5, 2016

Yeah, we should check all the disabled test cases and see if we can move them. I am waiting for #14472.

: )

@gatorsmile
Member Author

#14472 has been merged. Will work on it soon.

@tejasapatil
Contributor

Is Spark's hashing function semantically equivalent to Hive's? AFAIK, it's not. I think it would be better to have a mode that can use Hive's hash method. An example of when this would be needed: users running a query in Hive want to switch to Spark. As this happens, you want to verify whether the data produced is the same. Also, for a brief time the pipeline would run in both engines. Upstream consumers of the generated data should not see differences caused by running in different engines.
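
The verification step mentioned here could look roughly like the following sketch, which compares two result sets as multisets so that row order does not matter. `diff_rows` and the sample rows are hypothetical, and the hash values shown are placeholders, not real outputs of either engine.

```python
from collections import Counter


def diff_rows(hive_rows, spark_rows):
    """Return (rows only in Hive output, rows only in Spark output).

    Rows are compared as multisets, so ordering is ignored but
    duplicate counts are respected.
    """
    hive_counts = Counter(tuple(row) for row in hive_rows)
    spark_counts = Counter(tuple(row) for row in spark_rows)
    return hive_counts - spark_counts, spark_counts - hive_counts


# Placeholder data: (key, hash(key)) pairs as each engine might emit them.
hive_output = [("a", 111), ("b", 222)]
spark_output = [("b", 999), ("a", 888)]

only_hive, only_spark = diff_rows(hive_output, spark_output)
outputs_match = not only_hive and not only_spark
```

If a query materializes `hash(...)` values and the two engines hash differently, every row shows up on both sides of the diff even though the underlying keys agree.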

@gatorsmile
Member Author

@tejasapatil Thanks for the input! I understand your use case, and it looks very reasonable.

Actually, we already removed it in Spark 2.0: you cannot load or invoke Hive's hash function. This PR just cleans up the use of the hash function in our testing package.

@rxin
Contributor

rxin commented Nov 3, 2016

@gatorsmile can we take this to the finish line? This actually does not impact hash at all since it's only here for testing.

@gatorsmile
Member Author

@rxin Sure, will do it this weekend. Thanks!

}


private[hive] class TestHiveFunctionRegistry extends SimpleFunctionRegistry {
Contributor

We can still remove this class if we manually add back the removed Spark built-in hash function, right?

Member Author

Yeah, I think so

@rxin
Contributor

rxin commented Nov 7, 2016

If this is no longer WIP, please update the title. Thanks.

gatorsmile changed the title from "[SPARK-16904] [SQL] Removal of Hive Built-in Hash Functions and TestHiveFunctionRegistry [WIP]" to "[SPARK-16904] [SQL] Removal of Hive Built-in Hash Functions and TestHiveFunctionRegistry" Nov 7, 2016
@gatorsmile
Member Author

gatorsmile commented Nov 7, 2016

Based on the previous discussions in the other PRs, it sounds like these Hive-specific test cases are not very useful. Do we still need to rewrite them and add them to the other test suites?

    "auto_join19",
    "auto_join22",
    "auto_join25",
    "auto_join26",
    "auto_join27",
    "auto_join28",
    "auto_join30",
    "auto_join31",
    "auto_join_nulls",
    "auto_join_reordering_values",
    "correlationoptimizer1",
    "correlationoptimizer2",
    "correlationoptimizer3",
    "correlationoptimizer4",
    "multiMapJoin1",
    "orc_dictionary_threshold",
    "udf_hash"

BTW, I went over these disabled test cases. I think their value is low.

@rxin
Contributor

rxin commented Nov 7, 2016

I'm ok with not having them.

@SparkQA

SparkQA commented Nov 7, 2016

Test build #68268 has finished for PR 14498 at commit 05390ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 7, 2016

Merging in master/branch-2.1.

@asfgit asfgit closed this in 57626a5 Nov 7, 2016
asfgit pushed a commit that referenced this pull request Nov 7, 2016
…veFunctionRegistry

### What changes were proposed in this pull request?

Since Spark 2.0, Spark has not used Hive's built-in `hash` function. The public interface does not allow users to unregister Spark's built-in functions, so users can never reach Hive's built-in `hash` function.

The only exception is `TestHiveFunctionRegistry`, which allows users to unregister the built-in functions, so the test cases can load Hive's hash function. If we disable it, 10+ test cases will fail because their results differ from the Hive golden answer files.

This PR removes `hash` from the list of `hiveFunctions` in `HiveSessionCatalog`. It also removes `TestHiveFunctionRegistry`, which makes it easier to remove `TestHiveSessionState` in the future.

### How was this patch tested?

N/A

Author: gatorsmile <[email protected]>

Closes #14498 from gatorsmile/removeHash.

(cherry picked from commit 57626a5)
Signed-off-by: Reynold Xin <[email protected]>
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017