Skip to content

Conversation

@gatorsmile
Copy link
Member

@gatorsmile gatorsmile commented Aug 12, 2016

What changes were proposed in this pull request?

#14498 plans to remove Hive Built-in Hash Functions. 10+ test cases are broken because the results are different from the Hive golden answer files. These broken test cases are not Hive specific. Thus, it makes more sense to move them to SQLQueryTestSuite

Based on file-based SQL end-to-end testing framework in SQLQueryTestSuite, this PR is to create test cases for joins in general. We will try to move most Join-specific SQL test cases into the new joins.sql file.

How was this patch tested?

This PR is just for improving test cases.

@gatorsmile
Copy link
Member Author

cc @rxin @petermaxlee @cloud-fan @hvanhovell

Before moving all the test cases of auto joins, I want to confirm whether this is the right direction. Thanks!

@@ -0,0 +1,9 @@
select sum(hash(a.k1,a.v1,a.k2, a.v2))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't guarantee hash consistency across versions so I'm not sure if it is a good idea to use hash here.

@rxin
Copy link
Contributor

rxin commented Aug 12, 2016

What exactly is the data set testing? I'm not a big fan of just pulling a non-trivial (say beyond 5 records) dataset in. It makes it very difficult to understand what exactly the test case is testing, and what the correct output should be.

@gatorsmile
Copy link
Member Author

The original purpose of Hive test sets is for verifying the query results and behaviors of Auto Join Conversion

This is not applicable to Spark SQL. I assume the current purposes in Spark are just for checking whether the query results match the outputs of Hive.

Do you want me to completely remove them? Or reduce the data set to around 5 records and keep the queries untouched?

@rxin
Copy link
Contributor

rxin commented Aug 12, 2016

I think we should create a comprehensive test suite for joins, and that should use a small dataset.

The Hive thing I'm OK with keeping it for a while, but eventually we should remove them since we are not even sure what they are testing ... I don't remember a case in which these join tests in Hive caught an issue that was not caught by our regular join suites.

@SparkQA
Copy link

SparkQA commented Aug 12, 2016

Test build #63703 has finished for PR 14625 at commit 9888f8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

gatorsmile commented Aug 12, 2016

I completely agree. Now, if we still want to temporarily keep these test cases, what should we do next? Based on my understanding, we want to get rid of TestHiveSessionState, TestHiveQueryExecution, TestHiveFunctionRegistry, ...

To remove these TestHive-specific classes, should we just use the existing SQLQuerySuite.scala and create a separate folder for these test cases and data sets? When it is ready, we can easily remove all of them in the future. Thanks!

@cloud-fan
Copy link
Contributor

cloud-fan commented Aug 13, 2016

I think HiveCompatibilitySuite has 2 purposes:

  1. using Hive's test to improve our test coverage
  2. make sure our result is same with Hive for same query(with blacklist)

Now we have SQLQueryTestSuite, we should use it to improve test coverage instead of HiveCompatibilitySuite. But this doesn't mean we should copy the test from HiveCompatibilitySuite to SQLQueryTestSuite, a test with hundreds of input records doesn't make sense.

HiveCompatibilitySuite will be simplified a lot and only contains tests just for compatibility checking.

@gatorsmile
Copy link
Member Author

@cloud-fan I see. Will try to rewrite the test cases with very few records. Thank you!

@SparkQA
Copy link

SparkQA commented Aug 13, 2016

Test build #63716 has finished for PR 14625 at commit 239191c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 13, 2016

Test build #63717 has finished for PR 14625 at commit 5f22316.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 13, 2016

Test build #63730 has finished for PR 14625 at commit 3fe55f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Aug 13, 2016

@gatorsmile the comment should apply not only to data, but also query (e.g. what case we are testing ...)

@gatorsmile
Copy link
Member Author

@rxin Sure, will do it.

@gatorsmile
Copy link
Member Author

gatorsmile commented Aug 13, 2016

FYI, found the original JIRA that delivered the first 25 auto_join test cases to Hive: https://issues.apache.org/jira/browse/HIVE-1642

@gatorsmile
Copy link
Member Author

Below is the output of Hive for the same queries. They are the same.

outputHive.txt

@rxin
Copy link
Contributor

rxin commented Aug 13, 2016

Can we repurpose this ticket to just create test cases for joins in general?

@gatorsmile
Copy link
Member Author

Sure. The scope is a little bit large, but let me try to go over the existing join-related test cases in the test suites. We might not be able to cover all of them in a single ticket. Will try my best.

@SparkQA
Copy link

SparkQA commented Aug 14, 2016

Test build #63737 has finished for PR 14625 at commit cdea1a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile gatorsmile changed the title [SPARK-17045] [SQL] Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite [SPARK-17045] [SQL] Build/move Join-related test cases in SQLQueryTestSuite Aug 14, 2016
@SparkQA
Copy link

SparkQA commented Aug 18, 2016

Test build #63945 has finished for PR 14625 at commit 7c6f85a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile gatorsmile changed the title [SPARK-17045] [SQL] Build/move Join-related test cases in SQLQueryTestSuite [WIP] [SPARK-17045] [SQL] Build/move Join-related test cases in SQLQueryTestSuite Aug 18, 2016
@SparkQA
Copy link

SparkQA commented Aug 18, 2016

Test build #63947 has finished for PR 14625 at commit 4bd38d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 18, 2016

Test build #63950 has finished for PR 14625 at commit a059c77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 18, 2016

Test build #63957 has finished for PR 14625 at commit b4801e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

@rxin @cloud-fan The code is ready for review. Thanks!

@SparkQA
Copy link

SparkQA commented Aug 20, 2016

Test build #64120 has finished for PR 14625 at commit bf55624.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -0,0 +1,225 @@
-- join nested table expressions (auto_join0.q)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to reference to the hive .q file? I think hive golden file tests will be removed eventually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

: ) That is for helping reviewers know the origins of the queries. If you think we do not care, we can remove it.

@SparkQA
Copy link

SparkQA commented Aug 22, 2016

Test build #64174 has finished for PR 14625 at commit 9c6be8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 22, 2016

Test build #64181 has finished for PR 14625 at commit e2677da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Aug 22, 2016

Test build #64226 has finished for PR 14625 at commit e2677da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

hmmm, can we split this PR into multiple PRs? We are not copying tests from HiveCompatibilitySuite directly and it takes time to review each test and its result.

@gatorsmile
Copy link
Member Author

Sure, will split it to multiple PRs. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants