-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-18871][SQL] New test cases for IN/NOT IN subquery #16337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
get latest code from upstream
adding trim characters support
get latest code for pr12646
merge latest code
merge upstream/master
|
Test build #70535 has finished for PR 16337 at commit
|
| --(select t3c from t3 group by t3c, t3b having t2b > 6 and t3b > t2b )); | ||
| -- TC 01.13, comment out pending SPARK-18863 | ||
| --select * from (select * from t2 where t2a in (select t1a from t1 where t1b = t2b)) t2 where t2a in | ||
| --(select t2a from t2 where t2a = t2a and t2c > 1 group by t2a having t2c > 8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test suite is only for positive cases, right? The above two test cases are negative cases. The negative cases can be added with the other negative cases. Right? We can put these two cases in the JIRA-18863.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will enable these two test cases once the problem is fixed. They remain here to demonstrate the test coverage of IN/NOT IN subquery. At this point, I have only done an initial investigation on SPARK-18863. Once the investigation is confirmed that these two cases have the same root cause as the one in 18863, they will be cross-referenced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two test cases will not be part of this test suite, right? So far, no negative cases are added to this test suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, so we can add these test cases back when deliver the fixes as you mentioned below? I will put these test case into the JIRA-18863. Thanks.
| select distinct(t1a), t1b from t1 where t1b not in (select t2b from t2 where t1a < t2a and t2b > 8); | ||
| -- TC 01.10, comment out pending on SPARK-18966 | ||
| --select t1a, t1b from t1 where t1c not in (select t2b from t2 where t2a not in | ||
| -- (select t3a from t3 where t2c = t3c and t2b is NULL)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to add the test cases now. We can deliver the cases with the bug fixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add these two cases into the JIRA. I think that is good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SPARK-18966 has a simplified test case to demonstrate the problem. The complexity of these two test cases makes it difficult for us to walk through the root cause of the problem. We should enable the test cases here once we fix SPARK-18966 to demonstrate the test coverage of NOT IN in with correlated expressions. I can add a remark on SPARK-18966 to enable these test cases as part of its PR. Once this PR is reviewed and merged, I will have the file and the test cases' numbers recorded in SPARK-18966.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my experience, we do not add them until the fix is ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is your recommendation to move this forward? Wouldn't commenting out the failing test cases be sufficient and at the same time, leave a trace that we had thought about these scenarios while we were writing tests for IN subquery?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally, remove all the unnecessary changes. If any issue/bug exists, create a JIRA to track it.
| DROP TABLE t2; | ||
| DROP TABLE t3; | ||
| USE default; | ||
| DROP DATABASE indb; No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add one more empty line here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
| DROP TABLE t2; | ||
| DROP TABLE t3; | ||
| USE default; | ||
| DROP DATABASE indb; No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add one more empty line here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
| t1d in (select t3d from t3 where t1c = t3c group by t3d) group by t1a; | ||
| -- TC 01.18 | ||
| select t1a, min(t1b) from t1 where t1c in (select min(t2c) from t2 where t2b = t1b group by t2a having t2a > t1a) or | ||
| t1d in (select t3d from t3 where t1c = t3c group by t3d having t3d = t1d) group by t1a having min(t1b) is NOT NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Readability is not good for such complex queries. Could you re-format all the cases by using the tool like http://www.dpriver.com/pp/sqlformat.htm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure.
|
|
||
| -- tables and data types | ||
|
|
||
| CREATE DATABASE indb; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You created a database, but you did not use it.
Maybe you do not need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The statement CREATE DATABASE/DROP DATABASE is used to isolate all the objects created in this test file to its own database/schema. The purpose is to protect any name collision from other test files leaving objects of the same names without properly dropping them. The use of different database/schema mitigates this problem as the chance of having the same database/schema name in two different test files is low.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess what's missing here is the USE indb after the CREATE DATABASE to direct all the objects created after this statement to have a different database/schema. My bad. I thought CREATE DATABASE implicitly switch to a new database/schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For each test suite, (e.g., in-group-by.sql), we create a dedicated sesssion. See the code.
Why not creating temporary views? These views will be automatically removed after the session ends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is no explicit restriction, I would like to keep the create database/use database so that the test file is self-contained and can be run in different environments with minimal side effect.
I don't have any preference between real tables or temporary views but variations are good to exercise different code paths. If all the test cases are written homogeneously to certain patterns, it limits the coverage. Again, if there is no explicit rules or guidelines on which particular ways to write test cases, I would like to request to have it kept at this format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I will remove it.
|
Test build #70768 has finished for PR 16337 at commit
|
| -- !query 11 output | ||
| t1a 16 | ||
| t1e 10 | ||
| t1e 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result does not match with the one from DB2, right?
select t1a, t1b from t1 where t1h not in (select t2h from t2 where t2a = t1a) and t1b not in ( (select min(t3b) from t3 where t3d = t1d))
T1A T1B
------------------------------ ------
t1a 16
t1e 10
2 record(s) selected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. I missed that in my verification of the result sets. @kevinyu98 , please remove the last 2 test cases that have correlated predicates in the subqueries of NOT IN (i.e., TC 01.08 and 01.09).
|
Test build #70782 has finished for PR 16337 at commit
|
| ("t1e", 10S, null, 19L, float(17.0), 25D, 26E2, timestamp '2014-09-04 01:02:00.001', date '2014-09-04'), | ||
| ("t1d", 10S, null, 12L, float(17.0), 25D, 26E2, timestamp '2015-05-04 01:01:00.000', date '2015-05-04'), | ||
| ("t1a", 6S, 8, 10L, float(15.0), 20D, 20E2, timestamp '2014-04-04 01:02:00.001', date '2014-04-04'), | ||
| ("t1e", 10S, null, 19L, float(17.0), 25D, 26E2, timestamp '2014-05-04 01:01:00.000', date '2014-05-04') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data set in DB2 has one row difference, as shown below:
('t1e', 10, null, 19, 17, 25, 26.00, timestamp(date('2014-05-04')), null /*(date('2014-05-0=4')*/)
The above difference causes some query results do not match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I forgot to mention that I have made two changes in the data, we need to re-run the the db2 verification test.
- I remove the "=" at "date("2014-05-0=4")"
- I changed the the t1g/t2g/t3g from 26 to 2600(26E2)
thanks for checking.
| -- !query 3 output | ||
| t1b 8 16 19 17.0 25.0 2600 2014-05-04 01:01:00 2014-05-04 | ||
| t1c 8 16 19 17.0 25.0 2600 2014-05-04 01:02:00.001 2014-05-05 | ||
| t1e 10 NULL 19 17.0 25.0 2600 2014-05-04 01:01:00 2014-05-04 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This row does not match due to the input table t1 do not match.
t1e 10 - 19 +1.70000000000000E+001 +2.50000000000000E+001 26. 2014-05-04-00.00.00.000000 -
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that, I forgot to mention that I the input data has changed. We need to use the new data set to run again in DB2 to get new result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, please re-run it and update your PR description with the latest attached files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I have updated the PR description with the test results, the result looks fine.
| -- !query 3 output | ||
| t1b 8.0 | ||
| t1c 8.0 | ||
| t1e 10.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When comparing with the DB2 results, I found the data type of Avg is different. Just for the other reviewers, the document of DB2 explains the behavior:
The data type of the result is the same as the data type of the argument values, except that:
- The result is a large integer if the argument values are small integers.
- The result is double-precision floating point if the argument values are single-precision floating point.
- The result is DECFLOAT(34) if the argument is DECFLOAT(n).
|
I compared the results and confirmed the results are consistent. LGTM |
|
@gatorsmile @nsyca @kevinyu98 can we pull the trigger on this one? |
|
I believe it's ready to do so. Thank you, @gatorsmile, for your relentless effort in reviewing this PR. This PR will serve as a baseline for subsequent PRs we will submit for the remaining test cases on subquery. Hopefully we can get each of them in in a single pass of review. @hvanhovell, I'd like to bring your attention to a design doc for the first phase of coding I posted in SPARK-18874 for reviewing. Your comments are appreciated so we could have our delivery of the code for the first phase right after we have all the test cases in our collection merged. |
|
Will merge it if nobody objects it in the next 24 hours. : ) |
|
Go for it! |
|
Thanks! Merging to master. |
## What changes were proposed in this pull request? This PR extends the existing IN/NOT IN subquery test cases coverage, adds more test cases to the IN subquery test suite. Based on the discussion, we will create `subquery/in-subquery` sub structure under `sql/core/src/test/resources/sql-tests/inputs` directory. This is the high level grouping for IN subquery: `subquery/in-subquery/` `subquery/in-subquery/simple-in.sql` `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)` `subquery/in-subquery/not-in-group-by.sql` `subquery/in-subquery/in-order-by.sql` `subquery/in-subquery/in-limit.sql` `subquery/in-subquery/in-having.sql` `subquery/in-subquery/in-joins.sql` `subquery/in-subquery/not-in-joins.sql` `subquery/in-subquery/in-set-operations.sql` `subquery/in-subquery/in-with-cte.sql` `subquery/in-subquery/not-in-with-cte.sql` subquery/in-subquery/in-multiple-columns.sql` We will deliver it through multiple prs, this is the first pr for the IN subquery, it has `subquery/in-subquery/simple-in.sql` `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)` These are the results from running on DB2. [Modified test file of in-group-by.sql used to run on DB2](https://github.com/apache/spark/files/683367/in-group-by.sql.db2.txt) [Output of the run result on DB2](https://github.com/apache/spark/files/683362/in-group-by.sql.db2.out.txt) [Modified test file of simple-in.sql used to run on DB2](https://github.com/apache/spark/files/683378/simple-in.sql.db2.txt) [Output of the run result on DB2](https://github.com/apache/spark/files/683379/simple-in.sql.db2.out.txt) ## How was this patch tested? This patch is adding tests. Author: Kevin Yu <[email protected]> Closes apache#16337 from kevinyu98/spark-18871.
## What changes were proposed in this pull request? This PR extends the existing IN/NOT IN subquery test cases coverage, adds more test cases to the IN subquery test suite. Based on the discussion, we will create `subquery/in-subquery` sub structure under `sql/core/src/test/resources/sql-tests/inputs` directory. This is the high level grouping for IN subquery: `subquery/in-subquery/` `subquery/in-subquery/simple-in.sql` `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)` `subquery/in-subquery/not-in-group-by.sql` `subquery/in-subquery/in-order-by.sql` `subquery/in-subquery/in-limit.sql` `subquery/in-subquery/in-having.sql` `subquery/in-subquery/in-joins.sql` `subquery/in-subquery/not-in-joins.sql` `subquery/in-subquery/in-set-operations.sql` `subquery/in-subquery/in-with-cte.sql` `subquery/in-subquery/not-in-with-cte.sql` subquery/in-subquery/in-multiple-columns.sql` We will deliver it through multiple prs, this is the first pr for the IN subquery, it has `subquery/in-subquery/simple-in.sql` `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)` These are the results from running on DB2. [Modified test file of in-group-by.sql used to run on DB2](https://github.com/apache/spark/files/683367/in-group-by.sql.db2.txt) [Output of the run result on DB2](https://github.com/apache/spark/files/683362/in-group-by.sql.db2.out.txt) [Modified test file of simple-in.sql used to run on DB2](https://github.com/apache/spark/files/683378/simple-in.sql.db2.txt) [Output of the run result on DB2](https://github.com/apache/spark/files/683379/simple-in.sql.db2.out.txt) ## How was this patch tested? This patch is adding tests. Author: Kevin Yu <[email protected]> Closes apache#16337 from kevinyu98/spark-18871.
…ll up to Optimizer phase ## What changes were proposed in this pull request? Currently Analyzer as part of ResolveSubquery, pulls up the correlated predicates to its originating SubqueryExpression. The subquery plan is then transformed to remove the correlated predicates after they are moved up to the outer plan. In this PR, the task of pulling up correlated predicates is deferred to Optimizer. This is the initial work that will allow us to support the form of correlated subqueries that we don't support today. The design document from nsyca can be found in the following link : [DesignDoc](https://docs.google.com/document/d/1QDZ8JwU63RwGFS6KVF54Rjj9ZJyK33d49ZWbjFBaIgU/edit#) The brief description of code changes (hopefully to aid with code review) can be be found in the following link: [CodeChanges](https://docs.google.com/document/d/18mqjhL9V1An-tNta7aVE13HkALRZ5GZ24AATA-Vqqf0/edit#) ## How was this patch tested? The test case PRs were submitted earlier using. [16337](#16337) [16759](#16759) [16841](#16841) [16915](#16915) [16798](#16798) [16712](#16712) [16710](#16710) [16760](#16760) [16802](#16802) Author: Dilip Biswal <[email protected]> Closes #16954 from dilipbiswal/SPARK-18874.
What changes were proposed in this pull request?
This PR extends the existing IN/NOT IN subquery test cases coverage, adds more test cases to the IN subquery test suite.
Based on the discussion, we will create
subquery/in-subquerysub structure undersql/core/src/test/resources/sql-tests/inputsdirectory.This is the high level grouping for IN subquery:
subquery/in-subquery/subquery/in-subquery/simple-in.sqlsubquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)subquery/in-subquery/not-in-group-by.sqlsubquery/in-subquery/in-order-by.sqlsubquery/in-subquery/in-limit.sqlsubquery/in-subquery/in-having.sqlsubquery/in-subquery/in-joins.sqlsubquery/in-subquery/not-in-joins.sqlsubquery/in-subquery/in-set-operations.sqlsubquery/in-subquery/in-with-cte.sqlsubquery/in-subquery/not-in-with-cte.sqlsubquery/in-subquery/in-multiple-columns.sql`
We will deliver it through multiple prs, this is the first pr for the IN subquery, it has
subquery/in-subquery/simple-in.sqlsubquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)These are the results from running on DB2.
Modified test file of in-group-by.sql used to run on DB2
Output of the run result on DB2
Modified test file of simple-in.sql used to run on DB2
Output of the run result on DB2
How was this patch tested?
This patch is adding tests.