[SPARK-18871][SQL] New test cases for IN/NOT IN subquery #16337

kevinyu98 · 2016-12-19T07:00:01Z

What changes were proposed in this pull request?

This PR extends the existing IN/NOT IN subquery test cases coverage, adds more test cases to the IN subquery test suite.

Based on the discussion, we will create subquery/in-subquery sub structure under sql/core/src/test/resources/sql-tests/inputs directory.

This is the high level grouping for IN subquery:

subquery/in-subquery/
subquery/in-subquery/simple-in.sql
subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)
subquery/in-subquery/not-in-group-by.sql
subquery/in-subquery/in-order-by.sql
subquery/in-subquery/in-limit.sql
subquery/in-subquery/in-having.sql
subquery/in-subquery/in-joins.sql
subquery/in-subquery/not-in-joins.sql
subquery/in-subquery/in-set-operations.sql
subquery/in-subquery/in-with-cte.sql
subquery/in-subquery/not-in-with-cte.sql
subquery/in-subquery/in-multiple-columns.sql`

We will deliver it through multiple prs, this is the first pr for the IN subquery, it has

subquery/in-subquery/simple-in.sql
subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)

These are the results from running on DB2.
Modified test file of in-group-by.sql used to run on DB2
Output of the run result on DB2
Modified test file of simple-in.sql used to run on DB2
Output of the run result on DB2

How was this patch tested?

This patch is adding tests.

get latest code from upstream

adding trim characters support

get latest code for pr12646

merge latest code

merge upstream/master

SparkQA · 2016-12-23T07:20:30Z

Test build #70535 has finished for PR 16337 at commit 1c1900a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-12-24T02:38:48Z

sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-group-by.sql

+--(select t3c from t3 group by t3c, t3b having t2b > 6 and t3b > t2b ));
+-- TC 01.13, comment out pending SPARK-18863
+--select * from (select * from t2 where t2a in (select t1a from t1 where t1b = t2b)) t2 where t2a in
+--(select t2a from t2 where t2a = t2a and t2c > 1 group by t2a having t2c > 8);


This test suite is only for positive cases, right? The above two test cases are negative cases. The negative cases can be added with the other negative cases. Right? We can put these two cases in the JIRA-18863.

We will enable these two test cases once the problem is fixed. They remain here to demonstrate the test coverage of IN/NOT IN subquery. At this point, I have only done an initial investigation on SPARK-18863. Once the investigation is confirmed that these two cases have the same root cause as the one in 18863, they will be cross-referenced.

These two test cases will not be part of this test suite, right? So far, no negative cases are added to this test suite.

Sure, so we can add these test cases back when deliver the fixes as you mentioned below? I will put these test case into the JIRA-18863. Thanks.

gatorsmile · 2016-12-24T02:44:58Z

sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/simple-in.sql

+select distinct(t1a), t1b from t1 where t1b not in (select t2b from t2 where t1a < t2a and t2b > 8);
+-- TC 01.10, comment out pending on SPARK-18966
+--select t1a, t1b from t1 where t1c not in (select t2b from t2 where t2a not in
+--                      (select t3a from t3 where t2c = t3c and t2b is NULL));


No need to add the test cases now. We can deliver the cases with the bug fixes.

Please add these two cases into the JIRA. I think that is good enough.

SPARK-18966 has a simplified test case to demonstrate the problem. The complexity of these two test cases makes it difficult for us to walk through the root cause of the problem. We should enable the test cases here once we fix SPARK-18966 to demonstrate the test coverage of NOT IN in with correlated expressions. I can add a remark on SPARK-18966 to enable these test cases as part of its PR. Once this PR is reviewed and merged, I will have the file and the test cases' numbers recorded in SPARK-18966.

Based on my experience, we do not add them until the fix is ready.

What is your recommendation to move this forward? Wouldn't commenting out the failing test cases be sufficient and at the same time, leave a trace that we had thought about these scenarios while we were writing tests for IN subquery?

Sure, will do. Thanks.

Normally, remove all the unnecessary changes. If any issue/bug exists, create a JIRA to track it.

gatorsmile · 2016-12-24T02:46:54Z

sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/simple-in.sql

+DROP TABLE t2;
+DROP TABLE t3;
+USE default;
+DROP DATABASE indb;


Please add one more empty line here.

gatorsmile · 2016-12-24T02:47:01Z

sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-group-by.sql

+DROP TABLE t2;
+DROP TABLE t3;
+USE default;
+DROP DATABASE indb;


Please add one more empty line here.

gatorsmile · 2016-12-24T02:52:02Z

sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-group-by.sql

+t1d in (select t3d from t3 where t1c = t3c group by t3d) group by t1a;
+-- TC 01.18
+select t1a, min(t1b) from t1 where t1c in (select min(t2c) from t2 where t2b = t1b group by t2a having t2a > t1a) or
+t1d in (select t3d from t3 where t1c = t3c group by t3d having t3d = t1d) group by t1a having min(t1b) is NOT NULL;


Readability is not good for such complex queries. Could you re-format all the cases by using the tool like http://www.dpriver.com/pp/sqlformat.htm

gatorsmile · 2016-12-24T03:00:19Z

sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-group-by.sql

+
+-- tables and data types
+
+CREATE DATABASE indb;


You created a database, but you did not use it.

Maybe you do not need it.

The statement CREATE DATABASE/DROP DATABASE is used to isolate all the objects created in this test file to its own database/schema. The purpose is to protect any name collision from other test files leaving objects of the same names without properly dropping them. The use of different database/schema mitigates this problem as the chance of having the same database/schema name in two different test files is low.

I guess what's missing here is the USE indb after the CREATE DATABASE to direct all the objects created after this statement to have a different database/schema. My bad. I thought CREATE DATABASE implicitly switch to a new database/schema.

For each test suite, (e.g., in-group-by.sql), we create a dedicated sesssion. See the code.

Why not creating temporary views? These views will be automatically removed after the session ends.

If there is no explicit restriction, I would like to keep the create database/use database so that the test file is self-contained and can be run in different environments with minimal side effect.

I don't have any preference between real tables or temporary views but variations are good to exercise different code paths. If all the test cases are written homogeneously to certain patterns, it limits the coverage. Again, if there is no explicit rules or guidelines on which particular ways to write test cases, I would like to request to have it kept at this format.

ok, I will remove it.

SparkQA · 2017-01-01T03:33:03Z

Test build #70768 has finished for PR 16337 at commit 7c129d9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-01-01T18:52:56Z

sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/simple-in.sql.out

+-- !query 11 output
+t1a	16
+t1e	10
+t1e	10


The result does not match with the one from DB2, right?

select t1a, t1b from t1 where t1h not in (select t2h from t2 where t2a = t1a) and t1b not in ( (select min(t3b) from t3 where t3d = t1d)) T1A T1B ------------------------------ ------ t1a 16 t1e 10 2 record(s) selected.

My bad. I missed that in my verification of the result sets. @kevinyu98 , please remove the last 2 test cases that have correlated predicates in the subqueries of NOT IN (i.e., TC 01.08 and 01.09).

SparkQA · 2017-01-02T06:36:22Z

Test build #70782 has finished for PR 16337 at commit 895bb35.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-01-03T01:25:26Z

sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/simple-in.sql.out

+  ("t1e", 10S, null, 19L, float(17.0), 25D, 26E2, timestamp '2014-09-04 01:02:00.001', date '2014-09-04'),
+  ("t1d", 10S, null, 12L, float(17.0), 25D, 26E2, timestamp '2015-05-04 01:01:00.000', date '2015-05-04'),
+  ("t1a", 6S, 8, 10L, float(15.0), 20D, 20E2, timestamp '2014-04-04 01:02:00.001', date '2014-04-04'),
+  ("t1e", 10S, null, 19L, float(17.0), 25D, 26E2, timestamp '2014-05-04 01:01:00.000', date '2014-05-04')


The data set in DB2 has one row difference, as shown below:

('t1e', 10, null, 19, 17, 25, 26.00, timestamp(date('2014-05-04')), null /*(date('2014-05-0=4')*/)

The above difference causes some query results do not match.

Sorry, I forgot to mention that I have made two changes in the data, we need to re-run the the db2 verification test.

I remove the "=" at "date("2014-05-0=4")"

I changed the the t1g/t2g/t3g from 26 to 2600(26E2)
thanks for checking.

gatorsmile · 2017-01-03T01:27:29Z

sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/simple-in.sql.out

+-- !query 3 output
+t1b	8	16	19	17.0	25.0	2600	2014-05-04 01:01:00	2014-05-04
+t1c	8	16	19	17.0	25.0	2600	2014-05-04 01:02:00.001	2014-05-05
+t1e	10	NULL	19	17.0	25.0	2600	2014-05-04 01:01:00	2014-05-04


This row does not match due to the input table t1 do not match.

t1e 10 - 19 +1.70000000000000E+001 +2.50000000000000E+001 26. 2014-05-04-00.00.00.000000 -

Sorry about that, I forgot to mention that I the input data has changed. We need to use the new data set to run again in DB2 to get new result.

Yeah, please re-run it and update your PR description with the latest attached files.

Thanks, I have updated the PR description with the test results, the result looks fine.

gatorsmile · 2017-01-05T06:08:26Z

sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-group-by.sql.out

+-- !query 3 output
+t1b	8.0
+t1c	8.0
+t1e	10.0


When comparing with the DB2 results, I found the data type of Avg is different. Just for the other reviewers, the document of DB2 explains the behavior:

The data type of the result is the same as the data type of the argument values, except that: - The result is a large integer if the argument values are small integers. - The result is double-precision floating point if the argument values are single-precision floating point. - The result is DECFLOAT(34) if the argument is DECFLOAT(n).

gatorsmile · 2017-01-05T06:17:05Z

I compared the results and confirmed the results are consistent. LGTM

hvanhovell · 2017-01-05T11:43:06Z

@gatorsmile @nsyca @kevinyu98 can we pull the trigger on this one?

nsyca · 2017-01-05T14:34:55Z

I believe it's ready to do so. Thank you, @gatorsmile, for your relentless effort in reviewing this PR. This PR will serve as a baseline for subsequent PRs we will submit for the remaining test cases on subquery. Hopefully we can get each of them in in a single pass of review.

@hvanhovell, I'd like to bring your attention to a design doc for the first phase of coding I posted in SPARK-18874 for reviewing. Your comments are appreciated so we could have our delivery of the code for the first phase right after we have all the test cases in our collection merged.

gatorsmile · 2017-01-05T23:56:49Z

Will merge it if nobody objects it in the next 24 hours. : )

rxin · 2017-01-06T00:00:20Z

Go for it!

gatorsmile · 2017-01-06T03:01:18Z

Thanks! Merging to master.

## What changes were proposed in this pull request? This PR extends the existing IN/NOT IN subquery test cases coverage, adds more test cases to the IN subquery test suite. Based on the discussion, we will create `subquery/in-subquery` sub structure under `sql/core/src/test/resources/sql-tests/inputs` directory. This is the high level grouping for IN subquery: `subquery/in-subquery/` `subquery/in-subquery/simple-in.sql` `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)` `subquery/in-subquery/not-in-group-by.sql` `subquery/in-subquery/in-order-by.sql` `subquery/in-subquery/in-limit.sql` `subquery/in-subquery/in-having.sql` `subquery/in-subquery/in-joins.sql` `subquery/in-subquery/not-in-joins.sql` `subquery/in-subquery/in-set-operations.sql` `subquery/in-subquery/in-with-cte.sql` `subquery/in-subquery/not-in-with-cte.sql` subquery/in-subquery/in-multiple-columns.sql` We will deliver it through multiple prs, this is the first pr for the IN subquery, it has `subquery/in-subquery/simple-in.sql` `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)` These are the results from running on DB2. [Modified test file of in-group-by.sql used to run on DB2](https://github.com/apache/spark/files/683367/in-group-by.sql.db2.txt) [Output of the run result on DB2](https://github.com/apache/spark/files/683362/in-group-by.sql.db2.out.txt) [Modified test file of simple-in.sql used to run on DB2](https://github.com/apache/spark/files/683378/simple-in.sql.db2.txt) [Output of the run result on DB2](https://github.com/apache/spark/files/683379/simple-in.sql.db2.out.txt) ## How was this patch tested? This patch is adding tests. Author: Kevin Yu <[email protected]> Closes apache#16337 from kevinyu98/spark-18871.

…ll up to Optimizer phase ## What changes were proposed in this pull request? Currently Analyzer as part of ResolveSubquery, pulls up the correlated predicates to its originating SubqueryExpression. The subquery plan is then transformed to remove the correlated predicates after they are moved up to the outer plan. In this PR, the task of pulling up correlated predicates is deferred to Optimizer. This is the initial work that will allow us to support the form of correlated subqueries that we don't support today. The design document from nsyca can be found in the following link : [DesignDoc](https://docs.google.com/document/d/1QDZ8JwU63RwGFS6KVF54Rjj9ZJyK33d49ZWbjFBaIgU/edit#) The brief description of code changes (hopefully to aid with code review) can be be found in the following link: [CodeChanges](https://docs.google.com/document/d/18mqjhL9V1An-tNta7aVE13HkALRZ5GZ24AATA-Vqqf0/edit#) ## How was this patch tested? The test case PRs were submitted earlier using. [16337](#16337) [16759](#16759) [16841](#16841) [16915](#16915) [16798](#16798) [16712](#16712) [16710](#16710) [16760](#16760) [16802](#16802) Author: Dilip Biswal <[email protected]> Closes #16954 from dilipbiswal/SPARK-18874.

kevinyu98 added 30 commits April 20, 2016 11:06

adding testcase

3b44c59

Merge remote-tracking branch 'upstream/master'

18b4a31

Merge remote-tracking branch 'upstream/master'

4f4d1c8

get latest code from upstream

Merge remote-tracking branch 'upstream/master'

f5f0cbe

adding trim characters support

Merge remote-tracking branch 'upstream/master'

d8b2edb

get latest code for pr12646

Merge remote-tracking branch 'upstream/master'

196b6c6

merge latest code

Merge remote-tracking branch 'upstream/master'

f37a01e

merge upstream/master

Merge remote-tracking branch 'upstream/master'

bb5b01f

Merge remote-tracking branch 'upstream/master'

bde5820

Merge remote-tracking branch 'upstream/master'

5f7cd96

Merge remote-tracking branch 'upstream/master'

893a49a

Merge remote-tracking branch 'upstream/master'

4bbe1fd

Merge remote-tracking branch 'upstream/master'

b2dd795

Merge remote-tracking branch 'upstream/master'

8c3e5da

Merge remote-tracking branch 'upstream/master'

a0eaa40

Merge remote-tracking branch 'upstream/master'

d03c940

Merge remote-tracking branch 'upstream/master'

d728d5e

Merge remote-tracking branch 'upstream/master'

ea104dd

Merge remote-tracking branch 'upstream/master'

6ab1215

Merge remote-tracking branch 'upstream/master'

0c56653

Merge remote-tracking branch 'upstream/master'

d7a1874

Merge remote-tracking branch 'upstream/master'

85d3500

Merge remote-tracking branch 'upstream/master'

c056f91

Merge remote-tracking branch 'upstream/master'

0b8189d

Merge remote-tracking branch 'upstream/master'

c2ea31d

Merge remote-tracking branch 'upstream/master'

a2d3056

Merge remote-tracking branch 'upstream/master'

39e5648

Merge remote-tracking branch 'upstream/master'

b9370a3

Merge remote-tracking branch 'upstream/master'

01224a4

Merge remote-tracking branch 'upstream/master'

d05d39a

gatorsmile reviewed Dec 24, 2016

View reviewed changes

address comments

7c129d9

gatorsmile reviewed Jan 1, 2017

View reviewed changes

remove two not in cases

895bb35

gatorsmile reviewed Jan 3, 2017

View reviewed changes

nsyca mentioned this pull request Jan 4, 2017

[SPARK-19017][SQL] NOT IN subquery with more than one column may return incorrect results #16467

Closed

gatorsmile reviewed Jan 5, 2017

View reviewed changes

asfgit closed this in bcc510b Jan 6, 2017

nsyca mentioned this pull request Jan 30, 2017

[SPARK-18873][SQL][TEST] New test cases for scalar subquery (part 1 of 2) - scalar subquery in SELECT clause #16712

Closed

dilipbiswal mentioned this pull request Feb 16, 2017

[SPARK-18874][SQL] First phase: Deferring the correlated predicate pull up to Optimizer phase #16954

Closed

[SPARK-18871][SQL] New test cases for IN/NOT IN subquery #16337

[SPARK-18871][SQL] New test cases for IN/NOT IN subquery #16337

Uh oh!

Conversation

kevinyu98 commented Dec 19, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

SparkQA commented Dec 23, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jan 1, 2017

Uh oh!

gatorsmile Jan 1, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jan 2, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

kevinyu98 commented Dec 19, 2016 •

edited

Loading

gatorsmile Jan 1, 2017 •

edited

Loading

gatorsmile Jan 5, 2017 •

edited

Loading

nsyca commented Jan 5, 2017 •

edited

Loading