Skip to content

Conversation

@kevinyu98
Copy link
Contributor

@kevinyu98 kevinyu98 commented Dec 19, 2016

What changes were proposed in this pull request?

This PR extends the existing IN/NOT IN subquery test cases coverage, adds more test cases to the IN subquery test suite.

Based on the discussion, we will create subquery/in-subquery sub structure under sql/core/src/test/resources/sql-tests/inputs directory.

This is the high level grouping for IN subquery:

subquery/in-subquery/
subquery/in-subquery/simple-in.sql
subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)
subquery/in-subquery/not-in-group-by.sql
subquery/in-subquery/in-order-by.sql
subquery/in-subquery/in-limit.sql
subquery/in-subquery/in-having.sql
subquery/in-subquery/in-joins.sql
subquery/in-subquery/not-in-joins.sql
subquery/in-subquery/in-set-operations.sql
subquery/in-subquery/in-with-cte.sql
subquery/in-subquery/not-in-with-cte.sql
subquery/in-subquery/in-multiple-columns.sql`

We will deliver it through multiple prs, this is the first pr for the IN subquery, it has

subquery/in-subquery/simple-in.sql
subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)

These are the results from running on DB2.
Modified test file of in-group-by.sql used to run on DB2
Output of the run result on DB2
Modified test file of simple-in.sql used to run on DB2
Output of the run result on DB2

How was this patch tested?

This patch is adding tests.

get latest code from upstream
adding trim characters support
@SparkQA
Copy link

SparkQA commented Dec 23, 2016

Test build #70535 has finished for PR 16337 at commit 1c1900a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

--(select t3c from t3 group by t3c, t3b having t2b > 6 and t3b > t2b ));
-- TC 01.13, comment out pending SPARK-18863
--select * from (select * from t2 where t2a in (select t1a from t1 where t1b = t2b)) t2 where t2a in
--(select t2a from t2 where t2a = t2a and t2c > 1 group by t2a having t2c > 8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test suite is only for positive cases, right? The above two test cases are negative cases. The negative cases can be added with the other negative cases. Right? We can put these two cases in the JIRA-18863.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will enable these two test cases once the problem is fixed. They remain here to demonstrate the test coverage of IN/NOT IN subquery. At this point, I have only done an initial investigation on SPARK-18863. Once the investigation is confirmed that these two cases have the same root cause as the one in 18863, they will be cross-referenced.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two test cases will not be part of this test suite, right? So far, no negative cases are added to this test suite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, so we can add these test cases back when deliver the fixes as you mentioned below? I will put these test case into the JIRA-18863. Thanks.

select distinct(t1a), t1b from t1 where t1b not in (select t2b from t2 where t1a < t2a and t2b > 8);
-- TC 01.10, comment out pending on SPARK-18966
--select t1a, t1b from t1 where t1c not in (select t2b from t2 where t2a not in
-- (select t3a from t3 where t2c = t3c and t2b is NULL));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add the test cases now. We can deliver the cases with the bug fixes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add these two cases into the JIRA. I think that is good enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SPARK-18966 has a simplified test case to demonstrate the problem. The complexity of these two test cases makes it difficult for us to walk through the root cause of the problem. We should enable the test cases here once we fix SPARK-18966 to demonstrate the test coverage of NOT IN in with correlated expressions. I can add a remark on SPARK-18966 to enable these test cases as part of its PR. Once this PR is reviewed and merged, I will have the file and the test cases' numbers recorded in SPARK-18966.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my experience, we do not add them until the fix is ready.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is your recommendation to move this forward? Wouldn't commenting out the failing test cases be sufficient and at the same time, leave a trace that we had thought about these scenarios while we were writing tests for IN subquery?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do. Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally, remove all the unnecessary changes. If any issue/bug exists, create a JIRA to track it.

DROP TABLE t2;
DROP TABLE t3;
USE default;
DROP DATABASE indb; No newline at end of file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add one more empty line here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

DROP TABLE t2;
DROP TABLE t3;
USE default;
DROP DATABASE indb; No newline at end of file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add one more empty line here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

t1d in (select t3d from t3 where t1c = t3c group by t3d) group by t1a;
-- TC 01.18
select t1a, min(t1b) from t1 where t1c in (select min(t2c) from t2 where t2b = t1b group by t2a having t2a > t1a) or
t1d in (select t3d from t3 where t1c = t3c group by t3d having t3d = t1d) group by t1a having min(t1b) is NOT NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Readability is not good for such complex queries. Could you re-format all the cases by using the tool like http://www.dpriver.com/pp/sqlformat.htm

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.


-- tables and data types

CREATE DATABASE indb;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You created a database, but you did not use it.

Maybe you do not need it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The statement CREATE DATABASE/DROP DATABASE is used to isolate all the objects created in this test file to its own database/schema. The purpose is to protect any name collision from other test files leaving objects of the same names without properly dropping them. The use of different database/schema mitigates this problem as the chance of having the same database/schema name in two different test files is low.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess what's missing here is the USE indb after the CREATE DATABASE to direct all the objects created after this statement to have a different database/schema. My bad. I thought CREATE DATABASE implicitly switch to a new database/schema.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For each test suite, (e.g., in-group-by.sql), we create a dedicated sesssion. See the code.

Why not creating temporary views? These views will be automatically removed after the session ends.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no explicit restriction, I would like to keep the create database/use database so that the test file is self-contained and can be run in different environments with minimal side effect.

I don't have any preference between real tables or temporary views but variations are good to exercise different code paths. If all the test cases are written homogeneously to certain patterns, it limits the coverage. Again, if there is no explicit rules or guidelines on which particular ways to write test cases, I would like to request to have it kept at this format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will remove it.

@SparkQA
Copy link

SparkQA commented Jan 1, 2017

Test build #70768 has finished for PR 16337 at commit 7c129d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-- !query 11 output
t1a 16
t1e 10
t1e 10
Copy link
Member

@gatorsmile gatorsmile Jan 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result does not match with the one from DB2, right?

select t1a, t1b from t1 where t1h not in (select t2h from t2 where t2a = t1a) and t1b not in ( (select min(t3b) from t3 where t3d = t1d))

T1A                            T1B   
------------------------------ ------
t1a                                16
t1e                                10

  2 record(s) selected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. I missed that in my verification of the result sets. @kevinyu98 , please remove the last 2 test cases that have correlated predicates in the subqueries of NOT IN (i.e., TC 01.08 and 01.09).

@SparkQA
Copy link

SparkQA commented Jan 2, 2017

Test build #70782 has finished for PR 16337 at commit 895bb35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

("t1e", 10S, null, 19L, float(17.0), 25D, 26E2, timestamp '2014-09-04 01:02:00.001', date '2014-09-04'),
("t1d", 10S, null, 12L, float(17.0), 25D, 26E2, timestamp '2015-05-04 01:01:00.000', date '2015-05-04'),
("t1a", 6S, 8, 10L, float(15.0), 20D, 20E2, timestamp '2014-04-04 01:02:00.001', date '2014-04-04'),
("t1e", 10S, null, 19L, float(17.0), 25D, 26E2, timestamp '2014-05-04 01:01:00.000', date '2014-05-04')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data set in DB2 has one row difference, as shown below:

('t1e', 10, null, 19, 17, 25, 26.00, timestamp(date('2014-05-04')), null /*(date('2014-05-0=4')*/)

The above difference causes some query results do not match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I forgot to mention that I have made two changes in the data, we need to re-run the the db2 verification test.

  1. I remove the "=" at "date("2014-05-0=4")"
  2. I changed the the t1g/t2g/t3g from 26 to 2600(26E2)
    thanks for checking.

-- !query 3 output
t1b 8 16 19 17.0 25.0 2600 2014-05-04 01:01:00 2014-05-04
t1c 8 16 19 17.0 25.0 2600 2014-05-04 01:02:00.001 2014-05-05
t1e 10 NULL 19 17.0 25.0 2600 2014-05-04 01:01:00 2014-05-04
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This row does not match due to the input table t1 do not match.

t1e                                10           -                   19   +1.70000000000000E+001   +2.50000000000000E+001     26. 2014-05-04-00.00.00.000000 -         

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that, I forgot to mention that I the input data has changed. We need to use the new data set to run again in DB2 to get new result.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, please re-run it and update your PR description with the latest attached files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I have updated the PR description with the test results, the result looks fine.

-- !query 3 output
t1b 8.0
t1c 8.0
t1e 10.0
Copy link
Member

@gatorsmile gatorsmile Jan 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When comparing with the DB2 results, I found the data type of Avg is different. Just for the other reviewers, the document of DB2 explains the behavior:

The data type of the result is the same as the data type of the argument values, except that:
- The result is a large integer if the argument values are small integers.
- The result is double-precision floating point if the argument values are single-precision floating point.
- The result is DECFLOAT(34) if the argument is DECFLOAT(n).

@gatorsmile
Copy link
Member

I compared the results and confirmed the results are consistent. LGTM

@hvanhovell
Copy link
Contributor

@gatorsmile @nsyca @kevinyu98 can we pull the trigger on this one?

@nsyca
Copy link
Contributor

nsyca commented Jan 5, 2017

I believe it's ready to do so. Thank you, @gatorsmile, for your relentless effort in reviewing this PR. This PR will serve as a baseline for subsequent PRs we will submit for the remaining test cases on subquery. Hopefully we can get each of them in in a single pass of review.

@hvanhovell, I'd like to bring your attention to a design doc for the first phase of coding I posted in SPARK-18874 for reviewing. Your comments are appreciated so we could have our delivery of the code for the first phase right after we have all the test cases in our collection merged.

@gatorsmile
Copy link
Member

Will merge it if nobody objects it in the next 24 hours. : )

@rxin
Copy link
Contributor

rxin commented Jan 6, 2017

Go for it!

@gatorsmile
Copy link
Member

Thanks! Merging to master.

@asfgit asfgit closed this in bcc510b Jan 6, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Jan 9, 2017
## What changes were proposed in this pull request?
This PR extends the existing IN/NOT IN subquery test cases coverage, adds more test cases to the IN subquery test suite.

Based on the discussion, we will create  `subquery/in-subquery` sub structure under `sql/core/src/test/resources/sql-tests/inputs` directory.

This is the high level grouping for IN subquery:

`subquery/in-subquery/`
`subquery/in-subquery/simple-in.sql`
`subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)`
`subquery/in-subquery/not-in-group-by.sql`
`subquery/in-subquery/in-order-by.sql`
`subquery/in-subquery/in-limit.sql`
`subquery/in-subquery/in-having.sql`
`subquery/in-subquery/in-joins.sql`
`subquery/in-subquery/not-in-joins.sql`
`subquery/in-subquery/in-set-operations.sql`
`subquery/in-subquery/in-with-cte.sql`
`subquery/in-subquery/not-in-with-cte.sql`
subquery/in-subquery/in-multiple-columns.sql`

We will deliver it through multiple prs, this is the first pr for the IN subquery, it has

`subquery/in-subquery/simple-in.sql`
`subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)`

These are the results from running on DB2.
[Modified test file of in-group-by.sql used to run on DB2](https://github.com/apache/spark/files/683367/in-group-by.sql.db2.txt)
[Output of the run result on DB2](https://github.com/apache/spark/files/683362/in-group-by.sql.db2.out.txt)
[Modified test file of simple-in.sql used to run on DB2](https://github.com/apache/spark/files/683378/simple-in.sql.db2.txt)
[Output of the run result on DB2](https://github.com/apache/spark/files/683379/simple-in.sql.db2.out.txt)

## How was this patch tested?

This patch is adding tests.

Author: Kevin Yu <[email protected]>

Closes apache#16337 from kevinyu98/spark-18871.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?
This PR extends the existing IN/NOT IN subquery test cases coverage, adds more test cases to the IN subquery test suite.

Based on the discussion, we will create  `subquery/in-subquery` sub structure under `sql/core/src/test/resources/sql-tests/inputs` directory.

This is the high level grouping for IN subquery:

`subquery/in-subquery/`
`subquery/in-subquery/simple-in.sql`
`subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)`
`subquery/in-subquery/not-in-group-by.sql`
`subquery/in-subquery/in-order-by.sql`
`subquery/in-subquery/in-limit.sql`
`subquery/in-subquery/in-having.sql`
`subquery/in-subquery/in-joins.sql`
`subquery/in-subquery/not-in-joins.sql`
`subquery/in-subquery/in-set-operations.sql`
`subquery/in-subquery/in-with-cte.sql`
`subquery/in-subquery/not-in-with-cte.sql`
subquery/in-subquery/in-multiple-columns.sql`

We will deliver it through multiple prs, this is the first pr for the IN subquery, it has

`subquery/in-subquery/simple-in.sql`
`subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)`

These are the results from running on DB2.
[Modified test file of in-group-by.sql used to run on DB2](https://github.com/apache/spark/files/683367/in-group-by.sql.db2.txt)
[Output of the run result on DB2](https://github.com/apache/spark/files/683362/in-group-by.sql.db2.out.txt)
[Modified test file of simple-in.sql used to run on DB2](https://github.com/apache/spark/files/683378/simple-in.sql.db2.txt)
[Output of the run result on DB2](https://github.com/apache/spark/files/683379/simple-in.sql.db2.out.txt)

## How was this patch tested?

This patch is adding tests.

Author: Kevin Yu <[email protected]>

Closes apache#16337 from kevinyu98/spark-18871.
asfgit pushed a commit that referenced this pull request Mar 14, 2017
…ll up to Optimizer phase

## What changes were proposed in this pull request?
Currently Analyzer as part of ResolveSubquery, pulls up the correlated predicates to its
originating SubqueryExpression. The subquery plan is then transformed to remove the correlated
predicates after they are moved up to the outer plan. In this PR, the task of pulling up
correlated predicates is deferred to Optimizer. This is the initial work that will allow us to
support the form of correlated subqueries that we don't support today. The design document
from nsyca can be found in the following link :
[DesignDoc](https://docs.google.com/document/d/1QDZ8JwU63RwGFS6KVF54Rjj9ZJyK33d49ZWbjFBaIgU/edit#)

The brief description of code changes (hopefully to aid with code review) can be be found in the
following link:
[CodeChanges](https://docs.google.com/document/d/18mqjhL9V1An-tNta7aVE13HkALRZ5GZ24AATA-Vqqf0/edit#)

## How was this patch tested?
The test case PRs were submitted earlier using.
[16337](#16337) [16759](#16759) [16841](#16841) [16915](#16915) [16798](#16798) [16712](#16712) [16710](#16710) [16760](#16760) [16802](#16802)

Author: Dilip Biswal <[email protected]>

Closes #16954 from dilipbiswal/SPARK-18874.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants