[SPARK-24341][SQL] Support only IN subqueries with the same number of items per row #21403

mgaido91 · 2018-05-22T22:19:02Z

What changes were proposed in this pull request?

Using struct types in subqueries with the IN clause can generate invalid plans in RewritePredicateSubquery. Indeed, we are not handling clearly the cases when the outer value is a struct or the output of the inner subquery is a struct.

The PR aims to make Spark's behavior the same as the one of the other RDBMS - namely Oracle and Postgres behavior were checked. So we consider valid only queries having the same number of fields in the outer value and in the subquery. This means that:

(a, b) IN (select c, d from ...) is a valid query;
(a, b) IN (select (c, d) from ...) throws an AnalysisException, as in the subquery we have only one field of type struct while in the outer value we have 2 fields;
a IN (select (c, d) from ...) - where a is a struct - is a valid query.

How was this patch tested?

Added UT

mgaido91 · 2018-05-22T22:20:43Z

cc @cloud-fan @dilipbiswal @gatorsmile @juliuszsompolski: it is currently a WIP since I think the UTs have to be formalized a bit better. But I wanted to share with you this in order to understand if you agree on the strategy of this PR. I'd really appreciate any feedback. Thanks.

SparkQA · 2018-05-23T01:22:16Z

Test build #90998 has finished for PR 21403 at commit 5b6226f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2018-05-23T08:22:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

I am not sure if we should unpack the struct and do a field by field comparison. The reason for this is that the field by field comparison can yield a null value, and the struct level comparison cannot. This matters a lot for null aware anti joins.

I see. Then I think that the example reported in the JIRA should be considered an invalid query, since the number of elements of the outside value is different from the one inside the query. So we should throw an AnalysisException for that case. Do you agree with this approach?

hvanhovell · 2018-05-23T08:23:19Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

Can we just add these tests to the SqlQueryTestSuite. This is where most of the subquery tests can be found.

I cannot really add them there since I need to intercept AnalysisException here, but if you have suggestions about better places for this, I am happy to move it.

hvanhovell · 2018-05-23T08:26:59Z

@mgaido91 can you link the correct JIRA? This one does not seem to be the correct one.

mgaido91 · 2018-05-23T11:46:19Z

thanks @hvanhovell, sorry for the error. I changed to the right one.

juliuszsompolski · 2018-05-29T09:50:39Z

I think that the way the columns are defined in the subquery should define the semantics.
E.g.:
(a, b) IN (select c, d from ...) - unpack (a, b) and treat it as a multi column comparison as in current semantics.
(a, b) IN (select (c, d) from ..) - keep it packed and treat it as a single column IN.
(a, b, c) IN (select (d, e), f from ..) or similar combinations - catch it in analysis as ambiguous
(a, b, c) IN (select (d, e), f, g from ..) - but this is valid as long as a matches the type of (d, e)

mgaido91 · 2018-05-29T11:32:30Z

@juliuszsompolski I see your point, though it has some problems I think. If we follow this path, we are saying that: (a, b) IN (select c, d from ...) has a different result from (a, b) IN (select (c, d) from ..) and (a, b) IN ((1, 2)). We can probably argument that they are different things so they can lead to different results, but this is not very intuitive for a user.

I'd prefer, in this case, having a rule about how we behave and follow that, throwing an AnalysisException otherwise. This is also the behavior of other RDBMS (I checked Oracle and Postgres):

(a, b) IN (select c, d from ...) unpacks them;
(a, b) IN (select (c, d) from ..) throws an AnalysisException

So I would suggest going on with this approach, which could solve also other issues like SPARK-24395 since they would be considered as invalid.

cc @hvanhovell what do you think?

juliuszsompolski · 2018-05-29T18:02:47Z

@mgaido91 This also works, +1.
What about a in (select (b, c) from ...) when a is a struct? - I guess allow it, but a potential gotcha during implementation

juliuszsompolski · 2018-05-29T20:14:17Z

@mgaido91 BTW: In SPARK-24395 I would consider the cases to still be valid, because I believe there is no other syntactic way to do a multi-column IN/NOT IN with list of literals.
The question is whether it should be treated as structs, or unpacked?
If like structs, then the current behavior is correct, I think.

mgaido91 · 2018-05-29T20:41:06Z

@juliuszsompolski yes, you're right, sorry, SPARK-24395 uses literal and not subqueries, sorry.

mgaido91 · 2018-05-30T15:13:14Z

I am encountering big issues in enforcing the behavior we mentioned. The problem is that we cannot really distinguish the cases:

... (a, b) in (select ...)
... from (select (a, b) as x ...) where x in (select ...)

So the problem is that we don't know if we have a CreateNamedStruct exactly there or if the Optimizer puts it somehow there. It may be needed to change the parsing logic for this and to revisit the whole In structure. I mean, we have to parse not a single value but a list of values in the outer operator.

And as well we cannot really distinguish:

... (a, b) in (select (a, b) from ...)
... (a, b) in (select a from (select (a, b) from ...))

This is a bit trickier indeed.

What do you think?

SparkQA · 2018-06-19T14:17:58Z

Test build #92085 has finished for PR 21403 at commit c9a36e0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class In(values: Seq[Expression], list: Seq[Expression]) extends Predicate

SparkQA · 2018-06-19T14:57:25Z

Test build #92084 has finished for PR 21403 at commit df7d3ee.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-06-19T15:52:49Z

I updated the PR according to the previous discussion.

@hvanhovell @juliuszsompolski may you please take a look at it now? Thanks.

cloud-fan · 2018-06-25T15:32:42Z

I agree with the proposed behavior, but I'm a little worried about hacking the existig In expression to implement it.

I took a look at postgres, a = b is equal to a.i = b.i and a.j = b.j if a and b are both struct type with same number of fields with same type. If we make Spark's equal operator follow this semantic, it can make this PR simpler.

mgaido91 · 2018-06-25T16:10:09Z

@cloud-fan thank for looking at this. I don't think that "hack" can be removed. Let me show an example when I think we cannot avoid that change.

Imagine this query:

select 1 from (select (1, 'a') as col1) tab1 where col1 in (select 1, 'a')

Without changing the way In is built this is equivalent to:

select 1 from (select 1 as col1, 'a' as col2) tab1 where (col1, col2) in (select 1, 'a')

But the first query is invalid, as the outer value has one element an the subquery has 2 output fields, while the second query is valid. So the only way I found in order to avoid problem like this is changing In as done in this PR.

cloud-fan · 2018-06-28T09:20:38Z

OK I see the point now.

a IN (select (c, d) from ...) is valid but (i, j) IN (select (c, d) from ...) is not. This is a little weird because a is semantically same with (i, j) if a is struct type with 2 fields i and j. It sounds like we want to treat (...) specially in the context of IN expression.

Can you give some similar examples in other databases?

mgaido91 · 2018-06-28T09:51:53Z

yes @cloud-fan , you're 100% right, we want to treat (...) differently when it is in front of IN.

Here you are the previous example in Postgres:

mgaido=# select 1 from (select (1, 'a') as col1) tab1 where col1 in (select 1, 'a');
ERROR:  subquery has too many columns
LINE 1: ... 1 from (select (1, 'a') as col1) tab1 where col1 in (select...

mgaido=# select 1 from (select 1 as col1, 'a' as col2) tab1 where (col1, col2) in (select 1, 'a');
 ?column? 
----------
        1
(1 row)

In Oracle/MySQL you cannot create structs using (...) but you have to define a custom data type for structs, so this situation is prevented to happen.

cloud-fan · 2018-06-28T12:47:10Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

  """)
 // scalastyle:on line.size.limit
-case class In(value: Expression, list: Seq[Expression]) extends Predicate {
+case class In(values: Seq[Expression], list: Seq[Expression]) extends Predicate {


can we have an analyzer rule to deal with In(CreateStruct(...), ListQuery(...)), to unpack the CreateStruct, or pack the ListQuery? Then we don't need to change In.

I don't think so, as the value can be replaced later by other rules. So we do need to have a Seq[Expression] here, instead of a single expression. Another possible option which I haven't checked, but I think it may be feasible is to create a new kind of Expression (eg. InValues) we can use only for this specific case. What do you think?

+1 on InValues. Maybe call it InSubquery

it is not a subquery, this is the "left part" of IN, so I don't really agree on InSubquery, but if you have another suggestion I am happy to follow it. Thanks.

I mean case class InSubquery(values: Seq[Expression], subquery: ListSubquery), it's not just the left part.

Anyway, I think right behavior is the one which both Postgres and Hive have (and it is also the same of Oracle/MySQL, in which we don't have structs). What do you think?

I agree we should treat (...) specially if it's in front of In, but I'm wondering if we need to do the same thing for =.

I am not sure. The behavior when comparing structs in not uniform among different DBs. Hive doesn't allow = on structs. Postgres and Presto does, but their behavior with nulls is not consistent and it is different from ours. In particular, comparing a struct containing a null returns null on Postgres and causes an exception in Presto (we return false instead). This is causing another problem which has been reported in another JIRA for which we can return results different from Postgres and Oracle (SPARK-24395).

For this specific case, instead, I'll update this PR creating the new ad-hoc expression for the values in front of IN if you agree, as we have to deal not only with the subquery case. What do you think?

SparkQA · 2018-07-02T14:21:01Z

Test build #92524 has finished for PR 21403 at commit 268307f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class InValues(children: Seq[Expression]) extends Expression
case class In(value: Expression, list: Seq[Expression]) extends Predicate

SparkQA · 2018-07-04T01:55:11Z

Test build #92581 has finished for PR 21403 at commit d3e39ed.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-07-09T10:09:22Z

anymore comments @cloud-fan @hvanhovell @juliuszsompolski ?

SparkQA · 2018-08-02T15:24:34Z

Test build #93999 has finished for PR 21403 at commit 0f00a06.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-03T15:39:15Z

Test build #94129 has finished for PR 21403 at commit 45a91fc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-03T18:34:48Z

Test build #94156 has finished for PR 21403 at commit cb3467b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class Exists(
case class InSubquery(values: Seq[Expression],

This reverts commit cb3467b.

SparkQA · 2018-08-04T00:03:41Z

Test build #94178 has finished for PR 21403 at commit a6114a6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class InSubquery(values: Seq[Expression], query: ListQuery)
case class ListQuery(
case class Exists(

gatorsmile · 2018-08-04T06:40:22Z

retest this please

gatorsmile · 2018-08-04T06:45:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala


      // If the value expression is NULL then transform the In expression to null literal.
      case In(Literal(null, _), _) => Literal.create(null, BooleanType)
+      case InSubquery(Seq(Literal(null, _)), _) => Literal.create(null, BooleanType)


Thanks for adding this! Please double check all the cases of IN in all the optimizer rules. We are afraid this new expression might introduce a regression.

Add a test case in OptimizeInSuite

Thanks for your comment. I checked it again and I am pretty sure no regression is introduced. We don't have many optimizer rules using In and all the others were and are applied only to In with a list of literals. I am adding this and the other tests. Thanks.

gatorsmile · 2018-08-04T06:51:50Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ExpressionParserSuite.scala

    assertEqual(
      "a in (select b from c)",
-      In('a, Seq(ListQuery(table("c").select('b)))))
+      InSubquery(Seq('a), ListQuery(table("c").select('b))))


Could you add more cases in this test case? For example, when the input is CreateNamedStruct

SparkQA · 2018-08-04T07:05:01Z

Test build #94202 has finished for PR 21403 at commit a6114a6.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class InSubquery(values: Seq[Expression], query: ListQuery)
case class ListQuery(
case class Exists(

SparkQA · 2018-08-06T11:23:44Z

Test build #94269 has finished for PR 21403 at commit eb1dfb7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-08-06T11:27:50Z

retest this please

SparkQA · 2018-08-06T15:47:02Z

Test build #94281 has finished for PR 21403 at commit eb1dfb7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-08-06T15:55:51Z

retest this please

SparkQA · 2018-08-06T20:20:16Z

Test build #94292 has finished for PR 21403 at commit eb1dfb7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-08-07T07:43:53Z

LGTM, merging to master!

cloud-fan · 2018-09-18T04:05:20Z

sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-basic.sql.out

+
+
+-- !query 3
+select 1 from tab_a where (a1, b1) not in (select a2, b2 from tab_b)


what's the result of this query without this patch?

the same as after the patch, ie. an empty result set. It is included here in order to ensure that this is considered a valid query.

cloud-fan · 2018-09-18T04:05:31Z

sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-basic.sql.out

+
+
+-- !query 4
+select 1 from tab_a where (a1, b1) not in (select (a2, b2) from tab_b)


This fails with a compile exception like the one reported in the JIRA

cloud-fan · 2018-09-18T14:23:47Z

I'm writing release notes, and this one gets my attention. @mgaido91 can you confirm that this patch doesn't introduce any behavior change? i.e. if it fails previously, it still fails. If it successes previously, it still successes.

mgaido91 · 2018-09-19T07:11:16Z

@cloud-fan, no, it introduces a behavior change when structs are involved. The two queries here would fail before this query, while the version written like this would work (and after the PR doesn't work instead):

select count(*) from struct_tab where record in (select a2, b2 from tab_b);
select count(*) from struct_tab where record not in (select a2, b2 from tab_b);

Since before the PR any struct before the IN operator behaves like having (f1, f2, ...), while after the PR a struct there is considered as a field.

cloud-fan · 2018-09-19T10:48:32Z

ah i see. Can you add it to the migration guide? We need to tell users what will break after upgrading to 2.4 and why.

mgaido91 · 2018-09-19T10:53:48Z

@cloud-fan sure, I'll create a followup PR, thanks.

cloud-fan · 2018-09-26T16:26:23Z

I just realized there are 2 InSubquery expressions, seems we need to rename one of it. @mgaido91 any ideas?

mgaido91 · 2018-09-26T16:58:43Z

oh, I see @cloud-fan. But, IIUC, the other one is not used anymore. The only reference was removed by 4ce970d. I'll submit a PR to remove it if you agree.

cloud-fan · 2018-09-27T00:02:01Z

sure, feel free to open a PR first.

hvanhovell reviewed May 23, 2018

View reviewed changes

mgaido91 changed the title ~~[SPARK-24313][WIP][SQL] Support IN subqueries with struct type~~ [SPARK-24341][WIP][SQL] Support IN subqueries with struct type May 23, 2018

[SPARK-24313][SQL] Support In subqueries which are valid in other RDBMS

c9a36e0

mgaido91 force-pushed the SPARK-24313 branch from df7d3ee to c9a36e0 Compare June 19, 2018 11:20

mgaido91 changed the title ~~[SPARK-24341][WIP][SQL] Support IN subqueries with struct type~~ [SPARK-24341][SQL] Support only IN subqueries with the same number of items per row Jun 25, 2018

cloud-fan reviewed Jun 28, 2018

View reviewed changes

mgaido91 added 2 commits June 29, 2018 16:05

introduce InValues

65ff49a

add analyzer rule to add InValues

268307f

fix ut failures

d3e39ed

fix test error

45a91fc

remove ListQuery

cb3467b

Revert "remove ListQuery"

a6114a6

This reverts commit cb3467b.

gatorsmile reviewed Aug 4, 2018

View reviewed changes

address comment

eb1dfb7

asfgit closed this in 88e0c7b Aug 7, 2018

cloud-fan reviewed Sep 18, 2018

View reviewed changes

cloud-fan mentioned this pull request Oct 15, 2018

[SPARK-24395][SQL] IN operator should return NULL when comparing struct with NULL fields #22029

Closed

cloud-fan mentioned this pull request Sep 21, 2020

[SPARK-32931][SQL] Unevaluable Expressions are not Foldable #29798

Closed



		-- !query 3
		select 1 from tab_a where (a1, b1) not in (select a2, b2 from tab_b)



		-- !query 4
		select 1 from tab_a where (a1, b1) not in (select (a2, b2) from tab_b)

[SPARK-24341][SQL] Support only IN subqueries with the same number of items per row #21403

[SPARK-24341][SQL] Support only IN subqueries with the same number of items per row #21403

Uh oh!

Conversation

mgaido91 commented May 22, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

mgaido91 commented May 22, 2018

Uh oh!

SparkQA commented May 23, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hvanhovell commented May 23, 2018

Uh oh!

mgaido91 commented May 23, 2018

Uh oh!

juliuszsompolski commented May 29, 2018

Uh oh!

mgaido91 commented May 29, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

juliuszsompolski commented May 29, 2018

Uh oh!

juliuszsompolski commented May 29, 2018

Uh oh!

mgaido91 commented May 29, 2018

Uh oh!

mgaido91 commented May 30, 2018

Uh oh!

SparkQA commented Jun 19, 2018

Uh oh!

SparkQA commented Jun 19, 2018

Uh oh!

mgaido91 commented Jun 19, 2018

Uh oh!

cloud-fan commented Jun 25, 2018

Uh oh!

mgaido91 commented Jun 25, 2018

Uh oh!

cloud-fan commented Jun 28, 2018

Uh oh!

mgaido91 commented Jun 28, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jul 2, 2018

Uh oh!

SparkQA commented Jul 4, 2018

Uh oh!

mgaido91 commented Jul 9, 2018

Uh oh!

SparkQA commented Aug 2, 2018

Uh oh!

mgaido91 commented May 22, 2018 •

edited

Loading

mgaido91 commented May 29, 2018 •

edited

Loading