Skip to content

Conversation

@mgaido91
Copy link
Contributor

@mgaido91 mgaido91 commented May 22, 2018

What changes were proposed in this pull request?

Using struct types in subqueries with the IN clause can generate invalid plans in RewritePredicateSubquery. Indeed, we are not handling clearly the cases when the outer value is a struct or the output of the inner subquery is a struct.

The PR aims to make Spark's behavior the same as the one of the other RDBMS - namely Oracle and Postgres behavior were checked. So we consider valid only queries having the same number of fields in the outer value and in the subquery. This means that:

  • (a, b) IN (select c, d from ...) is a valid query;
  • (a, b) IN (select (c, d) from ...) throws an AnalysisException, as in the subquery we have only one field of type struct while in the outer value we have 2 fields;
  • a IN (select (c, d) from ...) - where a is a struct - is a valid query.

How was this patch tested?

Added UT

@mgaido91
Copy link
Contributor Author

cc @cloud-fan @dilipbiswal @gatorsmile @juliuszsompolski: it is currently a WIP since I think the UTs have to be formalized a bit better. But I wanted to share with you this in order to understand if you agree on the strategy of this PR. I'd really appreciate any feedback. Thanks.

@SparkQA
Copy link

SparkQA commented May 23, 2018

Test build #90998 has finished for PR 21403 at commit 5b6226f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we should unpack the struct and do a field by field comparison. The reason for this is that the field by field comparison can yield a null value, and the struct level comparison cannot. This matters a lot for null aware anti joins.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Then I think that the example reported in the JIRA should be considered an invalid query, since the number of elements of the outside value is different from the one inside the query. So we should throw an AnalysisException for that case. Do you agree with this approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just add these tests to the SqlQueryTestSuite. This is where most of the subquery tests can be found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot really add them there since I need to intercept AnalysisException here, but if you have suggestions about better places for this, I am happy to move it.

@hvanhovell
Copy link
Contributor

@mgaido91 can you link the correct JIRA? This one does not seem to be the correct one.

@mgaido91 mgaido91 changed the title [SPARK-24313][WIP][SQL] Support IN subqueries with struct type [SPARK-24341][WIP][SQL] Support IN subqueries with struct type May 23, 2018
@mgaido91
Copy link
Contributor Author

thanks @hvanhovell, sorry for the error. I changed to the right one.

@juliuszsompolski
Copy link
Contributor

I think that the way the columns are defined in the subquery should define the semantics.
E.g.:
(a, b) IN (select c, d from ...) - unpack (a, b) and treat it as a multi column comparison as in current semantics.
(a, b) IN (select (c, d) from ..) - keep it packed and treat it as a single column IN.
(a, b, c) IN (select (d, e), f from ..) or similar combinations - catch it in analysis as ambiguous
(a, b, c) IN (select (d, e), f, g from ..) - but this is valid as long as a matches the type of (d, e)

@mgaido91
Copy link
Contributor Author

mgaido91 commented May 29, 2018

@juliuszsompolski I see your point, though it has some problems I think. If we follow this path, we are saying that: (a, b) IN (select c, d from ...) has a different result from (a, b) IN (select (c, d) from ..) and (a, b) IN ((1, 2)). We can probably argument that they are different things so they can lead to different results, but this is not very intuitive for a user.

I'd prefer, in this case, having a rule about how we behave and follow that, throwing an AnalysisException otherwise. This is also the behavior of other RDBMS (I checked Oracle and Postgres):

  • (a, b) IN (select c, d from ...) unpacks them;
  • (a, b) IN (select (c, d) from ..) throws an AnalysisException

So I would suggest going on with this approach, which could solve also other issues like SPARK-24395 since they would be considered as invalid.

cc @hvanhovell what do you think?

@juliuszsompolski
Copy link
Contributor

@mgaido91 This also works, +1.
What about a in (select (b, c) from ...) when a is a struct? - I guess allow it, but a potential gotcha during implementation

@juliuszsompolski
Copy link
Contributor

@mgaido91 BTW: In SPARK-24395 I would consider the cases to still be valid, because I believe there is no other syntactic way to do a multi-column IN/NOT IN with list of literals.
The question is whether it should be treated as structs, or unpacked?
If like structs, then the current behavior is correct, I think.

@mgaido91
Copy link
Contributor Author

@juliuszsompolski yes, you're right, sorry, SPARK-24395 uses literal and not subqueries, sorry.

@mgaido91
Copy link
Contributor Author

I am encountering big issues in enforcing the behavior we mentioned. The problem is that we cannot really distinguish the cases:

  • ... (a, b) in (select ...)
  • ... from (select (a, b) as x ...) where x in (select ...)

So the problem is that we don't know if we have a CreateNamedStruct exactly there or if the Optimizer puts it somehow there. It may be needed to change the parsing logic for this and to revisit the whole In structure. I mean, we have to parse not a single value but a list of values in the outer operator.

And as well we cannot really distinguish:

  • ... (a, b) in (select (a, b) from ...)
  • ... (a, b) in (select a from (select (a, b) from ...))

This is a bit trickier indeed.

What do you think?

@SparkQA
Copy link

SparkQA commented Jun 19, 2018

Test build #92085 has finished for PR 21403 at commit c9a36e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class In(values: Seq[Expression], list: Seq[Expression]) extends Predicate

@SparkQA
Copy link

SparkQA commented Jun 19, 2018

Test build #92084 has finished for PR 21403 at commit df7d3ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

I updated the PR according to the previous discussion.

@hvanhovell @juliuszsompolski may you please take a look at it now? Thanks.

@mgaido91 mgaido91 changed the title [SPARK-24341][WIP][SQL] Support IN subqueries with struct type [SPARK-24341][SQL] Support only IN subqueries with the same number of items per row Jun 25, 2018
@cloud-fan
Copy link
Contributor

I agree with the proposed behavior, but I'm a little worried about hacking the existig In expression to implement it.

I took a look at postgres, a = b is equal to a.i = b.i and a.j = b.j if a and b are both struct type with same number of fields with same type. If we make Spark's equal operator follow this semantic, it can make this PR simpler.

@mgaido91
Copy link
Contributor Author

@cloud-fan thank for looking at this. I don't think that "hack" can be removed. Let me show an example when I think we cannot avoid that change.

Imagine this query:

select 1 from (select (1, 'a') as col1) tab1 where col1 in (select 1, 'a')

Without changing the way In is built this is equivalent to:

select 1 from (select 1 as col1, 'a' as col2) tab1 where (col1, col2) in (select 1, 'a')

But the first query is invalid, as the outer value has one element an the subquery has 2 output fields, while the second query is valid. So the only way I found in order to avoid problem like this is changing In as done in this PR.

@cloud-fan
Copy link
Contributor

OK I see the point now.

a IN (select (c, d) from ...) is valid but (i, j) IN (select (c, d) from ...) is not. This is a little weird because a is semantically same with (i, j) if a is struct type with 2 fields i and j. It sounds like we want to treat (...) specially in the context of IN expression.

Can you give some similar examples in other databases?

@mgaido91
Copy link
Contributor Author

yes @cloud-fan , you're 100% right, we want to treat (...) differently when it is in front of IN.

Here you are the previous example in Postgres:

mgaido=# select 1 from (select (1, 'a') as col1) tab1 where col1 in (select 1, 'a');
ERROR:  subquery has too many columns
LINE 1: ... 1 from (select (1, 'a') as col1) tab1 where col1 in (select...

mgaido=# select 1 from (select 1 as col1, 'a' as col2) tab1 where (col1, col2) in (select 1, 'a');
 ?column? 
----------
        1
(1 row)

In Oracle/MySQL you cannot create structs using (...) but you have to define a custom data type for structs, so this situation is prevented to happen.

""")
// scalastyle:on line.size.limit
case class In(value: Expression, list: Seq[Expression]) extends Predicate {
case class In(values: Seq[Expression], list: Seq[Expression]) extends Predicate {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have an analyzer rule to deal with In(CreateStruct(...), ListQuery(...)), to unpack the CreateStruct, or pack the ListQuery? Then we don't need to change In.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, as the value can be replaced later by other rules. So we do need to have a Seq[Expression] here, instead of a single expression. Another possible option which I haven't checked, but I think it may be feasible is to create a new kind of Expression (eg. InValues) we can use only for this specific case. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on InValues. Maybe call it InSubquery

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not a subquery, this is the "left part" of IN, so I don't really agree on InSubquery, but if you have another suggestion I am happy to follow it. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean case class InSubquery(values: Seq[Expression], subquery: ListSubquery), it's not just the left part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, I think right behavior is the one which both Postgres and Hive have (and it is also the same of Oracle/MySQL, in which we don't have structs). What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should treat (...) specially if it's in front of In, but I'm wondering if we need to do the same thing for =.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure. The behavior when comparing structs in not uniform among different DBs. Hive doesn't allow = on structs. Postgres and Presto does, but their behavior with nulls is not consistent and it is different from ours. In particular, comparing a struct containing a null returns null on Postgres and causes an exception in Presto (we return false instead). This is causing another problem which has been reported in another JIRA for which we can return results different from Postgres and Oracle (SPARK-24395).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this specific case, instead, I'll update this PR creating the new ad-hoc expression for the values in front of IN if you agree, as we have to deal not only with the subquery case. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

@SparkQA
Copy link

SparkQA commented Jul 2, 2018

Test build #92524 has finished for PR 21403 at commit 268307f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InValues(children: Seq[Expression]) extends Expression
  • case class In(value: Expression, list: Seq[Expression]) extends Predicate

@SparkQA
Copy link

SparkQA commented Jul 4, 2018

Test build #92581 has finished for PR 21403 at commit d3e39ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Jul 9, 2018

anymore comments @cloud-fan @hvanhovell @juliuszsompolski ?

@SparkQA
Copy link

SparkQA commented Aug 2, 2018

Test build #93999 has finished for PR 21403 at commit 0f00a06.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 3, 2018

Test build #94129 has finished for PR 21403 at commit 45a91fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 3, 2018

Test build #94156 has finished for PR 21403 at commit cb3467b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Exists(
  • case class InSubquery(values: Seq[Expression],

This reverts commit cb3467b.
@SparkQA
Copy link

SparkQA commented Aug 4, 2018

Test build #94178 has finished for PR 21403 at commit a6114a6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InSubquery(values: Seq[Expression], query: ListQuery)
  • case class ListQuery(
  • case class Exists(

@gatorsmile
Copy link
Member

retest this please


// If the value expression is NULL then transform the In expression to null literal.
case In(Literal(null, _), _) => Literal.create(null, BooleanType)
case InSubquery(Seq(Literal(null, _)), _) => Literal.create(null, BooleanType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this! Please double check all the cases of IN in all the optimizer rules. We are afraid this new expression might introduce a regression.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test case in OptimizeInSuite

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comment. I checked it again and I am pretty sure no regression is introduced. We don't have many optimizer rules using In and all the others were and are applied only to In with a list of literals. I am adding this and the other tests. Thanks.

assertEqual(
"a in (select b from c)",
In('a, Seq(ListQuery(table("c").select('b)))))
InSubquery(Seq('a), ListQuery(table("c").select('b))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add more cases in this test case? For example, when the input is CreateNamedStruct

@SparkQA
Copy link

SparkQA commented Aug 4, 2018

Test build #94202 has finished for PR 21403 at commit a6114a6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InSubquery(values: Seq[Expression], query: ListQuery)
  • case class ListQuery(
  • case class Exists(

@SparkQA
Copy link

SparkQA commented Aug 6, 2018

Test build #94269 has finished for PR 21403 at commit eb1dfb7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 6, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Aug 6, 2018

Test build #94281 has finished for PR 21403 at commit eb1dfb7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 6, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Aug 6, 2018

Test build #94292 has finished for PR 21403 at commit eb1dfb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in 88e0c7b Aug 7, 2018


-- !query 3
select 1 from tab_a where (a1, b1) not in (select a2, b2 from tab_b)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the result of this query without this patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same as after the patch, ie. an empty result set. It is included here in order to ensure that this is considered a valid query.



-- !query 4
select 1 from tab_a where (a1, b1) not in (select (a2, b2) from tab_b)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fails with a compile exception like the one reported in the JIRA

@cloud-fan
Copy link
Contributor

I'm writing release notes, and this one gets my attention. @mgaido91 can you confirm that this patch doesn't introduce any behavior change? i.e. if it fails previously, it still fails. If it successes previously, it still successes.

@mgaido91
Copy link
Contributor Author

@cloud-fan, no, it introduces a behavior change when structs are involved. The two queries here would fail before this query, while the version written like this would work (and after the PR doesn't work instead):

select count(*) from struct_tab where record in (select a2, b2 from tab_b);
select count(*) from struct_tab where record not in (select a2, b2 from tab_b);

Since before the PR any struct before the IN operator behaves like having (f1, f2, ...), while after the PR a struct there is considered as a field.

@cloud-fan
Copy link
Contributor

ah i see. Can you add it to the migration guide? We need to tell users what will break after upgrading to 2.4 and why.

@mgaido91
Copy link
Contributor Author

@cloud-fan sure, I'll create a followup PR, thanks.

@cloud-fan
Copy link
Contributor

I just realized there are 2 InSubquery expressions, seems we need to rename one of it. @mgaido91 any ideas?

@mgaido91
Copy link
Contributor Author

oh, I see @cloud-fan. But, IIUC, the other one is not used anymore. The only reference was removed by 4ce970d. I'll submit a PR to remove it if you agree.

@cloud-fan
Copy link
Contributor

sure, feel free to open a PR first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants