-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-24341][SQL] Support only IN subqueries with the same number of items per row #21403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @cloud-fan @dilipbiswal @gatorsmile @juliuszsompolski: it is currently a WIP since I think the UTs have to be formalized a bit better. But I wanted to share with you this in order to understand if you agree on the strategy of this PR. I'd really appreciate any feedback. Thanks. |
|
Test build #90998 has finished for PR 21403 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we should unpack the struct and do a field by field comparison. The reason for this is that the field by field comparison can yield a null value, and the struct level comparison cannot. This matters a lot for null aware anti joins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Then I think that the example reported in the JIRA should be considered an invalid query, since the number of elements of the outside value is different from the one inside the query. So we should throw an AnalysisException for that case. Do you agree with this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just add these tests to the SqlQueryTestSuite. This is where most of the subquery tests can be found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot really add them there since I need to intercept AnalysisException here, but if you have suggestions about better places for this, I am happy to move it.
|
@mgaido91 can you link the correct JIRA? This one does not seem to be the correct one. |
|
thanks @hvanhovell, sorry for the error. I changed to the right one. |
|
I think that the way the columns are defined in the subquery should define the semantics. |
|
@juliuszsompolski I see your point, though it has some problems I think. If we follow this path, we are saying that: I'd prefer, in this case, having a rule about how we behave and follow that, throwing an AnalysisException otherwise. This is also the behavior of other RDBMS (I checked Oracle and Postgres):
So I would suggest going on with this approach, which could solve also other issues like SPARK-24395 since they would be considered as invalid. cc @hvanhovell what do you think? |
|
@mgaido91 This also works, +1. |
|
@mgaido91 BTW: In SPARK-24395 I would consider the cases to still be valid, because I believe there is no other syntactic way to do a multi-column IN/NOT IN with list of literals. |
|
@juliuszsompolski yes, you're right, sorry, SPARK-24395 uses literal and not subqueries, sorry. |
|
I am encountering big issues in enforcing the behavior we mentioned. The problem is that we cannot really distinguish the cases:
So the problem is that we don't know if we have a And as well we cannot really distinguish:
This is a bit trickier indeed. What do you think? |
|
Test build #92085 has finished for PR 21403 at commit
|
|
Test build #92084 has finished for PR 21403 at commit
|
|
I updated the PR according to the previous discussion. @hvanhovell @juliuszsompolski may you please take a look at it now? Thanks. |
|
I agree with the proposed behavior, but I'm a little worried about hacking the existig I took a look at postgres, |
|
@cloud-fan thank for looking at this. I don't think that "hack" can be removed. Let me show an example when I think we cannot avoid that change. Imagine this query: Without changing the way But the first query is invalid, as the outer value has one element an the subquery has 2 output fields, while the second query is valid. So the only way I found in order to avoid problem like this is changing |
|
OK I see the point now.
Can you give some similar examples in other databases? |
|
yes @cloud-fan , you're 100% right, we want to treat Here you are the previous example in Postgres: In Oracle/MySQL you cannot create structs using |
| """) | ||
| // scalastyle:on line.size.limit | ||
| case class In(value: Expression, list: Seq[Expression]) extends Predicate { | ||
| case class In(values: Seq[Expression], list: Seq[Expression]) extends Predicate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have an analyzer rule to deal with In(CreateStruct(...), ListQuery(...)), to unpack the CreateStruct, or pack the ListQuery? Then we don't need to change In.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, as the value can be replaced later by other rules. So we do need to have a Seq[Expression] here, instead of a single expression. Another possible option which I haven't checked, but I think it may be feasible is to create a new kind of Expression (eg. InValues) we can use only for this specific case. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on InValues. Maybe call it InSubquery
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not a subquery, this is the "left part" of IN, so I don't really agree on InSubquery, but if you have another suggestion I am happy to follow it. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean case class InSubquery(values: Seq[Expression], subquery: ListSubquery), it's not just the left part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, I think right behavior is the one which both Postgres and Hive have (and it is also the same of Oracle/MySQL, in which we don't have structs). What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should treat (...) specially if it's in front of In, but I'm wondering if we need to do the same thing for =.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure. The behavior when comparing structs in not uniform among different DBs. Hive doesn't allow = on structs. Postgres and Presto does, but their behavior with nulls is not consistent and it is different from ours. In particular, comparing a struct containing a null returns null on Postgres and causes an exception in Presto (we return false instead). This is causing another problem which has been reported in another JIRA for which we can return results different from Postgres and Oracle (SPARK-24395).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this specific case, instead, I'll update this PR creating the new ad-hoc expression for the values in front of IN if you agree, as we have to deal not only with the subquery case. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
|
Test build #92524 has finished for PR 21403 at commit
|
|
Test build #92581 has finished for PR 21403 at commit
|
|
anymore comments @cloud-fan @hvanhovell @juliuszsompolski ? |
|
Test build #93999 has finished for PR 21403 at commit
|
|
Test build #94129 has finished for PR 21403 at commit
|
|
Test build #94156 has finished for PR 21403 at commit
|
This reverts commit cb3467b.
|
Test build #94178 has finished for PR 21403 at commit
|
|
retest this please |
|
|
||
| // If the value expression is NULL then transform the In expression to null literal. | ||
| case In(Literal(null, _), _) => Literal.create(null, BooleanType) | ||
| case InSubquery(Seq(Literal(null, _)), _) => Literal.create(null, BooleanType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this! Please double check all the cases of IN in all the optimizer rules. We are afraid this new expression might introduce a regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a test case in OptimizeInSuite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comment. I checked it again and I am pretty sure no regression is introduced. We don't have many optimizer rules using In and all the others were and are applied only to In with a list of literals. I am adding this and the other tests. Thanks.
| assertEqual( | ||
| "a in (select b from c)", | ||
| In('a, Seq(ListQuery(table("c").select('b))))) | ||
| InSubquery(Seq('a), ListQuery(table("c").select('b)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add more cases in this test case? For example, when the input is CreateNamedStruct
|
Test build #94202 has finished for PR 21403 at commit
|
|
Test build #94269 has finished for PR 21403 at commit
|
|
retest this please |
|
Test build #94281 has finished for PR 21403 at commit
|
|
retest this please |
|
Test build #94292 has finished for PR 21403 at commit
|
|
LGTM, merging to master! |
|
|
||
|
|
||
| -- !query 3 | ||
| select 1 from tab_a where (a1, b1) not in (select a2, b2 from tab_b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the result of this query without this patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same as after the patch, ie. an empty result set. It is included here in order to ensure that this is considered a valid query.
|
|
||
|
|
||
| -- !query 4 | ||
| select 1 from tab_a where (a1, b1) not in (select (a2, b2) from tab_b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fails with a compile exception like the one reported in the JIRA
|
I'm writing release notes, and this one gets my attention. @mgaido91 can you confirm that this patch doesn't introduce any behavior change? i.e. if it fails previously, it still fails. If it successes previously, it still successes. |
|
@cloud-fan, no, it introduces a behavior change when structs are involved. The two queries here would fail before this query, while the version written like this would work (and after the PR doesn't work instead): Since before the PR any struct before the IN operator behaves like having |
|
ah i see. Can you add it to the migration guide? We need to tell users what will break after upgrading to 2.4 and why. |
|
@cloud-fan sure, I'll create a followup PR, thanks. |
|
I just realized there are 2 |
|
oh, I see @cloud-fan. But, IIUC, the other one is not used anymore. The only reference was removed by 4ce970d. I'll submit a PR to remove it if you agree. |
|
sure, feel free to open a PR first. |
What changes were proposed in this pull request?
Using struct types in subqueries with the
INclause can generate invalid plans inRewritePredicateSubquery. Indeed, we are not handling clearly the cases when the outer value is a struct or the output of the inner subquery is a struct.The PR aims to make Spark's behavior the same as the one of the other RDBMS - namely Oracle and Postgres behavior were checked. So we consider valid only queries having the same number of fields in the outer value and in the subquery. This means that:
(a, b) IN (select c, d from ...)is a valid query;(a, b) IN (select (c, d) from ...)throws an AnalysisException, as in the subquery we have only one field of type struct while in the outer value we have 2 fields;a IN (select (c, d) from ...)- whereais a struct - is a valid query.How was this patch tested?
Added UT