-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-24395][SQL] IN operator should return NULL when comparing struct with NULL fields #22029
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ct with NULL fields
|
Test build #94382 has finished for PR 22029 at commit
|
|
Is there a clear definition for the expected behavior? I tried postgre before, it returns null for things like |
|
@cloud-fan I agree with you that struct comparison behavior is questionable. But the IN (and NOT IN) behavior is coherent among Postgres, Oracle (and different from the current Spark behavior). This is the reason why I decided to propose to change only the behavior of the IN operator without changing the behavior of the struct comparison itself, which would be a much bigger change. Anyway, answering to your question, since this PR relates only to IN/NOT IN, yes, there is a clear definition for the expected behavior in the other RDBMS. PS please also notice that Spark currently has a different behavior for |
Can you put the general definition of the IN behavior (without subquery) in the PR description? The PR titile only specified the behavior for a specific case, not a general definition. e.g. nested fields are not mentioned. |
Sure, I am changing it. Let me know if the description I am writing is clear/complete enough. Thanks.
They are not mentioned as in other RDBMS are not really possible. Neither Postgres nor Oracle allow that. So I haven't changed their handling as - I think - the expected behavior in that case is questionable. I think in general, IN using literal and IN using subqueries should return the same result if they are provided with the same data (as it happens in the RDBMS). So that case is handled as before the PR since the comparison of structs has not been changed here. |
|
@cloud-fan I updated the description. Let me know if now it is clear enough. Thanks. |
|
can we also update the classdoc of |
|
Test build #94418 has finished for PR 22029 at commit
|
|
BTW it will be great if this behavior is defined by the SQL standard. And we should also check the big data SQL systems like Hive and Presto. In general we should be careful for uncertain behaviors like this. That said, I feel returing false is not that wrong to me without considering other systems. also cc @gatorsmile @dongjoon-hyun |
|
Presto and Hive work in different ways. Hive Hive behaves as Spark before the PR, but Hive doesn't support IN with subqueries where the subquery outputs more than one field. So in Hive there is not the incoherence which is present in Spark (ie. with the same data IN with literals returns a different result than IN with subquery). Presto Presto, instead, behaves as Spark before the PR, but when there is a null both in the value and in the list of literal, it throws an exception and it behaves in the same way also when IN is used with subqueries. Summarizing:
|
| override def nullable: Boolean = value.dataType match { | ||
| case _: StructType if !SQLConf.get.inFalseForNullField => | ||
| children.exists(_.nullable) || | ||
| children.exists(_.dataType.asInstanceOf[StructType].exists(_.nullable)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 . Ur, this asInstanceOf[StructType] looks unsafe for non-nullable primitive type. Could you add an additional test case for the nested StructType to provide the coverage for this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you think it is unsafe? We are doing it only if the dataType of value match StructType and according also to what is enforced in the checkInputDataTypes, therefore all the children must be StructTypes.
I will add anyway a test case for the scenario you described, thanks.
| "important especially when using NOT IN as in the second case, it filters out the rows " + | ||
| "when a null is present in a filed; while in the first one, those rows are returned.") | ||
| .booleanConf | ||
| .createWithDefault(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we set this true by default in Spark 2.4 at least?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it can be done, WDYT @cloud-fan @gatorsmile ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kindly ping @cloud-fan @gatorsmile
|
Thank you for pinging me, @cloud-fan . For me, new way suggested in this PR looks reasonable. Also, this PR is configurable. We may be able to start to provide this as a non-default way in Spark 2.4. |
|
@mgaido91 do you mean the current behavior is same with Hive and Presto? |
|
@cloud-fan if we consider only the expression IN with literals, yes, the behavior is very similar, with the following difference: Presto throws exception when null is present on both sides. But what I'd really like to highlight is that there is a huge difference: both Hive and Presto (as well as the RDBMs are coherent in their behavior between IN with literals and IN with subquery). Actually, Hive doesn't support subqueries with more than one output field; Presto instead behaves in the same way in the case with literals and subqueries, which means it behaves like Spark before this PR with literals, but different from Spark when subqueries are involved. |
|
Test build #94480 has finished for PR 22029 at commit
|
|
Test build #94482 has finished for PR 22029 at commit
|
|
@cloud-fan @gatorsmile any thoughts on this? |
|
kindly ping @cloud-fan @gatorsmile |
|
kindly ping @cloud-fan @gatorsmile |
|
Test build #96035 has finished for PR 22029 at commit
|
|
@cloud-fan @dongjoon-hyun @gatorsmile anymore comments on this? |
| // to FalseLiteral which is tested in OptimizeInSuite.scala | ||
| If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType)) | ||
| case expr @ In(v, list) if expr.inSetConvertible => | ||
| If(IsNotNull(i.value), FalseLiteral, Literal(null, BooleanType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to look at inFalseForNullField right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rigth, I fixed it and added a test, thanks.
| && !expr.value.isInstanceOf[CreateNamedStructLike] | ||
| && !newList.head.isInstanceOf[CreateNamedStructLike]) { | ||
| EqualTo(v, newList.head) | ||
| EqualTo(expr.value, newList.head) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, we do this only when value is not a CreateNamedStructLike, so we don't go here if there are multi-values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we update the match here? I think it should be In(Seq(vaue) ...) now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, sorry, we can't do that, otherwise we would skip the other possible optimizations here, eg. converting to InSet, reducing the list of values, etc.etc.
What should be done, instead, is doing the same change to InSet, so that the way nulls are handled is coherent.
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Outdated
Show resolved
Hide resolved
| // TODO: `EqualTo` for structural types are not working. Until SPARK-24443 is addressed, | ||
| // TODO: we exclude them in this rule. | ||
| && !v.isInstanceOf[CreateNamedStructLike] | ||
| && !expr.value.isInstanceOf[CreateNamedStructLike] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@transient protected lazy val isMultiValued = values.length > 1
@transient lazy val value: Expression = if (isMultiValued) {
CreateNamedStruct(values.zipWithIndex.flatMap {
case (v: NamedExpression, _) => Seq(Literal(v.name), v)
case (v, idx) => Seq(Literal(s"_$idx"), v)
})
} else {
values.head
}
}
According to the implementation, expr.value.isInstanceOf[CreateNamedStructLike] means expr.values.length > 1, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, rigth
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we use expr.values.length == 1 here to make it more clear?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really, because expr.value.isInstanceOf[CreateNamedStructLike] means:
- either
expr.values.length == 1; - or
expr.values.head.isInstanceOf[CreateNamedStructLike];
Basically there are 2 cases: one where we have several attributes in the value before IN; the other when there is a single value before IN but the value is a struct. expr.value.isInstanceOf[CreateNamedStructLike] catches both. I can add a comment explaining these 2 cases if you think is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, you mean
expr.values.length > 1 => expr.value.isInstanceOf[CreateNamedStructLike]
but expr.value.isInstanceOf[CreateNamedStructLike] can't => expr.values.length > 1
Can you give an example?
Based on my understanding, the code here is trying to optimize a case when it's not a multi-value in and the list has only one element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I mean that. An example is:
select 1 from (select struct('a', 1, 'b', '2') as a1) t1 where a1 in ((...), ...);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for your case, it's not CreateNamedStructLike, but just a struct type column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, because of optimizations, it is a CreateNamedStructLike
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, I think for this case we should optimize it.
Anyway it follows the previous behavior, we can change it later.
|
Test build #98084 has finished for PR 22029 at commit
|
|
Test build #98089 has finished for PR 22029 at commit
|
|
anymore comments on this @cloud-fan ? Thanks. |
| usage = "expr1 _FUNC_(expr2, expr3, ...) - Returns true if `expr` equals to any valN.", | ||
| usage = """ | ||
| expr1 _FUNC_(expr2, expr3, ...) - Returns true if `expr` equals to any valN. Otherwise, if | ||
| spark.sql.legacy.inOperator.falseForNullField is false and any of the elements or fields of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any of the elements or fields ...
We should explicitly mention multi-column IN, which is different from a in (b, c, ...) while a is struct type.
| "important especially when using NOT IN as in the second case, it filters out the rows " + | ||
| "when a null is present in a field; while in the first one, those rows are returned.") | ||
| .booleanConf | ||
| .createWithDefault(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we set false as default to follow SQL standard? and be consistent with in-subquery
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I agree, let me switch it, thanks.
|
|
||
| override def eval(input: InternalRow): Any = { | ||
| val inputValue = value.eval(input) | ||
| if (checkNullEval(inputValue)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we change behavior here? seems null inset (null, xxx) returns true previously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, because previously we were overriding nullSafeEval and the child of InSet, so null inset (whatever) was returning null
| IsNotNull(i.value) | ||
| } else { | ||
| val valuesNotNull: Seq[Expression] = values.map(IsNotNull) | ||
| valuesNotNull.tail.foldLeft(valuesNotNull.head)(And) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: values.map(IsNotNull).reduce(And)
|
Test build #98318 has finished for PR 22029 at commit
|
|
Do you know how Presto supports multi-value in subquery? By reading the PR description, it seems impossible if Preso treats |
|
@cloud-fan Presto doesn't have structs. Presto has |
|
which Presto version did you test? I tried 0.203 and it fails |
|
@cloud-fan sorry, I mistyped. I meant |
|
Another point: I think it's also important to make the behavior of IN be consistent with EQUAL. I tried PostgreSQL and Shall we update EQUAL first? The behavior of IN will be updated accordingly after we update |
|
@cloud-fan yes, but the behavior of EQUAL is not so consistent among the different DBs. In Hive EQUAL on struct behaves like Spark as of now, in Presto throws exception if there is a null, Postgres is different and it is as you mentioned. So I am not sure we have strong reasons to update how EQUAL works for struct. That doesn't seem to be handled consistently in other DBs. That's why I proposed this change, in order to handle IN consistently without modifying EQUAL for structs. |
|
If we want to follow PostgreSQL/Oracle for the IN behavior, why don't we follow the EQUAL behavior as well? |
|
Oracle doesn't support EQUAL between structs: |
|
If we decide to follow PostgreSQL about the EQUAL behavior eventually, then it will be much easier to fix the IN behavior, right? |
|
I think so, in that case we wouldn't need to distinguish between a struct and a mutli-value IN. |
|
@cloud-fan do you think we can go ahead with this or change EQUAL behavior for structs? thanks. |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Spark's IN operator behaves different from other RDBMS (Oracle, Postgres) when structs containing NULL fields are involved. In this case, Spark returns
false, while other RDBMS returnNULL. This is critical especially when there are NOT IN filters, as Spark doesn't filter rows containing NULLs in that scenario (instead other RDBMS do).The PR proposes to change Spark's IN operator behavior in order to align with the behavior of other RDBMS and introduces a flag which can be used by users to switch back to the previous behavior.
In particular, the proposed change affects only the IN operator when the values are structs (ie. every value is a list of more than one element) and NULL are involved. The proposed behavior can be summarized as follows:
null;trueis returned; otherwise,nullis returnedPlease notice that Hive and Presto, instead, behave as Spark before this change for this specific scenario. But they don't support IN with subqueries returning more than one value. So before this change Spark is anyway the only system where the IN operator behaves incoherently between IN + subquery and IN + literals.
Summarizing:
The PR introduces also a the config flag
spark.sql.legacy.inOperator.falseForNullFieldwhich can be used to turn on/off the new behavior.How was this patch tested?
added UTs