[SPARK-24395][SQL] IN operator should return NULL when comparing struct with NULL fields #22029

mgaido91 · 2018-08-07T16:04:56Z

What changes were proposed in this pull request?

Spark's IN operator behaves differently from other RDBMS (Oracle, Postgres) when structs containing NULL fields are involved. In this case, Spark returns false, while the other RDBMS return NULL. This is especially critical for NOT IN filters: Spark doesn't filter out rows containing NULLs in that scenario, whereas the other RDBMS do.
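To make the NOT IN consequence concrete, here is a minimal sketch in plain Scala of SQL three-valued logic. It is not Spark code: the names `Tri`, `not` and `keptByWhere` are hypothetical, and `None` is assumed to stand for SQL NULL.

```scala
object NotInFilterSketch {
  // SQL three-valued logic: Some(true), Some(false), or None for NULL.
  type Tri = Option[Boolean]

  // NOT NULL is NULL; otherwise the boolean is negated.
  def not(p: Tri): Tri = p.map(!_)

  // A WHERE clause keeps a row only when the predicate is exactly TRUE.
  def keptByWhere(p: Tri): Boolean = p.contains(true)
}
```

If IN evaluates to false, NOT IN evaluates to true and the row is returned; if IN evaluates to NULL, NOT IN is also NULL and the row is filtered out. That is the difference the description refers to.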

The PR proposes to change Spark's IN operator behavior in order to align with the behavior of other RDBMS and introduces a flag which can be used by users to switch back to the previous behavior.

In particular, the proposed change affects only the IN operator when the values are structs (i.e. every value is a list of more than one element) and NULLs are involved. The proposed behavior can be summarized as follows:

If the value before IN contains a NULL element, IN evaluates to null;
If the list of values contains a value with a NULL element, then: if the list also contains the value looked for, true is returned; otherwise, null is returned.
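The two rules above can be sketched in plain Scala as follows. This models only the behavior as described in this PR description, not the actual Catalyst implementation; `Struct` and `inProposed` are hypothetical names, and `None` is assumed to stand for SQL NULL.

```scala
object InSemanticsSketch {
  // A struct is modeled as a sequence of optional fields; None stands for SQL NULL.
  type Struct = Seq[Option[Any]]

  // Result is Some(true), Some(false), or None for SQL NULL.
  def inProposed(value: Struct, list: Seq[Struct]): Option[Boolean] = {
    if (value.exists(_.isEmpty)) None                 // NULL element in the value before IN
    else if (list.contains(value)) Some(true)         // the value looked for is in the list
    else if (list.exists(_.exists(_.isEmpty))) None   // no match, but a list entry has a NULL element
    else Some(false)                                  // no match and no NULLs anywhere
  }
}
```

For example, under these rules `(1, 2) IN ((1, null))` would evaluate to NULL, while `(1, 2) IN ((1, null), (1, 2))` would evaluate to true.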

Please note that Hive and Presto, instead, behave like Spark before this change in this specific scenario. But they don't support IN with subqueries returning more than one value. So before this change Spark is the only system where the IN operator behaves inconsistently between IN + subquery and IN + literals.

Summarizing:

Spark is the only framework which behaves differently with the same data when IN contains literals and when it contains a subquery;
Hive and Presto behave like Spark before the PR when literals are involved (but Presto throws an exception when null is present on both sides...);
Oracle and Postgres behave like Spark after the PR;
Hive and Presto don't support IN with subqueries containing more than 1 output field.

The PR also introduces the config flag spark.sql.legacy.inOperator.falseForNullField, which can be used to turn the new behavior on or off.

How was this patch tested?

added UTs


mgaido91 · 2018-08-07T16:05:32Z

cc @cloud-fan @hvanhovell @juliuszsompolski

SparkQA · 2018-08-07T18:20:19Z

Test build #94382 has finished for PR 22029 at commit 54ee21a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-08-08T03:31:55Z

Is there a clear definition of the expected behavior? I tried Postgres before: it returns null for things like (x, y) = (a, null), but throws an analysis error for things like (x, (y, z)) = (a, (null, b)).

mgaido91 · 2018-08-08T10:29:09Z

@cloud-fan I agree with you that struct comparison behavior is questionable. But the IN (and NOT IN) behavior is consistent between Postgres and Oracle (and different from the current Spark behavior). This is the reason why I decided to propose changing only the behavior of the IN operator, without changing the behavior of the struct comparison itself, which would be a much bigger change.

Anyway, to answer your question: since this PR relates only to IN/NOT IN, yes, there is a clear definition of the expected behavior in the other RDBMS.

PS please also notice that Spark currently has a different behavior for select 1 from (select 1) a where (1, 2) not in ((1, null)) and select 1 from (select 1) a where (1, 2) not in (select (1, null)); while all other RDBMS behave in the same way in the two cases.

cloud-fan · 2018-08-08T10:53:13Z

yes, there is a clear definition for the expected behavior in the other RDBMS.

Can you put the general definition of the IN behavior (without subquery) in the PR description? The PR title only specifies the behavior for a specific case, not a general definition, e.g. nested fields are not mentioned.

mgaido91 · 2018-08-08T11:03:08Z

Can you put the general definition of the IN behavior (without subquery) in the PR description?

Sure, I am changing it. Let me know if the description I am writing is clear/complete enough. Thanks.

nested fields are not mentioned.

They are not mentioned because they are not really possible in other RDBMS. Neither Postgres nor Oracle allows that. So I haven't changed their handling as, I think, the expected behavior in that case is questionable. I think in general, IN using literals and IN using subqueries should return the same result if they are provided with the same data (as happens in the RDBMS). So that case is handled as before the PR, since the comparison of structs has not been changed here.

mgaido91 · 2018-08-08T11:30:03Z

@cloud-fan I updated the description. Let me know if now it is clear enough. Thanks.

cloud-fan · 2018-08-08T14:43:48Z

can we also update the classdoc of In?

SparkQA · 2018-08-08T14:49:09Z

Test build #94418 has finished for PR 22029 at commit 5974aea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-08-08T14:53:44Z

BTW it would be great if this behavior were defined by the SQL standard. And we should also check the big data SQL systems like Hive and Presto. In general we should be careful with uncertain behaviors like this. That said, returning false doesn't seem that wrong to me if we don't consider other systems.

also cc @gatorsmile @dongjoon-hyun

mgaido91 · 2018-08-08T16:43:24Z

Presto and Hive work in different ways.

Hive

Hive behaves like Spark before the PR, but Hive doesn't support IN with subqueries where the subquery outputs more than one field. So Hive does not have the inconsistency which is present in Spark (i.e. with the same data, IN with literals returns a different result than IN with a subquery).

Presto

Presto, instead, behaves like Spark before the PR, but when there is a null both in the value and in the list of literals, it throws an exception, and it behaves in the same way when IN is used with subqueries.

Summarizing:

Spark is the only framework which behaves differently with the same data when IN contains literals and when it contains a subquery;
Hive and Presto behave like Spark before the PR when literals are involved (but Presto throws an exception when null is present on both sides...);
Oracle and Postgres behave like Spark after the PR;
Hive doesn't support IN with subqueries containing more than 1 output field, and Presto behaves differently from Spark, Oracle and Postgres when IN is used with subqueries (but it is consistent with its behavior with literals).

dongjoon-hyun · 2018-08-08T19:56:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+  override def nullable: Boolean = value.dataType match {
+    case _: StructType if !SQLConf.get.inFalseForNullField =>
+      children.exists(_.nullable) ||
+        children.exists(_.dataType.asInstanceOf[StructType].exists(_.nullable))


@mgaido91 . Ur, this asInstanceOf[StructType] looks unsafe for non-nullable primitive type. Could you add an additional test case for the nested StructType to provide the coverage for this line?

Why do you think it is unsafe? We do it only if the dataType of value matches StructType, and, according to what is enforced in checkInputDataTypes, all the children must then be StructTypes.

I will add anyway a test case for the scenario you described, thanks.
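The nullability rule under discussion can be sketched outside of Spark as below. This is a simplified model, not the Catalyst code: `DataType`, `Field`, `StructType` and `Expr` here are hypothetical stand-ins, the fallback branch (`children.exists(_.nullable)`) is an assumption about the non-struct case, and a pattern match is used in place of `asInstanceOf` so that a non-struct child cannot cause a cast failure.

```scala
object NullableSketch {
  // Hypothetical, simplified stand-ins for Catalyst's data types and expressions.
  sealed trait DataType
  case object IntType extends DataType
  final case class Field(name: String, dataType: DataType, nullable: Boolean)
  final case class StructType(fields: Seq[Field]) extends DataType
  final case class Expr(dataType: DataType, nullable: Boolean)

  // With the new behavior (flag false), IN over structs may return NULL if any
  // child is nullable or any struct field of any child is nullable.
  def inNullable(value: Expr, list: Seq[Expr], falseForNullField: Boolean): Boolean = {
    val children = value +: list
    value.dataType match {
      case _: StructType if !falseForNullField =>
        children.exists(_.nullable) ||
          children.exists { c =>
            c.dataType match {
              case StructType(fields) => fields.exists(_.nullable)
              case _ => false // defensive: a non-struct child contributes nothing here
            }
          }
      case _ =>
        // Assumed fallback: only the nullability of the children themselves matters.
        children.exists(_.nullable)
    }
  }
}
```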

dongjoon-hyun · 2018-08-08T20:07:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+        "important especially when using NOT IN as in the second case, it filters out the rows " +
+        "when a null is present in a filed; while in the first one, those rows are returned.")
+      .booleanConf
+      .createWithDefault(false)


Can we set this true by default in Spark 2.4 at least?

it can be done, WDYT @cloud-fan @gatorsmile ?

kindly ping @cloud-fan @gatorsmile

dongjoon-hyun · 2018-08-08T20:37:15Z

Thank you for pinging me, @cloud-fan . For me, new way suggested in this PR looks reasonable. Also, this PR is configurable. We may be able to start to provide this as a non-default way in Spark 2.4.

cloud-fan · 2018-08-09T10:29:59Z

@mgaido91 do you mean the current behavior is same with Hive and Presto?

mgaido91 · 2018-08-09T10:48:17Z

@cloud-fan if we consider only the expression IN with literals, yes, the behavior is very similar, with the following difference: Presto throws an exception when null is present on both sides.

But what I'd really like to highlight is that there is a huge difference: both Hive and Presto (as well as the RDBMS) are consistent in their behavior between IN with literals and IN with subquery. Actually, Hive doesn't support subqueries with more than one output field; Presto instead behaves in the same way in the case with literals and subqueries, which means it behaves like Spark before this PR with literals, but differently from Spark when subqueries are involved.

SparkQA · 2018-08-09T12:03:18Z

Test build #94480 has finished for PR 22029 at commit 9b28842.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-09T12:16:59Z

Test build #94482 has finished for PR 22029 at commit 655eaa4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-08-16T13:05:51Z

@cloud-fan @gatorsmile any thoughts on this?

mgaido91 · 2018-08-27T13:56:39Z

kindly ping @cloud-fan @gatorsmile

mgaido91 · 2018-09-13T07:58:53Z

kindly ping @cloud-fan @gatorsmile

SparkQA · 2018-09-13T12:15:27Z

Test build #96035 has finished for PR 22029 at commit 062c9fd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-10-08T10:43:27Z

@cloud-fan @dongjoon-hyun @gatorsmile any more comments on this?

cloud-fan · 2018-10-26T02:58:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

        // to FalseLiteral which is tested in OptimizeInSuite.scala
-        If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType))
-      case expr @ In(v, list) if expr.inSetConvertible =>
+        If(IsNotNull(i.value), FalseLiteral, Literal(null, BooleanType))


this needs to look at inFalseForNullField right?

right, I fixed it and added a test, thanks.

cloud-fan · 2018-10-26T02:59:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

+          && !expr.value.isInstanceOf[CreateNamedStructLike]
          && !newList.head.isInstanceOf[CreateNamedStructLike]) {
-          EqualTo(v, newList.head)
+          EqualTo(expr.value, newList.head)


no, we do this only when value is not a CreateNamedStructLike, so we don't go here if there are multi-values

shall we update the match here? I think it should be In(Seq(value) ...) now

no, sorry, we can't do that, otherwise we would skip the other possible optimizations here, e.g. converting to InSet, reducing the list of values, etc.

What should be done, instead, is doing the same change to InSet, so that the way nulls are handled is coherent.

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

cloud-fan · 2018-10-26T10:40:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

          // TODO: `EqualTo` for structural types are not working. Until SPARK-24443 is addressed,
          // TODO: we exclude them in this rule.
-          && !v.isInstanceOf[CreateNamedStructLike]
+          && !expr.value.isInstanceOf[CreateNamedStructLike]


  @transient protected lazy val isMultiValued = values.length > 1
  @transient lazy val value: Expression = if (isMultiValued) {
    CreateNamedStruct(values.zipWithIndex.flatMap {
      case (v: NamedExpression, _) => Seq(Literal(v.name), v)
      case (v, idx) => Seq(Literal(s"_$idx"), v)
    })
  } else {
    values.head
  }
}

According to the implementation, expr.value.isInstanceOf[CreateNamedStructLike] means expr.values.length > 1, right?

shall we use expr.values.length == 1 here to make it more clear?

not really, because expr.value.isInstanceOf[CreateNamedStructLike] means:

either expr.values.length > 1;

or expr.values.head.isInstanceOf[CreateNamedStructLike];

Basically there are 2 cases: one where we have several attributes in the value before IN; the other where there is a single value before IN but the value is a struct. expr.value.isInstanceOf[CreateNamedStructLike] catches both. I can add a comment explaining these 2 cases if you think it is needed.

IIUC, you mean
expr.values.length > 1 => expr.value.isInstanceOf[CreateNamedStructLike]
but expr.value.isInstanceOf[CreateNamedStructLike] can't => expr.values.length > 1

Can you give an example?

Based on my understanding, the code here is trying to optimize a case when it's not a multi-value in and the list has only one element.

yes, I mean that. An example is:

select 1 from (select struct('a', 1, 'b', '2') as a1) t1 where a1 in ((...), ...);

for your case, it's not CreateNamedStructLike, but just a struct type column?

no, because of optimizations, it is a CreateNamedStructLike

well, I think for this case we should optimize it.

Anyway it follows the previous behavior, we can change it later.
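The two cases discussed in this thread can be illustrated with a small standalone model. It is only a sketch: `Expr`, `Lit`, `Attr` and this `CreateNamedStruct` are hypothetical simplifications of the Catalyst classes, and `value` mirrors the quoted snippet rather than the real `In` expression.

```scala
object InValueSketch {
  // Hypothetical, simplified stand-ins for Catalyst expressions.
  sealed trait Expr
  final case class Lit(v: Any) extends Expr
  final case class Attr(name: String) extends Expr
  final case class CreateNamedStruct(children: Seq[Expr]) extends Expr

  // Mirrors the quoted logic: several values before IN are wrapped into one struct,
  // a single value is used as-is.
  def value(values: Seq[Expr]): Expr =
    if (values.length > 1) {
      CreateNamedStruct(values.zipWithIndex.flatMap {
        case (a: Attr, _) => Seq[Expr](Lit(a.name), a)
        case (v, idx) => Seq[Expr](Lit(s"_$idx"), v)
      })
    } else {
      values.head
    }

  // True both for multi-value IN and for a single value that is itself a struct.
  def isStructValue(values: Seq[Expr]): Boolean =
    value(values).isInstanceOf[CreateNamedStruct]
}
```

In this model, `isStructValue` is true for `Seq(Attr("a"), Lit(1))` (multi-value IN) and also for a single-element `Seq(CreateNamedStruct(...))`, which is why checking the type of `value` is not the same as checking `values.length > 1`.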

SparkQA · 2018-10-26T12:52:49Z

Test build #98084 has finished for PR 22029 at commit 1933e9d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait InBase extends Predicate
case class InSet(values: Seq[Expression], hset: Set[Any]) extends InBase

SparkQA · 2018-10-26T17:25:05Z

Test build #98089 has finished for PR 22029 at commit 809e80c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-10-31T10:42:28Z

any more comments on this @cloud-fan ? Thanks.

cloud-fan · 2018-10-31T13:29:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

-  usage = "expr1 _FUNC_(expr2, expr3, ...) - Returns true if `expr` equals to any valN.",
+  usage = """
+    expr1 _FUNC_(expr2, expr3, ...) - Returns true if `expr` equals to any valN. Otherwise, if
+      spark.sql.legacy.inOperator.falseForNullField is false and any of the elements or fields of


any of the elements or fields ...

We should explicitly mention multi-column IN, which is different from a in (b, c, ...) while a is struct type.

cloud-fan · 2018-10-31T13:41:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+        "important especially when using NOT IN as in the second case, it filters out the rows " +
+        "when a null is present in a field; while in the first one, those rows are returned.")
+      .booleanConf
+      .createWithDefault(true)


shall we set false as default to follow SQL standard? and be consistent with in-subquery

yes, I agree, let me switch it, thanks.

cloud-fan · 2018-10-31T13:48:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+
+  override def eval(input: InternalRow): Any = {
+    val inputValue = value.eval(input)
+    if (checkNullEval(inputValue)) {


do we change behavior here? seems null inset (null, xxx) returns true previously.

no, because previously we were overriding nullSafeEval and the child of InSet, so null inset (whatever) was returning null

cloud-fan · 2018-10-31T13:50:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

+          IsNotNull(i.value)
+        } else {
+          val valuesNotNull: Seq[Expression] = values.map(IsNotNull)
+          valuesNotNull.tail.foldLeft(valuesNotNull.head)(And)


nit: values.map(IsNotNull).reduce(And)
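As a quick check that the suggested one-liner builds the same expression tree as the `foldLeft` version, here is a sketch with hypothetical, simplified `Expr`, `IsNotNull` and `And` case classes (not the Catalyst ones). Both forms assume a non-empty `values`; on an empty sequence each of them throws.

```scala
object ReduceAndSketch {
  // Hypothetical, simplified stand-ins for Catalyst expressions.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class IsNotNull(child: Expr) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr

  // The form in the diff: fold the tail onto the head.
  def viaFold(values: Seq[Expr]): Expr = {
    val valuesNotNull: Seq[Expr] = values.map(IsNotNull)
    valuesNotNull.tail.foldLeft(valuesNotNull.head)(And)
  }

  // The form suggested in the nit: a left-to-right reduce.
  def viaReduce(values: Seq[Expr]): Expr = {
    val valuesNotNull: Seq[Expr] = values.map(IsNotNull)
    valuesNotNull.reduce(And)
  }
}
```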

SparkQA · 2018-10-31T18:36:07Z

Test build #98318 has finished for PR 22029 at commit 389e6de.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-11-01T07:01:56Z

Do you know how Presto supports multi-value IN subqueries? Reading the PR description, it seems impossible if Presto treats (a, b) as a struct value. How does Presto distinguish (a, b) IN (select x,y ...) and struct_col IN (select x,y ...)?

mgaido91 · 2018-11-01T08:14:40Z

@cloud-fan Presto doesn't have structs. Presto has ROW. (a, b) IN (select (x, y) ...) is a valid subquery in Presto, while for a struct (i.e. a row) the valid syntax is row_col IN (select ROW(x,y) ...). ~~So (a, b) is never considered a struct value in Presto.~~ Does this answer your question?

cloud-fan · 2018-11-01T09:01:26Z

which Presto version did you test? I tried 0.203 and it fails

presto:default> select * from t2 where (1, 2) in (select x, y from t);
Query 20181101_085707_00012_n644a failed: line 1:31: Multiple columns returned by subquery are not yet supported. Found 2

mgaido91 · 2018-11-01T09:16:16Z

@cloud-fan sorry, I mistyped. I meant (a, b) IN (select (x, y) ...) is a valid subquery, but probably my understanding that it was a multivalued subquery is wrong, as I realize now that (x, y) returns Presto's equivalent of a struct. So I have to update the PR description that Presto behaves as Hive, ie. it doesn't support multivalue subqueries. Sorry for the confusion, I am updating the PR description.

cloud-fan · 2018-11-01T10:24:19Z

Another point: I think it's also important to make the behavior of IN be consistent with EQUAL. I tried PostgreSQL and (1, 2) = (3, null) returns null.

Shall we update EQUAL first? The behavior of IN will be updated accordingly after we update EQUAL.

mgaido91 · 2018-11-01T11:31:41Z

@cloud-fan yes, but the behavior of EQUAL is not so consistent among the different DBs. In Hive, EQUAL on structs behaves like Spark as of now; in Presto it throws an exception if there is a null; Postgres is different and behaves as you mentioned. So I am not sure we have strong reasons to update how EQUAL works for structs. That doesn't seem to be handled consistently in other DBs. That's why I proposed this change, in order to handle IN consistently without modifying EQUAL for structs.

cloud-fan · 2018-11-01T12:42:39Z

If we want to follow PostgreSQL/Oracle for the IN behavior, why don't we follow the EQUAL behavior as well?

mgaido91 · 2018-11-01T13:45:11Z

Oracle doesn't support EQUAL between structs: (1, 'a') = (1, 'b') doesn't work on Oracle. Postgres is the only one returning NULL in the case (1, 'a') = (1, null).
My main goal here is to make IN work in the same way (assuming that the same data is involved) both when literals and when subqueries are involved. One of the main reasons is that users who want to debug queries often try to put the expected output of the subquery directly in the IN expression, so a different behavior may be misleading, and in any case a user would at least be surprised by it.

cloud-fan · 2018-11-01T16:12:27Z

If we decide to follow PostgreSQL about the EQUAL behavior eventually, then it will be much easier to fix the IN behavior, right?

mgaido91 · 2018-11-01T18:02:37Z

I think so, in that case we wouldn't need to distinguish between a struct and a multi-value IN.

mgaido91 · 2018-11-10T11:56:31Z

@cloud-fan do you think we can go ahead with this or change EQUAL behavior for structs? thanks.

github-actions · 2020-01-08T00:07:33Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

[SPARK-24395][SQL] IN operator should return NULL when comparing struct with NULL fields

54ee21a

fix ut

5974aea

dongjoon-hyun reviewed Aug 8, 2018


mgaido91 added 2 commits August 9, 2018 09:59

add tests for nested struct

9b28842

update IN doc

655eaa4

Merge branch 'master' into SPARK-24395

062c9fd

mgaido91 added 2 commits October 8, 2018 12:40

Merge branch 'master' into SPARK-24395

3b00217

change default according to comment

cf6b3b0

cloud-fan reviewed Oct 26, 2018


cloud-fan reviewed Oct 26, 2018


migrate also InSet to new management

1933e9d

fix

809e80c

cloud-fan reviewed Oct 31, 2018


address comments

389e6de

dongjoon-hyun added the SQL label Jun 14, 2019

github-actions bot added the Stale label Jan 8, 2020

github-actions bot closed this Jan 9, 2020
