Skip to content

Conversation

@mgaido91
Copy link
Contributor

@mgaido91 mgaido91 commented Aug 7, 2018

What changes were proposed in this pull request?

Spark's IN operator behaves different from other RDBMS (Oracle, Postgres) when structs containing NULL fields are involved. In this case, Spark returns false, while other RDBMS return NULL. This is critical especially when there are NOT IN filters, as Spark doesn't filter rows containing NULLs in that scenario (instead other RDBMS do).

The PR proposes to change Spark's IN operator behavior in order to align with the behavior of other RDBMS and introduces a flag which can be used by users to switch back to the previous behavior.

In particular, the proposed change affects only the IN operator when the values are structs (ie. every value is a list of more than one element) and NULL are involved. The proposed behavior can be summarized as follows:

  • If the value before IN contains a NULL element, IN evaluates to null;
  • If the list of values contains a values with a NULL element, then: If the list contains also the value looked for, true is returned; otherwise, null is returned

Please notice that Hive and Presto, instead, behave as Spark before this change for this specific scenario. But they don't support IN with subqueries returning more than one value. So before this change Spark is anyway the only system where the IN operator behaves incoherently between IN + subquery and IN + literals.

Summarizing:

  • Spark is the only framework which behaves differently with the same data when IN contains literals and subquery;
  • Hive and Presto behave as Spark before the PR when literals are involved (but Presto throws exception when null is present in both sides...);
  • Oracle, Postgres behaves like Spark after the PR;
  • Hive and Presto don't support In with subqueries containing more than 1 output field.

The PR introduces also a the config flag spark.sql.legacy.inOperator.falseForNullField which can be used to turn on/off the new behavior.

How was this patch tested?

added UTs

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 7, 2018

@SparkQA
Copy link

SparkQA commented Aug 7, 2018

Test build #94382 has finished for PR 22029 at commit 54ee21a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

Is there a clear definition for the expected behavior? I tried postgre before, it returns null for things like (x, y) = (a, null), but throws analysis error for things like (x, (y, z)) = (a, (null, b))

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 8, 2018

@cloud-fan I agree with you that struct comparison behavior is questionable. But the IN (and NOT IN) behavior is coherent among Postgres, Oracle (and different from the current Spark behavior). This is the reason why I decided to propose to change only the behavior of the IN operator without changing the behavior of the struct comparison itself, which would be a much bigger change.

Anyway, answering to your question, since this PR relates only to IN/NOT IN, yes, there is a clear definition for the expected behavior in the other RDBMS.

PS please also notice that Spark currently has a different behavior for select 1 from (select 1) a where (1, 2) not in ((1, null)) and select 1 from (select 1) a where (1, 2) not in (select (1, null)); while all other RDBMS behave in the same way in the two cases.

@cloud-fan
Copy link
Contributor

yes, there is a clear definition for the expected behavior in the other RDBMS.

Can you put the general definition of the IN behavior (without subquery) in the PR description? The PR titile only specified the behavior for a specific case, not a general definition. e.g. nested fields are not mentioned.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 8, 2018

Can you put the general definition of the IN behavior (without subquery) in the PR description?

Sure, I am changing it. Let me know if the description I am writing is clear/complete enough. Thanks.

nested fields are not mentioned.

They are not mentioned as in other RDBMS are not really possible. Neither Postgres nor Oracle allow that. So I haven't changed their handling as - I think - the expected behavior in that case is questionable. I think in general, IN using literal and IN using subqueries should return the same result if they are provided with the same data (as it happens in the RDBMS). So that case is handled as before the PR since the comparison of structs has not been changed here.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 8, 2018

@cloud-fan I updated the description. Let me know if now it is clear enough. Thanks.

@cloud-fan
Copy link
Contributor

can we also update the classdoc of In?

@SparkQA
Copy link

SparkQA commented Aug 8, 2018

Test build #94418 has finished for PR 22029 at commit 5974aea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

cloud-fan commented Aug 8, 2018

BTW it will be great if this behavior is defined by the SQL standard. And we should also check the big data SQL systems like Hive and Presto. In general we should be careful for uncertain behaviors like this. That said, I feel returing false is not that wrong to me without considering other systems.

also cc @gatorsmile @dongjoon-hyun

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 8, 2018

Presto and Hive work in different ways.

Hive

Hive behaves as Spark before the PR, but Hive doesn't support IN with subqueries where the subquery outputs more than one field. So in Hive there is not the incoherence which is present in Spark (ie. with the same data IN with literals returns a different result than IN with subquery).

Presto

Presto, instead, behaves as Spark before the PR, but when there is a null both in the value and in the list of literal, it throws an exception and it behaves in the same way also when IN is used with subqueries.

Summarizing:

  • Spark is the only framework which behaves differently with the same data when IN contains literals and subquery;
  • Hive and Presto behave as Spark before the PR when literals are involved (but Presto throws exception when null is present in both sides...);
  • Oracle, Postgres behaves like Spark after the PR;
  • Hive doesn't support In with subqueries containing more than 1 output field and Presto behaves differently from Spark, Oracle, Postgres when IN is used with subqueries (but it is coherent with its behavior with literals).

override def nullable: Boolean = value.dataType match {
case _: StructType if !SQLConf.get.inFalseForNullField =>
children.exists(_.nullable) ||
children.exists(_.dataType.asInstanceOf[StructType].exists(_.nullable))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91 . Ur, this asInstanceOf[StructType] looks unsafe for non-nullable primitive type. Could you add an additional test case for the nested StructType to provide the coverage for this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you think it is unsafe? We are doing it only if the dataType of value match StructType and according also to what is enforced in the checkInputDataTypes, therefore all the children must be StructTypes.

I will add anyway a test case for the scenario you described, thanks.

"important especially when using NOT IN as in the second case, it filters out the rows " +
"when a null is present in a filed; while in the first one, those rows are returned.")
.booleanConf
.createWithDefault(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we set this true by default in Spark 2.4 at least?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it can be done, WDYT @cloud-fan @gatorsmile ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kindly ping @cloud-fan @gatorsmile

@dongjoon-hyun
Copy link
Member

Thank you for pinging me, @cloud-fan . For me, new way suggested in this PR looks reasonable. Also, this PR is configurable. We may be able to start to provide this as a non-default way in Spark 2.4.

@cloud-fan
Copy link
Contributor

@mgaido91 do you mean the current behavior is same with Hive and Presto?

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 9, 2018

@cloud-fan if we consider only the expression IN with literals, yes, the behavior is very similar, with the following difference: Presto throws exception when null is present on both sides.

But what I'd really like to highlight is that there is a huge difference: both Hive and Presto (as well as the RDBMs are coherent in their behavior between IN with literals and IN with subquery). Actually, Hive doesn't support subqueries with more than one output field; Presto instead behaves in the same way in the case with literals and subqueries, which means it behaves like Spark before this PR with literals, but different from Spark when subqueries are involved.

@SparkQA
Copy link

SparkQA commented Aug 9, 2018

Test build #94480 has finished for PR 22029 at commit 9b28842.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 9, 2018

Test build #94482 has finished for PR 22029 at commit 655eaa4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

@cloud-fan @gatorsmile any thoughts on this?

@mgaido91
Copy link
Contributor Author

kindly ping @cloud-fan @gatorsmile

@mgaido91
Copy link
Contributor Author

kindly ping @cloud-fan @gatorsmile

@SparkQA
Copy link

SparkQA commented Sep 13, 2018

Test build #96035 has finished for PR 22029 at commit 062c9fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Oct 8, 2018

@cloud-fan @dongjoon-hyun @gatorsmile anymore comments on this?

// to FalseLiteral which is tested in OptimizeInSuite.scala
If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType))
case expr @ In(v, list) if expr.inSetConvertible =>
If(IsNotNull(i.value), FalseLiteral, Literal(null, BooleanType))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to look at inFalseForNullField right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rigth, I fixed it and added a test, thanks.

&& !expr.value.isInstanceOf[CreateNamedStructLike]
&& !newList.head.isInstanceOf[CreateNamedStructLike]) {
EqualTo(v, newList.head)
EqualTo(expr.value, newList.head)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, we do this only when value is not a CreateNamedStructLike, so we don't go here if there are multi-values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we update the match here? I think it should be In(Seq(vaue) ...) now

Copy link
Contributor Author

@mgaido91 mgaido91 Oct 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, sorry, we can't do that, otherwise we would skip the other possible optimizations here, eg. converting to InSet, reducing the list of values, etc.etc.

What should be done, instead, is doing the same change to InSet, so that the way nulls are handled is coherent.

// TODO: `EqualTo` for structural types are not working. Until SPARK-24443 is addressed,
// TODO: we exclude them in this rule.
&& !v.isInstanceOf[CreateNamedStructLike]
&& !expr.value.isInstanceOf[CreateNamedStructLike]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  @transient protected lazy val isMultiValued = values.length > 1
  @transient lazy val value: Expression = if (isMultiValued) {
    CreateNamedStruct(values.zipWithIndex.flatMap {
      case (v: NamedExpression, _) => Seq(Literal(v.name), v)
      case (v, idx) => Seq(Literal(s"_$idx"), v)
    })
  } else {
    values.head
  }
}

According to the implementation, expr.value.isInstanceOf[CreateNamedStructLike] means expr.values.length > 1, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, rigth

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we use expr.values.length == 1 here to make it more clear?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really, because expr.value.isInstanceOf[CreateNamedStructLike] means:

  • either expr.values.length == 1;
  • or expr.values.head.isInstanceOf[CreateNamedStructLike];

Basically there are 2 cases: one where we have several attributes in the value before IN; the other when there is a single value before IN but the value is a struct. expr.value.isInstanceOf[CreateNamedStructLike] catches both. I can add a comment explaining these 2 cases if you think is needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, you mean
expr.values.length > 1 => expr.value.isInstanceOf[CreateNamedStructLike]
but expr.value.isInstanceOf[CreateNamedStructLike] can't => expr.values.length > 1

Can you give an example?

Based on my understanding, the code here is trying to optimize a case when it's not a multi-value in and the list has only one element.

Copy link
Contributor Author

@mgaido91 mgaido91 Oct 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I mean that. An example is:

select 1 from (select struct('a', 1, 'b', '2') as a1) t1 where a1 in ((...), ...);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for your case, it's not CreateNamedStructLike, but just a struct type column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, because of optimizations, it is a CreateNamedStructLike

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, I think for this case we should optimize it.

Anyway it follows the previous behavior, we can change it later.

@SparkQA
Copy link

SparkQA commented Oct 26, 2018

Test build #98084 has finished for PR 22029 at commit 1933e9d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait InBase extends Predicate
  • case class InSet(values: Seq[Expression], hset: Set[Any]) extends InBase

@SparkQA
Copy link

SparkQA commented Oct 26, 2018

Test build #98089 has finished for PR 22029 at commit 809e80c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

anymore comments on this @cloud-fan ? Thanks.

usage = "expr1 _FUNC_(expr2, expr3, ...) - Returns true if `expr` equals to any valN.",
usage = """
expr1 _FUNC_(expr2, expr3, ...) - Returns true if `expr` equals to any valN. Otherwise, if
spark.sql.legacy.inOperator.falseForNullField is false and any of the elements or fields of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any of the elements or fields ...

We should explicitly mention multi-column IN, which is different from a in (b, c, ...) while a is struct type.

"important especially when using NOT IN as in the second case, it filters out the rows " +
"when a null is present in a field; while in the first one, those rows are returned.")
.booleanConf
.createWithDefault(true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we set false as default to follow SQL standard? and be consistent with in-subquery

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I agree, let me switch it, thanks.


override def eval(input: InternalRow): Any = {
val inputValue = value.eval(input)
if (checkNullEval(inputValue)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we change behavior here? seems null inset (null, xxx) returns true previously.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, because previously we were overriding nullSafeEval and the child of InSet, so null inset (whatever) was returning null

IsNotNull(i.value)
} else {
val valuesNotNull: Seq[Expression] = values.map(IsNotNull)
valuesNotNull.tail.foldLeft(valuesNotNull.head)(And)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: values.map(IsNotNull).reduce(And)

@SparkQA
Copy link

SparkQA commented Oct 31, 2018

Test build #98318 has finished for PR 22029 at commit 389e6de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

Do you know how Presto supports multi-value in subquery? By reading the PR description, it seems impossible if Preso treats (a, b) as a struct value. How Preso distinguish (a, b) IN (select x,y ...) and struct_col IN (select x,y ...)?

@mgaido91
Copy link
Contributor Author

mgaido91 commented Nov 1, 2018

@cloud-fan Presto doesn't have structs. Presto has ROW. (a, b) IN (select (x, y) ...) is a valid subquery in Presto, while for a struct (ie. a row) the valid syntax is row_col IN (select ROW(x,y) ...). So (a, b) is never considered a struct value in Presto. Did this answer?

@cloud-fan
Copy link
Contributor

which Presto version did you test? I tried 0.203 and it fails

presto:default> select * from t2 where (1, 2) in (select x, y from t);
Query 20181101_085707_00012_n644a failed: line 1:31: Multiple columns returned by subquery are not yet supported. Found 2

@mgaido91
Copy link
Contributor Author

mgaido91 commented Nov 1, 2018

@cloud-fan sorry, I mistyped. I meant (a, b) IN (select (x, y) ...) is a valid subquery, but probably my understanding that it was a multivalued subquery is wrong, as I realize now that (x, y) returns Presto's equivalent of a struct. So I have to update the PR description that Presto behaves as Hive, ie. it doesn't support multivalue subqueries. Sorry for the confusion, I am updating the PR description.

@cloud-fan
Copy link
Contributor

Another point: I think it's also important to make the behavior of IN be consistent with EQUAL. I tried PostgreSQL and (1, 2) = (3, null) returns null.

Shall we update EQUAL first? The behavior of IN will be updated accordingly after we update EQUAL.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Nov 1, 2018

@cloud-fan yes, but the behavior of EQUAL is not so consistent among the different DBs. In Hive EQUAL on struct behaves like Spark as of now, in Presto throws exception if there is a null, Postgres is different and it is as you mentioned. So I am not sure we have strong reasons to update how EQUAL works for struct. That doesn't seem to be handled consistently in other DBs. That's why I proposed this change, in order to handle IN consistently without modifying EQUAL for structs.

@cloud-fan
Copy link
Contributor

If we want to follow PostgreSQL/Oracle for the IN behavior, why don't we follow the EQUAL behavior as well?

@mgaido91
Copy link
Contributor Author

mgaido91 commented Nov 1, 2018

Oracle doesn't support EQUAL between structs: (1, 'a') = (1, 'b') doesn't work on Oracle. Postgres is the only one returning NULL in the case (1, 'a') = (1, null).
My main reason here is to make IN working in the same way (assuming that the same data is involved) both when literals and when subqueries are involved: this is my goal. One of the main reasons is that often users who want to debug queries try to put the expected output in the subquery in the IN expression so having a different behavior may be deceived and anyway a user would be at least surprised by this behavior.

@cloud-fan
Copy link
Contributor

If we decide to follow PostgreSQL about the EQUAL behavior eventually, then it will be much easier to fix the IN behavior, right?

@mgaido91
Copy link
Contributor Author

mgaido91 commented Nov 1, 2018

I think so, in that case we wouldn't need to distinguish between a struct and a mutli-value IN.

@mgaido91
Copy link
Contributor Author

@cloud-fan do you think we can go ahead with this or change EQUAL behavior for structs? thanks.

@github-actions
Copy link

github-actions bot commented Jan 8, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 8, 2020
@github-actions github-actions bot closed this Jan 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants