-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-24395][SQL] IN operator should return NULL when comparing struct with NULL fields #22029
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
54ee21a
5974aea
9b28842
655eaa4
062c9fd
3b00217
cf6b3b0
59ff46b
321b9a8
1933e9d
809e80c
389e6de
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -212,27 +212,33 @@ object ReorderAssociativeOperator extends Rule[LogicalPlan] { | |
| * 1. Converts the predicate to false when the list is empty and | ||
| * the value is not nullable. | ||
| * 2. Removes literal repetitions. | ||
| * 3. Replaces [[In (value, seq[Literal])]] with optimized version | ||
| * [[InSet (value, HashSet[Literal])]] which is much faster. | ||
| * 3. Replaces [[In (values, seq[Literal])]] with optimized version | ||
| * [[InSet (values, HashSet[Literal])]] which is much faster. | ||
| */ | ||
| object OptimizeIn extends Rule[LogicalPlan] { | ||
| def apply(plan: LogicalPlan): LogicalPlan = plan transform { | ||
| case q: LogicalPlan => q transformExpressionsDown { | ||
| case In(v, list) if list.isEmpty => | ||
| // When v is not nullable, the following expression will be optimized | ||
| case i @ In(values, list) if list.isEmpty => | ||
| // When values are not nullable, the following expression will be optimized | ||
| // to FalseLiteral which is tested in OptimizeInSuite.scala | ||
| If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType)) | ||
| case expr @ In(v, list) if expr.inSetConvertible => | ||
| val isNotNull = if (SQLConf.get.inFalseForNullField) { | ||
| IsNotNull(i.value) | ||
| } else { | ||
| values.map(IsNotNull).reduce(And) | ||
| } | ||
| If(isNotNull, FalseLiteral, Literal(null, BooleanType)) | ||
| case expr @ In(values, list) if expr.inSetConvertible => | ||
| // if we have more than one element in the values, we have to skip this optimization | ||
| val newList = ExpressionSet(list).toSeq | ||
| if (newList.length == 1 | ||
| // TODO: `EqualTo` for structural types are not working. Until SPARK-24443 is addressed, | ||
| // TODO: we exclude them in this rule. | ||
| && !v.isInstanceOf[CreateNamedStructLike] | ||
| && !expr.value.isInstanceOf[CreateNamedStructLike] | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. According to the implementation,
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes, rigth
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. shall we use
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. not really, because
Basically there are 2 cases: one where we have several attributes in the value before IN; the other when there is a single value before IN but the value is a struct.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IIUC, you mean Can you give an example? Based on my understanding, the code here is trying to optimize a case when it's not a multi-value in and the list has only one element.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes, I mean that. An example is:
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. for your case, it's not
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no, because of optimizations, it is a
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. well, I think for this case we should optimize it. Anyway it follows the previous behavior, we can change it later. |
||
| && !newList.head.isInstanceOf[CreateNamedStructLike]) { | ||
| EqualTo(v, newList.head) | ||
| EqualTo(expr.value, newList.head) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ditto
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no, we do this only when
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. shall we update the match here? I think it should be
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no, sorry, we can't do that, otherwise we would skip the other possible optimizations here, eg. converting to What should be done, instead, is doing the same change to |
||
| } else if (newList.length > SQLConf.get.optimizerInSetConversionThreshold) { | ||
| val hSet = newList.map(e => e.eval(EmptyRow)) | ||
| InSet(v, HashSet() ++ hSet) | ||
| InSet(values, HashSet() ++ hSet) | ||
| } else if (newList.length < list.length) { | ||
| expr.copy(list = newList) | ||
| } else { // newList.length == list.length && newList.length > 1 | ||
|
|
@@ -527,7 +533,7 @@ object NullPropagation extends Rule[LogicalPlan] { | |
| } | ||
|
|
||
| // If the value expression is NULL then transform the In expression to null literal. | ||
| case In(Literal(null, _), _) => Literal.create(null, BooleanType) | ||
| case In(Seq(Literal(null, _)), _) => Literal.create(null, BooleanType) | ||
| case InSubquery(Seq(Literal(null, _)), _) => Literal.create(null, BooleanType) | ||
|
|
||
| // Non-leaf NullIntolerant expressions will return null, if at least one of its children is | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1561,6 +1561,16 @@ object SQLConf { | |
| .booleanConf | ||
| .createWithDefault(false) | ||
|
|
||
| val LEGACY_IN_FALSE_FOR_NULL_FIELD = | ||
| buildConf("spark.sql.legacy.inOperator.falseForNullField") | ||
| .internal() | ||
| .doc("When set to true, the IN operator returns false when comparing multiple values " + | ||
| "containing a null. When set to false (default), it returns null, instead. This is " + | ||
| "important especially when using NOT IN as in the second case, it filters out the rows " + | ||
| "when a null is present in a field; while in the first one, those rows are returned.") | ||
| .booleanConf | ||
| .createWithDefault(false) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we set this
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it can be done, WDYT @cloud-fan @gatorsmile ?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. kindly ping @cloud-fan @gatorsmile |
||
|
|
||
| val LEGACY_INTEGRALDIVIDE_RETURN_LONG = buildConf("spark.sql.legacy.integralDivide.returnBigint") | ||
| .doc("If it is set to true, the div operator returns always a bigint. This behavior was " + | ||
| "inherited from Hive. Otherwise, the return type is the data type of the operands.") | ||
|
|
@@ -1978,6 +1988,8 @@ class SQLConf extends Serializable with Logging { | |
|
|
||
| def setOpsPrecedenceEnforced: Boolean = getConf(SQLConf.LEGACY_SETOPS_PRECEDENCE_ENABLED) | ||
|
|
||
| def inFalseForNullField: Boolean = getConf(SQLConf.LEGACY_IN_FALSE_FOR_NULL_FIELD) | ||
|
|
||
| def integralDivideReturnLong: Boolean = getConf(SQLConf.LEGACY_INTEGRALDIVIDE_RETURN_LONG) | ||
|
|
||
| /** ********************** SQLConf functionality methods ************ */ | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we change behavior here? seems
null inset (null, xxx)returns true previously.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, because previously we were overriding
nullSafeEvaland thechildofInSet, sonull inset (whatever)was returning null