[SPARK-24781][SQL] Using a reference from Dataset in Filter/Sort might not work #21745

viirya · 2018-07-11T09:04:38Z

What changes were proposed in this pull request?

When a filter or sort uses a reference from a Dataset that was not included in the prior select, an AnalysisException occurs, e.g.:

val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
df.select(df("name")).filter(df("id") === 0).show()

org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing from name#5 in operator !Filter (id#6 = 0).;;
!Filter (id#6 = 0)
   +- AnalysisBarrier
      +- Project [name#5]
         +- Project [_1#2 AS name#5, _2#3 AS id#6]
            +- LocalRelation [_1#2, _2#3]

This change updates the rule ResolveMissingReferences so that Filter and Sort operators with a non-empty missingInput are also transformed.
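As a rough illustration of the condition described above, here is a minimal Spark-free sketch. It is not the Analyzer code: attributes are modelled as plain strings, and `Op`, `rewrittenBefore`, and `rewrittenAfter` are hypothetical names introduced only for this example.

```scala
// Simplified stand-in for Catalyst: attributes are plain strings such as "id#6".
object MissingInputSketch {
  // Hypothetical operator model: the attributes its expressions reference,
  // the attributes its child outputs, and whether its expressions are resolved.
  final case class Op(
      references: Set[String],
      childOutput: Set[String],
      exprsResolved: Boolean) {
    // Mirrors the idea of missingInput: referenced attributes the child does not provide.
    def missingInput: Set[String] = references -- childOutput
  }

  // Assumed earlier behaviour: only operators with unresolved expressions were rewritten.
  def rewrittenBefore(op: Op): Boolean = !op.exprsResolved

  // Behaviour proposed here: operators with a non-empty missingInput also qualify.
  def rewrittenAfter(op: Op): Boolean = !op.exprsResolved || op.missingInput.nonEmpty

  def main(args: Array[String]): Unit = {
    // The Filter from the example: df("id") is already resolved, but the
    // child Project only outputs name#5.
    val filter = Op(references = Set("id#6"), childOutput = Set("name#5"), exprsResolved = true)
    println(filter.missingInput)
    println(rewrittenBefore(filter))
    println(rewrittenAfter(filter))
  }
}
```

In this model the Filter is skipped by the old condition because its expression is already resolved, while the new condition picks it up because id#6 is missing from the child output.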

How was this patch tested?

Added tests.

viirya · 2018-07-11T09:22:15Z

cc @ueshin @cloud-fan

ueshin · 2018-07-11T09:18:00Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

            val maybeResolvedExprs = exprs.map(resolveExpression(_, p))
            val (newExprs, newChild) = resolveExprsAndAddMissingAttrs(maybeResolvedExprs, p.child)
-            val missingAttrs = AttributeSet(newExprs) -- AttributeSet(maybeResolvedExprs)
+            val missingAttrs = AttributeSet(newExprs) --


Should we also fix the Aggregate case?

I might be missing something, but how about val missingAttrs = AttributeSet(newExprs) -- p.outputSet?

For Aggregate, I've tested it. Seems ResolveAggregateFunctions already covers it.

Yeah, I think using p.outputSet is simpler. Will update later.

ueshin · 2018-07-11T09:22:13Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/LogicalPlanSuite.scala

+    // but a valid query like `df.select(df("name")).filter(df("id") === 0)` can make a query
+    // like this.
+    val relation = LocalRelation(AttributeReference("a", IntegerType, nullable = true)())
+    val plan = Project(Stream(AttributeReference("b", IntegerType, nullable = true)()), relation)


Why Stream?

No special reason. Just following above test case.

SparkQA · 2018-07-11T11:46:55Z

Test build #92851 has finished for PR 21745 at commit 97837a4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-07-11T14:41:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

   */
-  lazy val resolved: Boolean = expressions.forall(_.resolved) && childrenResolved
+  lazy val resolved: Boolean = expressions.forall(_.resolved) && childrenResolved &&
+    missingInput.isEmpty


missingInput is special: in most cases we can't resolve it. I think that's why we didn't consider it in resolved in the first place.

We can update the if condition in ResolveMissingReferences to take missingInput into consideration.

Yeah, I found that this change causes one test failure.

gatorsmile · 2018-07-11T15:34:32Z

Which PR caused this regression?

CC @jerryshao We need to block the 2.3.2 release until this issue is addressed.

cloud-fan · 2018-07-12T03:34:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

            val maybeResolvedExprs = exprs.map(resolveExpression(_, p))
            val (newExprs, newChild) = resolveExprsAndAddMissingAttrs(maybeResolvedExprs, p.child)
-            val missingAttrs = AttributeSet(newExprs) -- AttributeSet(maybeResolvedExprs)
+            // The resolved attributes might not come from `p.child`. Need to filter it.


How can this happen? If the resolved attributes do not exist in the child, then the plan is invalid, isn't it?

At least, this case was resolved in ResolveMissingReferences in spark-v2.2.

viirya · 2018-07-12T03:43:48Z

Sorry for replying via email. The previously failing test case has a GROUPING with resolved references. Since it is unresolved itself, the rule goes through the underlying Project, and newExprs contains resolved references coming from parents of this Project.

maropu · 2018-07-12T03:50:29Z

@gatorsmile It seems the AnalysisBarrier commit causes this error, so v2.2 does not have this issue:


scala> df.select(df("name")).filter(df("id") === 0).explain(true)
== Parsed Logical Plan ==
!Filter (id#26 = 0)
+- Project [name#25]
   +- Project [_1#22 AS name#25, _2#23 AS id#26]
      +- LocalRelation [_1#22, _2#23]

== Analyzed Logical Plan ==
name: string
Project [name#25]
+- Filter (id#26 = 0)
   +- Project [name#25, id#26]
      +- Project [_1#22 AS name#25, _2#23 AS id#26]
         +- LocalRelation [_1#22, _2#23]
...

=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveMissingReferences ===
!!Filter (id#26 = 0)                                Project [name#25]
!+- Project [name#25]                               +- Filter (id#26 = 0)
!   +- Project [_1#22 AS name#25, _2#23 AS id#26]      +- Project [name#25, id#26]
!      +- LocalRelation [_1#22, _2#23]                    +- Project [_1#22 AS name#25, _2#23 AS id#26]
!                                                            +- LocalRelation [_1#22, _2#23]

gatorsmile · 2018-07-12T05:14:18Z

We might need to get rid of AnalysisBarrier in the next release. This already caused at least three regressions in 2.3

viirya · 2018-07-12T05:39:14Z

I checked out commit 82183f7, which precedes the AnalysisBarrier commit.

scala> val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
18/07/12 05:36:52 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [name: string, id: int]

scala> df.select(df("name")).filter(df("id") === 0).show()
org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing from name#5 in operator !Filter (id#6 = 0).;;
!Filter (id#6 = 0)
+- Project [name#5]
   +- Project [_1#2 AS name#5, _2#3 AS id#6]
      +- LocalRelation [_1#2, _2#3]

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:89)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:291)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:89)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:53)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:168)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:174)
  at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
  at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3240)
  at org.apache.spark.sql.Dataset.filter(Dataset.scala:1403)
  ... 49 elided

Looks like it was already failing at that time?

ueshin · 2018-07-12T05:45:25Z

Actually, this regression was first introduced at 7463a88.
We added !f.resolved && to the ResolveMissingReferences rule there, but after that the problem became more complicated because we added the AnalysisBarrier and refactored several times.

SparkQA · 2018-07-12T05:53:55Z

Test build #92911 has finished for PR 21745 at commit b99d0c7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-07-12T07:13:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

            val (newExprs, newChild) = resolveExprsAndAddMissingAttrs(maybeResolvedExprs, p.child)
-            val missingAttrs = AttributeSet(newExprs) -- AttributeSet(maybeResolvedExprs)
+            // Only add missing attributes coming from `newChild`.
+            val missingAttrs = (AttributeSet(newExprs) -- p.outputSet).intersect(newChild.outputSet)


I'm asking a second time, but don't we need to fix the Aggregate case too? The logic seems completely different. Or can we remove the Aggregate case if ResolveAggregateFunctions can handle this? I don't think we have any reason to keep wrong logic.

Thanks. I think it's better to have a reproducible test case before changing the Aggregate case. I'm trying to create one. Then we can change the Aggregate case with more confidence.

Actually I found another place we need to fix. Seems we don't have enough test coverage for similar features.

The logic gets convoluted here and we need to add comments. Basically we need to explain when we should expand the project list.

SparkQA · 2018-07-12T10:18:02Z

Test build #92917 has finished for PR 21745 at commit eff3af2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class JavaSummarizerExample
trait ComplexTypeMergingExpression extends Expression
case class Size(child: Expression) extends UnaryExpression with ExpectsInputTypes
case class MapConcat(children: Seq[Expression]) extends Expression
case class StreamingGlobalLimitStrategy(outputMode: OutputMode) extends Strategy
case class StreamingGlobalLimitExec(
sealed trait MultipleWatermarkPolicy
case class WatermarkTracker(policy: MultipleWatermarkPolicy) extends Logging
trait MemorySinkBase extends BaseStreamingSink
class MemorySink(val schema: StructType, outputMode: OutputMode) extends Sink
class MemoryWriter(sink: MemorySinkV2, batchId: Long, outputMode: OutputMode)
class MemoryStreamWriter(val sink: MemorySinkV2, outputMode: OutputMode)

SparkQA · 2018-07-12T10:53:46Z

Test build #92918 has finished for PR 21745 at commit 6eda8d2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-12T11:56:48Z

Test build #92920 has finished for PR 21745 at commit 8432b00.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-07-12T14:26:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+    /**
+     * This method tries to resolve expressions and find missing attributes recursively. Specially,
+     * when the expressions used in `Sort` or `Filter` contain unresolved attributes or resolved
+     * attributes which are missed from SELECT clause. This method tries to find the missing


which are missed from child output

cloud-fan · 2018-07-12T14:28:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

-            val missingAttrs = AttributeSet(newExprs) -- AttributeSet(maybeResolvedExprs)
+            // If some attributes used by expressions are resolvable only on the rewritten child
+            // plan, we need to add them into original projection.
+            val missingAttrs = (AttributeSet(newExprs) -- p.outputSet).intersect(newChild.outputSet)


what if we do not do the .intersect(newChild.outputSet)?

Without this intersect, some tests fail, e.g., group-analytics.sql in SQLQueryTestSuite. Some attributes are resolved on parent plans, not on child plans. We can't add them as missing attributes here.
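To make the role of the intersect concrete, here is a small sketch using plain Scala sets in place of AttributeSet. The attribute names and the `missingAttrs` helper are illustrative assumptions, not the Analyzer code.

```scala
// Plain string sets stand in for Catalyst's AttributeSet in this sketch.
object MissingAttrsSketch {
  // Same shape as the expression under review:
  // (AttributeSet(newExprs) -- p.outputSet).intersect(newChild.outputSet)
  def missingAttrs(
      newExprs: Set[String],
      projectOutput: Set[String],
      newChildOutput: Set[String]): Set[String] =
    (newExprs -- projectOutput).intersect(newChildOutput)

  def main(args: Array[String]): Unit = {
    // id#6 can be supplied by the rewritten child; g#9 is a hypothetical
    // attribute that was resolved on a parent plan instead.
    val newExprs = Set("id#6", "g#9")
    val projectOutput = Set("name#5")
    val newChildOutput = Set("name#5", "id#6")

    // Without the intersect, g#9 would wrongly be treated as a missing attribute.
    println(newExprs -- projectOutput)
    // With the intersect, only attributes the child can actually provide remain.
    println(missingAttrs(newExprs, projectOutput, newChildOutput))
  }
}
```

In this model only id#6 is added to the projection; g#9 is left for whichever parent operator resolved it.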

SparkQA · 2018-07-12T16:57:23Z

Test build #92935 has finished for PR 21745 at commit 860d433.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-07-13T01:23:31Z

LGTM

gatorsmile · 2018-07-13T03:40:02Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

+    val sort2 = df.select(col("name")).orderBy(col("id"))
+    checkAnswer(sort1, sort2.collect())
+
+    withSQLConf(SQLConf.DATAFRAME_RETAIN_GROUP_COLUMNS.key -> "false") {


This test case should be split into two.

Will update it in next commit.

gatorsmile · 2018-07-13T03:41:20Z

LGTM

SparkQA · 2018-07-13T04:30:42Z

Test build #92958 has finished for PR 21745 at commit a98f416.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-13T06:59:37Z

Test build #92961 has finished for PR 21745 at commit 9e00db9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-07-13T07:02:26Z

retest this please.

SparkQA · 2018-07-13T11:01:55Z

Test build #92964 has finished for PR 21745 at commit 9e00db9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-13T15:24:35Z

Thanks! Merged to master/2.3

…t not work

## What changes were proposed in this pull request?

When we use a reference from Dataset in filter or sort, which was not used in the prior select, an AnalysisException occurs, e.g.,

```scala
val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
df.select(df("name")).filter(df("id") === 0).show()
```

```scala
org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing from name#5 in operator !Filter (id#6 = 0).;;
!Filter (id#6 = 0)
   +- AnalysisBarrier
      +- Project [name#5]
         +- Project [_1#2 AS name#5, _2#3 AS id#6]
            +- LocalRelation [_1#2, _2#3]
```

This change updates the rule `ResolveMissingReferences` so `Filter` and `Sort` with non-empty `missingInputs` will also be transformed.

## How was this patch tested?

Added tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes #21745 from viirya/SPARK-24781.

(cherry picked from commit dfd7ac9)
Signed-off-by: Xiao Li <[email protected]>

Resolved references from Dataset should be checked if it is missed from plan.

97837a4

ueshin reviewed Jul 11, 2018

cloud-fan reviewed Jul 11, 2018

Address comments and fix bug.

b99d0c7

cloud-fan reviewed Jul 12, 2018

viirya added 3 commits July 12, 2018 06:51

Fix bug.

38a935d

Merge remote-tracking branch 'upstream/master' into SPARK-24781

eff3af2

Remove added test.

6eda8d2

ueshin reviewed Jul 12, 2018

Add more tests and deal with aggregate.

8432b00

Add comments.

860d433

cloud-fan reviewed Jul 12, 2018

Update comment.

a98f416

gatorsmile reviewed Jul 13, 2018

Split original test case to two test cases.

9e00db9

asfgit closed this in dfd7ac9 Jul 13, 2018

viirya deleted the SPARK-24781 branch December 27, 2023 18:21
