-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-24781][SQL] Using a reference from Dataset in Filter/Sort might not work #21745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| val maybeResolvedExprs = exprs.map(resolveExpression(_, p)) | ||
| val (newExprs, newChild) = resolveExprsAndAddMissingAttrs(maybeResolvedExprs, p.child) | ||
| val missingAttrs = AttributeSet(newExprs) -- AttributeSet(maybeResolvedExprs) | ||
| val missingAttrs = AttributeSet(newExprs) -- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also fix in Aggregate case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might miss something, but how about val missingAttrs = AttributeSet(newExprs) -- p.outputSet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Aggregate, I've tested it. Seems ResolveAggregateFunctions already covers it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think using p.outputSet is simpler. Will update later.
| // but a valid query like `df.select(df("name")).filter(df("id") === 0)` can make a query | ||
| // like this. | ||
| val relation = LocalRelation(AttributeReference("a", IntegerType, nullable = true)()) | ||
| val plan = Project(Stream(AttributeReference("b", IntegerType, nullable = true)()), relation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why Stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No special reason. Just following above test case.
|
Test build #92851 has finished for PR 21745 at commit
|
| */ | ||
| lazy val resolved: Boolean = expressions.forall(_.resolved) && childrenResolved | ||
| lazy val resolved: Boolean = expressions.forall(_.resolved) && childrenResolved && | ||
| missingInput.isEmpty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missingInput is special, mostly we can't resolve it. I think that's why we didn't consider it in the resolved at the first place.
We can update the if condition in ResolveMissingReferences to take missingInput into consideration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I found that this change causes one test failure.
|
Which PR caused this regression? CC @jerryshao We need to block 2.3.2 release before addressing this issue |
| val maybeResolvedExprs = exprs.map(resolveExpression(_, p)) | ||
| val (newExprs, newChild) = resolveExprsAndAddMissingAttrs(maybeResolvedExprs, p.child) | ||
| val missingAttrs = AttributeSet(newExprs) -- AttributeSet(maybeResolvedExprs) | ||
| // The resolved attributes might not come from `p.child`. Need to filter it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can this happen? if the resolved attributes do not exist in child, then the plan is invalid, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least, this case was resolved in ResolveMissingReferences in spark-v2.2.
|
Sorry replying via email.
The previously failed test case has a GROUPING with resolved references.
Since it's unresolved itself, the rule will go through underlying Project
and newExprs have resolved references coming from parents of this Project.
…On Thu, Jul 12, 2018, 12:35 PM Wenchen Fan ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
<#21745 (comment)>:
> @@ -1163,7 +1165,8 @@ class Analyzer(
case p: Project =>
val maybeResolvedExprs = exprs.map(resolveExpression(_, p))
val (newExprs, newChild) = resolveExprsAndAddMissingAttrs(maybeResolvedExprs, p.child)
- val missingAttrs = AttributeSet(newExprs) -- AttributeSet(maybeResolvedExprs)
+ // The resolved attributes might not come from `p.child`. Need to filter it.
how can this happen? if the resolved attributes do not exist in child,
then the plan is invalid, isn't it?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#21745 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEM9343BTd9ldDog_ZVBhQ-oWsXltktks5uFsPmgaJpZM4VKwgt>
.
|
|
@gatorsmile It seems the AnalysisBarrier commit causes this error, so v2.2 does not have this issue; |
|
We might need to get rid of AnalysisBarrier in the next release. This already caused at least three regressions in 2.3 |
|
I tried to checkout the commit 82183f7 which is before AnalysisBarrier commit. scala> val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
18/07/12 05:36:52 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [name: string, id: int]
scala> df.select(df("name")).filter(df("id") === 0).show()
org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing from name#5 in operator !Filter (id#6 = 0).;;
!Filter (id#6 = 0)
+- Project [name#5]
+- Project [_1#2 AS name#5, _2#3 AS id#6]
+- LocalRelation [_1#2, _2#3]
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:89)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:291)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:89)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:53)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:168)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:174)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3240)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1403)
... 49 elidedLooks like it is already failed at that time? |
|
Actually, the very first time we introduced this regression was at 7463a88. |
|
Test build #92911 has finished for PR 21745 at commit
|
| val (newExprs, newChild) = resolveExprsAndAddMissingAttrs(maybeResolvedExprs, p.child) | ||
| val missingAttrs = AttributeSet(newExprs) -- AttributeSet(maybeResolvedExprs) | ||
| // Only add missing attributes coming from `newChild`. | ||
| val missingAttrs = (AttributeSet(newExprs) -- p.outputSet).intersect(newChild.outputSet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a second time, but we need to fix in Aggregate case? The logic seems completely different. Or can we remove Aggregate case if ResolveAggregateFunctions can handle this? I don't think we have any reason to keep a wrong logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I think it's better to have a re-producible test case before changing Aggregate case. I'm trying to create a test case for it. Then it can be more confident to change Aggregate case.
Actually I found another place we need to fix. Seems we don't have enough test coverage for similar features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic gets convoluted here and we need to add comments. Basically we need to explain when we should expand the project list.
|
Test build #92917 has finished for PR 21745 at commit
|
|
Test build #92918 has finished for PR 21745 at commit
|
|
Test build #92920 has finished for PR 21745 at commit
|
| /** | ||
| * This method tries to resolve expressions and find missing attributes recursively. Specially, | ||
| * when the expressions used in `Sort` or `Filter` contain unresolved attributes or resolved | ||
| * attributes which are missed from SELECT clause. This method tries to find the missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which are missed from child output
| val missingAttrs = AttributeSet(newExprs) -- AttributeSet(maybeResolvedExprs) | ||
| // If some attributes used by expressions are resolvable only on the rewritten child | ||
| // plan, we need to add them into original projection. | ||
| val missingAttrs = (AttributeSet(newExprs) -- p.outputSet).intersect(newChild.outputSet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we do not do the .intersect(newChild.outputSet)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this intersect, some tests fail, e.g., group-analytics.sql in SQLQueryTestSuite. Some attributes are resolved on parent plans, not on child plans. We can't add them as missing attributes here.
|
Test build #92935 has finished for PR 21745 at commit
|
|
LGTM |
| val sort2 = df.select(col("name")).orderBy(col("id")) | ||
| checkAnswer(sort1, sort2.collect()) | ||
|
|
||
| withSQLConf(SQLConf.DATAFRAME_RETAIN_GROUP_COLUMNS.key -> "false") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case should be split to two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update it in next commit.
|
LGTM |
|
Test build #92958 has finished for PR 21745 at commit
|
|
Test build #92961 has finished for PR 21745 at commit
|
|
retest this please. |
|
Test build #92964 has finished for PR 21745 at commit
|
|
Thanks! Merged to master/2.3 |
…t not work
## What changes were proposed in this pull request?
When we use a reference from Dataset in filter or sort, which was not used in the prior select, an AnalysisException occurs, e.g.,
```scala
val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
df.select(df("name")).filter(df("id") === 0).show()
```
```scala
org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing from name#5 in operator !Filter (id#6 = 0).;;
!Filter (id#6 = 0)
+- AnalysisBarrier
+- Project [name#5]
+- Project [_1#2 AS name#5, _2#3 AS id#6]
+- LocalRelation [_1#2, _2#3]
```
This change updates the rule `ResolveMissingReferences` so `Filter` and `Sort` with non-empty `missingInputs` will also be transformed.
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <[email protected]>
Closes #21745 from viirya/SPARK-24781.
(cherry picked from commit dfd7ac9)
Signed-off-by: Xiao Li <[email protected]>
What changes were proposed in this pull request?
When we use a reference from Dataset in filter or sort, which was not used in the prior select, an AnalysisException occurs, e.g.,
This change updates the rule
ResolveMissingReferencessoFilterandSortwith non-emptymissingInputswill also be transformed.How was this patch tested?
Added tests.