Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Apr 19, 2023

What changes were proposed in this pull request?

Update ListQuery to only store the number of columns of the original plan, instead of directly storing the original plan output attributes.

Why are the changes needed?

Storing the plan output attributes is troublesome as we have to maintain them and keep them in sync with the plan. For example, DeduplicateRelations may change the plan output, and today we do not update ListQuery.childOutputs to keep sync.

ListQuery.childOutputs was added by #18968 . It's only used to track the original plan output attributes as subquery de-correlation may add more columns. We can do the same thing by storing the number of columns of the plan.

Does this PR introduce any user-facing change?

No, there is no user-facing bug exposed.

How was this patch tested?

a new plan test

@github-actions github-actions bot added the SQL label Apr 19, 2023
val listQuery2 = ListQuery(testRelation2.select($"b"))
val plan = testRelation3.where($"f".in(listQuery1) && $"f".in(listQuery2)).analyze
val resolvedCondition = plan.expressions.head
val finalPlan = testRelation2.join(testRelation3).where(resolvedCondition).analyze
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test uses the resolved ListQuery to build a new plan and resolve it, to trigger the bug. Otherwise the bug is hidden because DeduplicateRelations runs before ResolveSubqueries, and the plan output of ListQuery won't be changed again.

@cloud-fan
Copy link
Contributor Author

cc @viirya

@viirya
Copy link
Member

viirya commented Apr 20, 2023

Looks good to me.

@cloud-fan
Copy link
Contributor Author

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in 9e17731 Apr 20, 2023
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Apr 21, 2023
…ild output

### What changes were proposed in this pull request?

Update `ListQuery` to only store the number of columns of the original plan, instead of directly storing the original plan output attributes.

### Why are the changes needed?

Storing the plan output attributes is troublesome as we have to maintain them and keep them in sync with the plan. For example, `DeduplicateRelations` may change the plan output, and today we do not update `ListQuery.childOutputs` to keep sync.

`ListQuery.childOutputs` was added by apache#18968 . It's only used to track the original plan output attributes as subquery de-correlation may add more columns. We can do the same thing by storing the number of columns of the plan.

### Does this PR introduce _any_ user-facing change?

No, there is no user-facing bug exposed.

### How was this patch tested?

a new plan test

Closes apache#40851 from cloud-fan/list_query.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@PengleiShi
Copy link

PengleiShi commented Nov 6, 2024

@cloud-fan it seems that this patch resolved issue https://issues.apache.org/jira/browse/SPARK-41191?. In SPARK-41191, nested cache table does not work because of different attribute exprId @mcdull-zhang https://github.com/apache/spark/pull/38703/files#r1027906540

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants