Contributor

allisonwang-db commented Apr 27, 2022

What changes were proposed in this pull request?

Backport #36216 to branch-3.1.

Why are the changes needed?

To fix a bug in SchemaPruning.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

… that do not belong to the current relation

### What changes were proposed in this pull request?

This PR updates `ProjectionOverSchema` to use the outputs of the data source relation to filter the attributes in the nested schema pruning. This is needed because the attributes in the schema do not necessarily belong to the current data source relation. For example, if a filter contains a correlated subquery, then the subquery's children can contain attributes from both the inner query and the outer query. Since the `RewriteSubquery` batch happens after early scan pushdown rules, nested schema pruning can wrongly use the inner query's attributes to prune the outer query data schema, thus causing wrong results and unexpected exceptions.
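
The idea of filtering by the relation's outputs can be illustrated with a small standalone sketch. This is a toy model under stated assumptions, not the actual patch: `Attr`, `Field`, `PruningSketch`, and `requestedRoots` are hypothetical names that do not exist in Spark, and the only behavior modeled is that an attribute is matched to a relation by its expression id rather than by its name.

```scala
// Toy model only: these case classes are hypothetical stand-ins,
// not Spark's Catalyst classes.
final case class Attr(name: String, exprId: Long)
final case class Field(name: String, children: Seq[Field] = Nil)

object PruningSketch {
  // Keep a requested nested field only when the attribute it is accessed
  // through is produced by the relation being pruned (matched by exprId).
  def requestedRoots(
      accesses: Seq[(Attr, Field)],
      relationOutput: Seq[Attr]): Seq[Field] = {
    val outputIds = relationOutput.map(_.exprId).toSet
    accesses.collect {
      case (attr, field) if outputIds.contains(attr.exprId) => field
    }
  }

  def main(args: Array[String]): Unit = {
    // The outer and inner relations both expose a column called "name";
    // only the expression id tells them apart.
    val outerName = Attr("name", 1L)
    val innerName = Attr("name", 2L)

    // Nested field accesses collected from a filter that contains a
    // correlated subquery: one from the outer query, one from the inner.
    val accesses = Seq(
      outerName -> Field("name", Seq(Field("first"))),
      innerName -> Field("name", Seq(Field("last")))
    )

    // Pruning the outer relation should only see the outer access.
    val forOuter = requestedRoots(accesses, relationOutput = Seq(outerName))
    println(forOuter)
  }
}
```

Without the `outputIds` check, the inner query's `name.last` access would also be treated as a requirement on the outer relation's schema, which is the kind of mix-up the description above attributes to the bug.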

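### Why are the changes needed?

To fix a bug in `SchemaPruning`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test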

Closes apache#36216 from allisonwang-db/spark-38918-nested-column-pruning.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 150434b)
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 793ba60)
Signed-off-by: allisonwang-db <[email protected]>
github-actions bot added the SQL label Apr 27, 2022

Member

viirya commented Apr 28, 2022

Seems related test failure?

checkScan(query,
"struct<name:struct<first:string,middle:string,last:string>," +
"employer:struct<id:int,company:struct<name:string,address:string>>>",
"struct<name:struct<first:string>,employer:struct<company:struct<name:string>>>",
Member

Hmm, pruned schema is different in 3.1?

Contributor Author

Yes, it looks like starting from 3.2 it fails to prune some nested fields (for example, in this case `name.middle` and `name.last` are not used).

Member

Hmm, okay, but I think it is unrelated to this change.

Member

viirya commented Apr 28, 2022

Thanks. Merging to branch-3.1.

viirya pushed a commit that referenced this pull request Apr 28, 2022
…butes that do not belong to the current relation

### What changes were proposed in this pull request?

Backport #36216 to branch-3.1.

### Why are the changes needed?

To fix a bug in `SchemaPruning`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #36387 from allisonwang-db/spark-38918-branch-3.1.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
viirya closed this Apr 28, 2022