RFC: Do not prune out unnecessary columns with unqualified references #619

alamb · 2021-06-25T15:42:20Z

Which issue does this PR close?

Closes #617 but I am still not sure if this is a bug or not (explained below)

Rationale for this change

As explained on #617 the projection pushdown operation removes columns from a scan when the LogicalPlan::Projection had unqualified columns (so an expression like a, rather than table.a).

In my case in IOx this was occuring due to an extension node, when my code was creating the exprs without qualifiers. When I updated to the code to create the expression with qualifiers things are fine again

         let exprs = input
             .schema()
             .fields()
             .iter()
-            .map(|field| logical_plan::col(field.name()))
+            .map(|field| Expr::Column(field.qualified_column()))
             .collect::<Vec<_>>();

Thus I am not sure if the projection pushdown code should actually handle this case, or if we should have some sort of error / warning instead. Having it silently produce the wrong answer (which is what happened in IOx) was hard to debug

What changes are included in this PR?

Only check relational qualifier if there a relational qualifier on the expression, otherwise just check field name
Ensure no duplicate field names by using BTreeSet

Are there any user-facing changes?

No

alamb · 2021-06-25T15:43:06Z

datafusion/src/optimizer/projection_push_down.rs

+    //
+    // Use BTreeSet to remove potential duplicates (e.g. union) as
+    // well as to sort the projection to ensure deterministic behavior
+    let mut projection: BTreeSet<usize> = required_columns


When I did not use a BTreeSet one of the UNION ALL tests failed due to a duplicate column being projected.

curious when should we use a HashSet v.s. BTreeSet? EDIT: nvm, saw you removed the sort afterwards :P

Yeah - exactly - I used the BTreeSet as we needed the ids sorted anyways

alamb · 2021-06-25T15:44:25Z

datafusion/src/optimizer/projection_push_down.rs

+    let mut projection: BTreeSet<usize> = required_columns
        .iter()
-        .filter(|c| c.relation.as_ref() == table_name)
+        .filter(|c| c.relation.is_none() || c.relation.as_ref() == table_name)


This is the core change -- don't compare the relation qualifier if there is none -- otherwise if c = Column { relation: None, name: "a"} and the table name is Some("foo") the column will be filtered, even if foo has a column named a

good catch 👍

alamb · 2021-06-25T15:44:36Z

FYI @houqp

houqp

I do think this makes the projection push down logic more robust against manually constructed plan nodes.

In majority of the cases, I think users and datafusion core should avoid manual plan constructions and use the plan builder. It comes with validations and expression normalization logic to make sure constructed plan nodes are consistent and correct. In the past, I have already fixed more than 3 bugs in our planner and optimizer by replacing manual plan construction code with the plan builder.

I also think there is value in adding extra test/debug time validation code around optimizers and planners to go through the full plan tree and make sure everything checks up. This should help reduce debug/troubleshooting time for developers.

For example, it's easy for users to accidentally create invalid projection node when multiple relations are involved. Let's say both table1 and table2 contain the same column id, there is nothing in place today to warn users when a projection node is created to project a single ambiguous unqualified id column. In other words, only columns with unique names across relations can be projected with unqualified references.

jorgecarleitao

LGTM. Thanks a lot, @alamb , great catch also; non-trivial to identify this.

Also +1 on the comments by @houqp

Dandandan

👍

Do not prune out unecessary columns with unqualified references

32f9019

github-actions bot added the datafusion label Jun 25, 2021

alamb commented Jun 25, 2021

View reviewed changes

alamb mentioned this pull request Jun 25, 2021

Projection pushdown removes unqualified column names even when they are used #617

Closed

houqp approved these changes Jun 25, 2021

View reviewed changes

jorgecarleitao approved these changes Jun 25, 2021

View reviewed changes

Dandandan approved these changes Jun 25, 2021

View reviewed changes

Dandandan merged commit 50b1120 into apache:master Jun 26, 2021

alamb deleted the alamb/fix_projection_pushdown branch June 27, 2021 10:51

houqp added the bug Something isn't working label Jul 31, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

RFC: Do not prune out unnecessary columns with unqualified references #619

RFC: Do not prune out unnecessary columns with unqualified references #619

Uh oh!

alamb commented Jun 25, 2021

Uh oh!

alamb Jun 25, 2021

Uh oh!

houqp Jun 25, 2021 •

edited

Loading

Uh oh!

alamb Jun 26, 2021 •

edited

Loading

Uh oh!

alamb Jun 25, 2021

Uh oh!

houqp Jun 25, 2021

Uh oh!

alamb commented Jun 25, 2021

Uh oh!

houqp left a comment •

edited

Loading

Uh oh!

jorgecarleitao left a comment

Uh oh!

Dandandan left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

RFC: Do not prune out unnecessary columns with unqualified references #619

RFC: Do not prune out unnecessary columns with unqualified references #619

Uh oh!

Conversation

alamb commented Jun 25, 2021

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Uh oh!

alamb Jun 25, 2021

Choose a reason for hiding this comment

Uh oh!

houqp Jun 25, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb Jun 26, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb Jun 25, 2021

Choose a reason for hiding this comment

Uh oh!

houqp Jun 25, 2021

Choose a reason for hiding this comment

Uh oh!

alamb commented Jun 25, 2021

Uh oh!

houqp left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

Dandandan left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

houqp Jun 25, 2021 •

edited

Loading

alamb Jun 26, 2021 •

edited

Loading

houqp left a comment •

edited

Loading