Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Jun 25, 2021

Which issue does this PR close?

Closes #617 but I am still not sure if this is a bug or not (explained below)

Rationale for this change

As explained on #617 the projection pushdown operation removes columns from a scan when the LogicalPlan::Projection had unqualified columns (so an expression like a, rather than table.a).

In my case in IOx this was occuring due to an extension node, when my code was creating the exprs without qualifiers. When I updated to the code to create the expression with qualifiers things are fine again

         let exprs = input
             .schema()
             .fields()
             .iter()
-            .map(|field| logical_plan::col(field.name()))
+            .map(|field| Expr::Column(field.qualified_column()))
             .collect::<Vec<_>>();

Thus I am not sure if the projection pushdown code should actually handle this case, or if we should have some sort of error / warning instead. Having it silently produce the wrong answer (which is what happened in IOx) was hard to debug

What changes are included in this PR?

  1. Only check relational qualifier if there a relational qualifier on the expression, otherwise just check field name
  2. Ensure no duplicate field names by using BTreeSet

Are there any user-facing changes?

No

//
// Use BTreeSet to remove potential duplicates (e.g. union) as
// well as to sort the projection to ensure deterministic behavior
let mut projection: BTreeSet<usize> = required_columns
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I did not use a BTreeSet one of the UNION ALL tests failed due to a duplicate column being projected.

Copy link
Member

@houqp houqp Jun 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious when should we use a HashSet v.s. BTreeSet? EDIT: nvm, saw you removed the sort afterwards :P

Copy link
Contributor Author

@alamb alamb Jun 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - exactly - I used the BTreeSet as we needed the ids sorted anyways

let mut projection: BTreeSet<usize> = required_columns
.iter()
.filter(|c| c.relation.as_ref() == table_name)
.filter(|c| c.relation.is_none() || c.relation.as_ref() == table_name)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core change -- don't compare the relation qualifier if there is none -- otherwise if c = Column { relation: None, name: "a"} and the table name is Some("foo") the column will be filtered, even if foo has a column named a

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch 👍

@alamb
Copy link
Contributor Author

alamb commented Jun 25, 2021

FYI @houqp

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think this makes the projection push down logic more robust against manually constructed plan nodes.

In majority of the cases, I think users and datafusion core should avoid manual plan constructions and use the plan builder. It comes with validations and expression normalization logic to make sure constructed plan nodes are consistent and correct. In the past, I have already fixed more than 3 bugs in our planner and optimizer by replacing manual plan construction code with the plan builder.

I also think there is value in adding extra test/debug time validation code around optimizers and planners to go through the full plan tree and make sure everything checks up. This should help reduce debug/troubleshooting time for developers.

For example, it's easy for users to accidentally create invalid projection node when multiple relations are involved. Let's say both table1 and table2 contain the same column id, there is nothing in place today to warn users when a projection node is created to project a single ambiguous unqualified id column. In other words, only columns with unique names across relations can be projected with unqualified references.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks a lot, @alamb , great catch also; non-trivial to identify this.

Also +1 on the comments by @houqp

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Dandandan Dandandan merged commit 50b1120 into apache:master Jun 26, 2021
@alamb alamb deleted the alamb/fix_projection_pushdown branch June 27, 2021 10:51
@houqp houqp added the bug Something isn't working label Jul 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Projection pushdown removes unqualified column names even when they are used

4 participants