RFC: Do not prune out unnecessary columns with unqualified references #619
| // | ||
| // Use BTreeSet to remove potential duplicates (e.g. union) as | ||
| // well as to sort the projection to ensure deterministic behavior | ||
| let mut projection: BTreeSet<usize> = required_columns |
When I did not use a BTreeSet one of the UNION ALL tests failed due to a duplicate column being projected.
Curious: when should we use a HashSet vs. a BTreeSet? EDIT: nvm, saw you removed the sort afterwards :P
Yeah - exactly - I used the BTreeSet as we needed the ids sorted anyways
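The BTreeSet point in this thread can be illustrated with a minimal sketch. The helper name `dedup_sorted` is hypothetical and is not part of DataFusion; it only shows that collecting indices into a `BTreeSet` drops duplicates and yields them in ascending order, so no separate sort step is needed.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not DataFusion code): collapses duplicate column
/// indices and returns them in ascending order.
pub fn dedup_sorted(indices: &[usize]) -> Vec<usize> {
    // A BTreeSet keeps its elements unique and ordered, so collecting into it
    // both removes duplicates and sorts in one step.
    let set: BTreeSet<usize> = indices.iter().copied().collect();
    set.into_iter().collect()
}
```

A `HashSet` would also remove the duplicates, but its iteration order is unspecified, so the projection would still need an explicit sort to be deterministic.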
| let mut projection: BTreeSet<usize> = required_columns | ||
| .iter() | ||
| - .filter(|c| c.relation.as_ref() == table_name) | ||
| + .filter(|c| c.relation.is_none() || c.relation.as_ref() == table_name) |
This is the core change: don't compare the relation qualifier if there is none. Otherwise, if c = Column { relation: None, name: "a" } and the table name is Some("foo"), the column will be filtered out, even if foo has a column named a.
good catch 👍
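The qualifier check discussed in this thread can be sketched in isolation. This is a simplified illustration, not the actual optimizer code: the `Column` struct below is a stand-in for DataFusion's type, and `projected_indices` and its `table_columns` parameter are hypothetical names introduced only for this example.

```rust
use std::collections::BTreeSet;

/// Simplified stand-in for DataFusion's `Column` (assumption, illustration only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Column {
    pub relation: Option<String>,
    pub name: String,
}

/// Hypothetical helper: returns the sorted, de-duplicated indices of
/// `table_columns` that are referenced by `required`.
pub fn projected_indices(
    required: &[Column],
    table_name: Option<&String>,
    table_columns: &[&str],
) -> BTreeSet<usize> {
    required
        .iter()
        // Only compare qualifiers when the reference actually has one;
        // unqualified references are kept and matched by name below.
        .filter(|c| c.relation.is_none() || c.relation.as_ref() == table_name)
        // Map each surviving reference to its position in the table, if any.
        .filter_map(|c| table_columns.iter().position(|name| *name == c.name))
        .collect()
}
```

With the old filter (`c.relation.as_ref() == table_name` alone), an unqualified reference to `a` would compare `None` against `Some("foo")` and be dropped, which is the silent pruning described in the PR.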
FYI @houqp
I do think this makes the projection pushdown logic more robust against manually constructed plan nodes.
In the majority of cases, I think users and DataFusion core should avoid manual plan construction and use the plan builder. It comes with validation and expression normalization logic to make sure the constructed plan nodes are consistent and correct. In the past, I have already fixed more than 3 bugs in our planner and optimizer by replacing manual plan construction code with the plan builder.
I also think there is value in adding extra test/debug-time validation code around the optimizers and planners that walks the full plan tree and makes sure everything checks out. This should help reduce debugging/troubleshooting time for developers.
For example, it's easy for users to accidentally create an invalid projection node when multiple relations are involved. Let's say both table1 and table2 contain the same column id; there is nothing in place today to warn users when a projection node is created that projects a single, ambiguous, unqualified id column. In other words, only columns whose names are unique across relations can be projected with unqualified references.
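The kind of validation suggested above could look roughly like the following sketch. It is not an existing DataFusion API: `resolve_unqualified` and the `(relation name, column names)` representation are assumptions made purely for illustration.

```rust
/// Hypothetical check (not a DataFusion API): an unqualified column name is
/// only safe to project when exactly one input relation exposes it.
pub fn resolve_unqualified<'a>(
    name: &str,
    relations: &[(&'a str, Vec<&str>)],
) -> Result<&'a str, String> {
    // Collect every relation that has a column with the requested name.
    let owners: Vec<&'a str> = relations
        .iter()
        .filter(|(_, cols)| cols.iter().any(|c| *c == name))
        .map(|(rel, _)| *rel)
        .collect();

    match owners.as_slice() {
        // Exactly one owner: the unqualified reference is unambiguous.
        [only] => Ok(*only),
        [] => Err(format!("column '{}' not found", name)),
        _ => Err(format!(
            "column '{}' is ambiguous: found in {}",
            name,
            owners.join(", ")
        )),
    }
}
```

Under these assumptions, a reference to `id` with both table1 and table2 in scope would return an error instead of silently resolving to one of them.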
jorgecarleitao
left a comment
Dandandan
left a comment
👍
Which issue does this PR close?
Closes #617 but I am still not sure if this is a bug or not (explained below)
Rationale for this change
As explained in #617, the projection pushdown operation removes columns from a scan when the LogicalPlan::Projection has unqualified columns (so an expression like a, rather than table.a).
In my case in IOx this was occurring due to an extension node, where my code was creating the exprs without qualifiers. When I updated the code to create the expressions with qualifiers, things were fine again:
      let exprs = input
          .schema()
          .fields()
          .iter()
    -     .map(|field| logical_plan::col(field.name()))
    +     .map(|field| Expr::Column(field.qualified_column()))
          .collect::<Vec<_>>();
Thus I am not sure if the projection pushdown code should actually handle this case, or if we should have some sort of error / warning instead. Having it silently produce the wrong answer (which is what happened in IOx) was hard to debug.
What changes are included in this PR?
Are there any user-facing changes?
No