Fix projected fields predicate evaluation #2029
Conversation
Force-pushed from 5aa9940 to 3bb325f
kevinjqliu left a comment
Force-pushed from 3bb325f to a93b4eb
Removed the milestone tag since the referenced issue (#2028) is already tagged.

I merged main and added the projection logic to

I just realized that I/we had concerns about the hmmm
Pull Request Overview
This PR fixes a bug in predicate evaluation for projected fields that are missing from the parquet file schema. Previously, when a column used in a predicate was not found in the file schema, the _ColumnNameTranslator would return AlwaysFalse, resulting in no matching rows. The fix implements proper column projection evaluation by passing projected field values to the translator and evaluating them against predicates before falling back to initial default values.
- Enhanced _ColumnNameTranslator to accept and use projected field values for missing columns
- Modified the predicate evaluation order to check projected values first, then initial defaults (see the sketch below)
- Updated the PyArrow integration to pass projected field values during row filter translation
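A minimal, self-contained sketch of that evaluation order, assuming the names discussed in this PR (the helper evaluate_missing_column and the example values are illustrative, not the PR's actual code):

# Sketch of how a predicate on a column absent from the file schema can be
# resolved: use the projected field value when one exists, otherwise the
# field's initial-default, then evaluate the predicate against that value.
from typing import Any, Dict

from pyiceberg.expressions import AlwaysFalse, AlwaysTrue, BooleanExpression, EqualTo
from pyiceberg.expressions.visitors import expression_evaluator
from pyiceberg.schema import Schema
from pyiceberg.typedef import Record
from pyiceberg.types import NestedField, StringType


def evaluate_missing_column(
    field: NestedField,
    pred: BooleanExpression,
    projected_field_values: Dict[str, Any],
    case_sensitive: bool = True,
) -> BooleanExpression:
    # Order from https://iceberg.apache.org/spec/#column-projection:
    # projected value first, then the initial-default value.
    field_value = projected_field_values.get(field.name, field.initial_default)
    matches = expression_evaluator(Schema(field), pred, case_sensitive=case_sensitive)(Record(field_value))
    return AlwaysTrue() if matches else AlwaysFalse()


# Example: the file lacks "part", but the projected (partition) value matches the predicate.
field = NestedField(field_id=100, name="part", field_type=StringType(), required=False)
print(evaluate_missing_column(field, EqualTo("part", "2024-01-01"), {"part": "2024-01-01"}))  # AlwaysTrue()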
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pyiceberg/expressions/visitors.py | Enhanced _ColumnNameTranslator class and translate_column_names function to handle projected field values |
| pyiceberg/io/pyarrow.py | Modified row filter translation to pass projected field values from column projection |
| tests/expressions/test_visitors.py | Added comprehensive test cases for translate_column_names with various projected field scenarios |
| tests/io/test_pyarrow.py | Added integration test for row filter with partition value projection |
ptal @sungwy @gabeiglio

Like why you disappeared for weeks to then take over the PR, this doesn't look right to be honest.

hey @Erigara sorry about the chain of events. I started a new job recently and didn't have time to review this PR before. I pushed to the PR because I wanted to get the 0.10 release out to the community. We started the discussion in the beginning of July and multiple community members have asked about the release date. This was one of the last remaining bugs that I think would be good to include in the release. Hope you can understand that I'm trying to balance community goals and individual PR contributions. Regarding the implementation using

@kevinjqliu ok, you can ping me if there is anything left to update the PR with, but it looks good to go

I agree that we should consider an alternative approach to avoid including these changes in. However, since this is an important bug fix that needs to be included in this release, I think it's fine to proceed as is, IMO. But we should open an issue to refactor this code in the future. Thanks for working on this @Erigara @kevinjqliu
Thanks @Erigara @kevinjqliu. Logically the changes make sense to me, just had a small structural comment and some possibly additional test cases worth adding. I'd also recommend holding until @Fokko is back, to get his feedback on this as well.
# In the order described by the "Column Projection" section of the Iceberg spec:
# https://iceberg.apache.org/spec/#column-projection
# Evaluate column projection first if it exists
if projected_field_value := self.projected_field_values.get(field.name):
Curious, what's the rationale for not including default value handling inside the logic that produces the projected_field_values? Seems like the intent of _get_column_projection_values is to apply all the projection rules based on the comment, but it looks like we apply most of them and then here we fall through to applying the initial default on line 928. May be better if all of that logic is self-contained in the function, so that in case things move around you don't have a separate place where default values are propagated.
How about condensing this into:
# In the order described by the "Column Projection" section of the Iceberg spec:
# https://iceberg.apache.org/spec/#column-projection
# Evaluate column projection first if it exists, otherwise default to the initial-default-value
field_value = (
    self.projected_field_values[field.name] if field.name in self.projected_field_values else field.initial_default
)
return (
    AlwaysTrue()
    if expression_evaluator(Schema(field), pred, case_sensitive=self.case_sensitive)(Record(field_value))
    else AlwaysFalse()
)
pyiceberg/expressions/visitors.py (Outdated)
- def __init__(self, file_schema: Schema, case_sensitive: bool) -> None:
+ def __init__(
+     self, file_schema: Schema, case_sensitive: bool, projected_field_values: Optional[Dict[str, Any]] = None
Not sure if it's idiomatic Python, so up to you @kevinjqliu @Fokko, but is it possible to just make the default value here an empty dictionary, and then self.projected_field_values = self.projected_field_values
The current way is preferred over:
projected_field_values: Dict[str, Any] = {}
The current way avoids using a mutable default (projected_field_values={}), which is considered bad practice because it can lead to unexpected shared state across multiple calls or instances.
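For illustration only (not code from this PR), this is the classic mutable-default pitfall the comment refers to:

# The default dict is created once, at function definition time, so every call
# that omits the argument mutates the same shared object.
def register(name, seen={}):  # anti-pattern
    seen[name] = True
    return seen

print(register("a"))  # {'a': True}
print(register("b"))  # {'a': True, 'b': True} -- state leaked from the previous call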
This is where the FrozenDict comes in:
iceberg-python/pyiceberg/typedef.py, lines 47 to 58 (at 14ee8da):
class FrozenDict(Dict[Any, Any]):
    def __setitem__(self, instance: Any, value: Any) -> None:
        """Assign a value to a FrozenDict."""
        raise AttributeError("FrozenDict does not support assignment")

    def update(self, *args: Any, **kwargs: Any) -> None:
        raise AttributeError("FrozenDict does not support .update()")


UTF8 = "utf-8"

EMPTY_DICT = FrozenDict()
This avoids null-checking :)
Suggested change:
- self, file_schema: Schema, case_sensitive: bool, projected_field_values: Optional[Dict[str, Any]] = None
+ self, file_schema: Schema, case_sensitive: bool, projected_field_values: Dict[str, Any] = EMPTY_DICT
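A hedged sketch of how that default behaves; the _Translator class below is a hypothetical stand-in for _ColumnNameTranslator, not the PR's actual code:

# Because EMPTY_DICT is a FrozenDict, the shared default cannot be mutated by
# accident, so no None-check (or `or {}`) is needed in __init__.
from typing import Any, Dict

from pyiceberg.typedef import EMPTY_DICT


class _Translator:  # hypothetical stand-in for _ColumnNameTranslator
    def __init__(self, projected_field_values: Dict[str, Any] = EMPTY_DICT) -> None:
        self.projected_field_values = projected_field_values


t = _Translator()
print(t.projected_field_values.get("missing_col"))  # None; lookups work as on a normal dict
# t.projected_field_values["x"] = 1  # would raise AttributeError: FrozenDict does not support assignment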
Applied! Just to double-check, we still want to use
self.projected_field_values = projected_field_values or {}
right?
assert expression_evaluator(schema, NotStartsWith("a", 1), case_sensitive=True)(struct) is True


def test_translate_column_names_simple_case(table_schema_simple: Schema) -> None:
Thanks for all the test cases, I've stepped through them and they cover the cases I'd expect.
Some other additions that may be worth it:
1.) Disjunctive/conjunctive cases (Or, And, etc.) where one field is missing from the file and one field is not. Maybe mix this in with the rename case where the file on disk has the field but with a different name (I see that's already tested in the single-predicate case, but just to sanity-check combined cases); a sketch of such a case follows this comment.
2.) Maybe a nested field case, though it's really no different.
Down the line, when, say, Spark can support the DDL for default values, then we can have end-to-end verification tests as well.
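One possible shape for that combined test, as a sketch only: the projected_field_values keyword on translate_column_names follows this PR's discussion (its exact name and position may differ), and the schemas and field names are made up for illustration:

# Or-predicate where one field is missing from the file and the other is
# present under a different (renamed) name in the file schema.
from pyiceberg.expressions import AlwaysFalse, EqualTo, Or
from pyiceberg.expressions.visitors import bind, translate_column_names
from pyiceberg.schema import Schema
from pyiceberg.types import IntegerType, NestedField, StringType

table_schema = Schema(
    NestedField(field_id=1, name="renamed", field_type=StringType(), required=False),
    NestedField(field_id=2, name="missing", field_type=IntegerType(), required=False),  # not written to the file
)
# Field 1 exists in the file under its old name; field 2 is absent entirely.
file_schema = Schema(NestedField(field_id=1, name="old_name", field_type=StringType(), required=False))

expr = bind(table_schema, Or(EqualTo("renamed", "x"), EqualTo("missing", 42)), case_sensitive=True)
translated = translate_column_names(
    expr, file_schema, case_sensitive=True, projected_field_values={"missing": 42}
)

# Before the fix the missing-column branch collapsed to AlwaysFalse; with the
# projected value 42 it now matches, so the translated Or must not be AlwaysFalse.
assert translated != AlwaysFalse()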
Fokko left a comment
Thanks @Erigara for your patience here, left some style suggestions, but I think this is good to go 👍
# In the order described by the "Column Projection" section of the Iceberg spec:
# https://iceberg.apache.org/spec/#column-projection
# Evaluate column projection first if it exists
if projected_field_value := self.projected_field_values.get(field.name):
How about condensing this into:
# In the order described by the "Column Projection" section of the Iceberg spec:
# https://iceberg.apache.org/spec/#column-projection
# Evaluate column projection first if it exists, otherwise default to the initial-default-value
field_value = (
    self.projected_field_values[field.name] if field.name in self.projected_field_values else field.initial_default
)
return (
    AlwaysTrue()
    if expression_evaluator(Schema(field), pred, case_sensitive=self.case_sensitive)(Record(field_value))
    else AlwaysFalse()
)

Co-authored-by: Fokko Driesprong <[email protected]>
@Fokko I applied this refactor locally and I think we can remove this test, since I cannot think of a scenario where the "projected field value doesn't match but initial_default does". If this happens, then the table is most likely corrupted. Right?

Good one, after some thought, I think the test is incorrect. When we pass in the

Let me try to add a test in a separate PR, I'm comfortable merging this in

Thanks for the PR @Erigara and thank you @amogh-jahagirdar @Fokko for the review!
Closes apache#2028
# Rationale for this change
Provide expected results aligned with the `spark` implementation.
This PR fixes a bug where predicate evaluation for a column that is missing from the parquet file schema returns no results. This is due to the `_ColumnNameTranslator` visitor returning `AlwaysFalse` when the column cannot be found in the file schema. The solution is to pass in the projected field value for evaluation. This follows the order of operations described in https://iceberg.apache.org/spec/#column-projection
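A hypothetical illustration of the user-visible effect (catalog, table, and column names below are made up, not taken from this PR):

# Filtering on a partition column whose values are not stored in the parquet
# data files previously returned zero rows; with this fix the projected
# partition value is evaluated, so matching rows come back.
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

catalog = load_catalog("default")
table = catalog.load_table("db.events")

# "event_date" is a partition column absent from the data files themselves.
rows = table.scan(row_filter=EqualTo("event_date", "2024-01-01")).to_arrow()
print(rows.num_rows)  # previously 0, now the actual number of matching rows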
# Are these changes tested?
I've checked it with the script attached to the issue, and a new test was added.
Yes, added some unit tests for `_ColumnNameTranslator`/`translate_column_names`.
Added a test for predicate evaluation for projected columns.
# Are there any user-facing changes?
Kinda yes, because the results of some scans are now different.
---------
Co-authored-by: Roman Shanin <[email protected]>
Co-authored-by: Kevin Liu <[email protected]>
Co-authored-by: Kevin Liu <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>