@Erigara
Contributor

@Erigara Erigara commented May 20, 2025

Closes #2028

Rationale for this change

Provide the expected result, aligned with the Spark implementation.

This PR fixes a bug where predicate evaluation for a column that is missing from the parquet file schema returns no results. This is due to the _ColumnNameTranslator visitor returning AlwaysFalse when the column cannot be found in the file schema. The solution is to pass in the projected field value for evaluation. This follows the order of operations described in https://iceberg.apache.org/spec/#column-projection
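To make the failure mode concrete, here is a minimal sketch in plain Python. It does not use PyIceberg: `resolve_before_fix`, `resolve_after_fix`, the string results, and the `year` partition column are all hypothetical stand-ins for what the visitor does with `AlwaysTrue`/`AlwaysFalse`, and initial-default handling is left out.

```python
from typing import Any, Callable, Dict, Set

ALWAYS_TRUE, ALWAYS_FALSE, READ_FILE = "AlwaysTrue", "AlwaysFalse", "evaluate against file"


def resolve_before_fix(column: str, file_columns: Set[str]) -> str:
    # Behaviour described above: a column absent from the file schema never matches.
    return READ_FILE if column in file_columns else ALWAYS_FALSE


def resolve_after_fix(
    column: str,
    predicate: Callable[[Any], bool],
    file_columns: Set[str],
    projected_values: Dict[str, Any],
) -> str:
    if column in file_columns:
        return READ_FILE
    if column in projected_values:
        # Evaluate the predicate once against the constant projected value.
        return ALWAYS_TRUE if predicate(projected_values[column]) else ALWAYS_FALSE
    # Sketch only: initial-default handling is left out here.
    return ALWAYS_FALSE


file_columns = {"id", "name"}  # hypothetical file schema without the partition column
projected = {"year": 2025}     # hypothetical partition value taken from metadata

print(resolve_before_fix("year", file_columns))                                 # AlwaysFalse
print(resolve_after_fix("year", lambda v: v == 2025, file_columns, projected))  # AlwaysTrue
```

In this toy model, a filter on `year` drops every row before the fix and keeps the file's rows after it, which matches the symptom reported in the issue.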

Are these changes tested?

I've checked it against the script attached to the issue, and a new test was added.

Yes, added some unit tests for _ColumnNameTranslator/translate_column_names
Added a test for predicate evaluation for projected columns.

Are there any user-facing changes?

Kinda yes, because the results of some scans are now different.

@Erigara Erigara requested a review from Fokko May 21, 2025 17:57
@Erigara Erigara force-pushed the projected_field_predicate branch 2 times, most recently from 5aa9940 to 3bb325f Compare June 12, 2025 15:26
@Fokko Fokko added this to the PyIceberg 0.10.0 milestone Jun 24, 2025
Contributor

@kevinjqliu kevinjqliu left a comment

Thanks for working on this @Erigara. I just reviewed #1644 as well to help move this forward.

I have a question about how we'd want to structure this feature in the codebase

@Erigara Erigara force-pushed the projected_field_predicate branch from 3bb325f to a93b4eb Compare July 1, 2025 08:45
@Erigara Erigara requested a review from kevinjqliu July 1, 2025 08:46
@kevinjqliu kevinjqliu removed this from the PyIceberg 0.10.0 milestone Jul 22, 2025
@kevinjqliu
Contributor

removed milestone tag since the referenced issue (#2028) is already tagged

@kevinjqliu
Contributor

@Erigara now that #1644 is merged, could you rebase this PR?

@kevinjqliu kevinjqliu closed this Jul 25, 2025
@kevinjqliu kevinjqliu reopened this Jul 25, 2025
@kevinjqliu
Contributor

i merged main and added the projection logic to _ColumnNameTranslator
I was able to use the code in #2028 to verify that the counts are now the same.

@kevinjqliu
Contributor

i just realized that i/we had concerns about the _ColumnNameTranslator approach in #2029 (comment)

hmmm

@kevinjqliu
Contributor

I opened #2254 to resolve this issue. Sorry for taking over this PR @Erigara. I want to move this along so we can cut a release for 0.10. Please take a look at the other PR if you get a chance.

@kevinjqliu kevinjqliu requested a review from Copilot July 27, 2025 16:41
Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes a bug in predicate evaluation for projected fields that are missing from the parquet file schema. Previously, when a column used in a predicate was not found in the file schema, the _ColumnNameTranslator would return AlwaysFalse, resulting in no matching rows. The fix implements proper column projection evaluation by passing projected field values to the translator and evaluating them against predicates before falling back to initial default values.

  • Enhanced _ColumnNameTranslator to accept and use projected field values for missing columns
  • Modified predicate evaluation order to check projected values first, then initial defaults
  • Updated the PyArrow integration to pass projected field values during row filter translation

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pyiceberg/expressions/visitors.py | Enhanced _ColumnNameTranslator class and translate_column_names function to handle projected field values |
| pyiceberg/io/pyarrow.py | Modified row filter translation to pass projected field values from column projection |
| tests/expressions/test_visitors.py | Added comprehensive test cases for translate_column_names with various projected field scenarios |
| tests/io/test_pyarrow.py | Added integration test for row filter with partition value projection |

@kevinjqliu
Contributor

ptal @sungwy @gabeiglio

@Erigara
Contributor Author

Erigara commented Jul 28, 2025

Like why you disappeared for weeks to then take over the PR, this doesn't look right to be honest.

@kevinjqliu
Contributor

> Like why you disappeared for weeks to then take over the PR, this doesn't look right to be honest.

hey @Erigara sorry about the chain of events. I started a new job recently and didn't have time to review this PR before.

I pushed to the PR because I wanted to get the 0.10 release out to the community. We started the discussion at the beginning of July, and multiple community members have asked about the release date. This was one of the last remaining bugs that I think would be good to include in the release. Hope you can understand that I'm trying to balance community goals and individual PR contributions.

Regarding the implementation using _ColumnNameTranslator: originally I had concerns about this. #1644 was then merged with initial-defaults projection in the same place, so it makes sense to now colocate the two.
I still think it's not the best place for the column projection logic. Happy to iterate on this together. I'm trying to make sure we hit all the corner cases so we can fix the original bug and release 0.10. We can come back to refactor this code.

@Erigara
Contributor Author

Erigara commented Jul 28, 2025

@kevinjqliu ok, you can ping me if there is anything left to update PR with, but looks good to go

@gabeiglio
Contributor

gabeiglio commented Jul 28, 2025

I agree that we should consider an alternative approach to avoid including these changes in _ColumnNameTranslator, especially since projection rules would otherwise be in different places in the codebase.

However, since this is an important bug fix that needs to be included in this release, I think it's fine to proceed as is IMO. But we should open an issue to refactor this code in the future.

Thanks for working on this @Erigara @kevinjqliu

Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Thanks @Erigara @kevinjqliu. Logically the changes make sense to me; I just had a small structural comment and some additional test cases that may be worth adding. I'd also recommend holding until @Fokko is back, to get his feedback on this as well.

# In the order described by the "Column Projection" section of the Iceberg spec:
# https://iceberg.apache.org/spec/#column-projection
# Evaluate column projection first if it exists
if projected_field_value := self.projected_field_values.get(field.name):
Contributor

Curious, what's the rationale for not including default value handling inside the logic that produces the projected_field_values? Based on the comment, the intent of _get_column_projection_values seems to be to apply all the projection rules, but it looks like we apply most of them there and then fall through to applying the initial default here on line 928. It may be better if all of that logic is self-contained in the function, so that if things move around you don't have a separate place where default values are propagated.

Contributor

How about condensing this into:

# In the order described by the "Column Projection" section of the Iceberg spec:
# https://iceberg.apache.org/spec/#column-projection
# Evaluate column projection first if it exists, otherwise default to the initial-default-value
field_value = (
    self.projected_field_values[field.name] if field.name in self.projected_field_values else field.initial_default
)
return (
    AlwaysTrue()
    if expression_evaluator(Schema(field), pred, case_sensitive=self.case_sensitive)(Record(field_value))
    else AlwaysFalse()
)
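One practical difference between the two forms, sketched with a hypothetical `projected` dict and a hypothetical `bucket` field: combining `dict.get` with a truthiness check skips projected values that happen to be falsy (`0`, `""`, `False`), while the membership test used in the condensed version does not.

```python
from typing import Any, Dict

projected: Dict[str, Any] = {"bucket": 0}  # hypothetical projected value that happens to be falsy
initial_default = 99                       # hypothetical initial default for the same field

# Truthiness check: 0 is falsy, so the projected value is ignored.
if value := projected.get("bucket"):
    via_get = value
else:
    via_get = initial_default

# Membership check: the key exists, so the projected value is used.
via_in = projected["bucket"] if "bucket" in projected else initial_default

print(via_get, via_in)  # 99 0
```

Whether a falsy projected value can occur in practice depends on the partition transform, so treat this as an illustration of the Python semantics rather than a confirmed bug report.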


def __init__(self, file_schema: Schema, case_sensitive: bool) -> None:
def __init__(
    self, file_schema: Schema, case_sensitive: bool, projected_field_values: Optional[Dict[str, Any]] = None
Contributor

Not sure if it's idiomatic Python, so up to you @kevinjqliu @Fokko, but is it possible to just make the default value here an empty dictionary, and then self.projected_field_values = projected_field_values?

Contributor

the current way is preferred over

projected_field_values: Dict[str, Any] = {}

The current way avoids using mutable default (projected_field_values={}), which is considered bad practice because it can lead to unexpected shared state across multiple calls or instances.
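A short sketch of the shared-state problem being described. `SharedDefault` and `NoneDefault` are hypothetical classes written only for this illustration; they are not part of PyIceberg.

```python
from typing import Any, Dict, Optional


class SharedDefault:
    def __init__(self, values: Dict[str, Any] = {}) -> None:  # one dict shared by every call
        self.values = values


class NoneDefault:
    def __init__(self, values: Optional[Dict[str, Any]] = None) -> None:
        self.values = values or {}  # fresh dict per instance when nothing is passed


a, b = SharedDefault(), SharedDefault()
a.values["x"] = 1
print(b.values)  # {'x': 1}

c, d = NoneDefault(), NoneDefault()
c.values["x"] = 1
print(d.values)  # {}
```

The default `{}` is created once, when the function is defined, so `a` and `b` end up holding the same object.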

Contributor

@Fokko Fokko Aug 4, 2025

This is where EMPTY_DICT (a FrozenDict) comes in:

class FrozenDict(Dict[Any, Any]):
    def __setitem__(self, instance: Any, value: Any) -> None:
        """Assign a value to a FrozenDict."""
        raise AttributeError("FrozenDict does not support assignment")

    def update(self, *args: Any, **kwargs: Any) -> None:
        raise AttributeError("FrozenDict does not support .update()")


UTF8 = "utf-8"

EMPTY_DICT = FrozenDict()

This avoids null-checking :)

Suggested change
self, file_schema: Schema, case_sensitive: bool, projected_field_values: Optional[Dict[str, Any]] = None
self, file_schema: Schema, case_sensitive: bool, projected_field_values: Dict[str, Any] = EMPTY_DICT

Contributor

applied! just to double check, we still want to use the

        self.projected_field_values = projected_field_values or {}

right?

https://github.com/apache/iceberg-python/pull/2029/files#diff-aa0e94ae2c31c3e2e7cfb1a4cff7d83422a47dcf8715709febcd5ff4aa662908R877-R881
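The exchange above can be sketched as follows. `FrozenDict` and `EMPTY_DICT` are copied from the quoted snippet so the example runs on its own; `lookup` is a hypothetical function, not the actual `_ColumnNameTranslator` constructor.

```python
from typing import Any, Dict


class FrozenDict(Dict[Any, Any]):  # copied from the snippet quoted in this thread
    def __setitem__(self, instance: Any, value: Any) -> None:
        raise AttributeError("FrozenDict does not support assignment")

    def update(self, *args: Any, **kwargs: Any) -> None:
        raise AttributeError("FrozenDict does not support .update()")


EMPTY_DICT = FrozenDict()


def lookup(name: str, projected_field_values: Dict[str, Any] = EMPTY_DICT) -> Any:
    # No None check needed: the default is always a dict.
    return projected_field_values.get(name)


print(lookup("year"))                  # None
print(lookup("year", {"year": 2025}))  # 2025

try:
    EMPTY_DICT["year"] = 2025  # item assignment on the shared default is rejected
except AttributeError as exc:
    print(exc)  # FrozenDict does not support assignment

# An empty FrozenDict is falsy, so `or {}` swaps it for a new plain dict.
print(type(EMPTY_DICT or {}).__name__)  # dict
```

With `EMPTY_DICT` as the default, keeping `projected_field_values or {}` appears to be harmless but no longer required for null safety; it only replaces the frozen default with a fresh mutable dict.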

assert expression_evaluator(schema, NotStartsWith("a", 1), case_sensitive=True)(struct) is True


def test_translate_column_names_simple_case(table_schema_simple: Schema) -> None:
Contributor

Thanks for all the test cases, I've stepped through them and they cover the cases I'd expect.
Some other additions that may be worth it:

1.) Disjunctive/conjunctive cases (Or, And, etc.) where one field is missing from the file and one field is not. Maybe mix this in with the rename case where the file on disk has the field but with a different name (I see that's already tested in the single-predicate case, but just to sanity check combined cases)

2.) Maybe a nested field case though it's really no different

Down the line, when, say, Spark supports the DDL for default values, we can have end-to-end verification tests as well
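For the combined cases in 1.), the expected outcomes can be reasoned about with a toy constant-folding model. This is a sketch in plain Python, assuming And/Or simplify against constant operands in the usual boolean-algebra way; `and_`, `or_`, and the tuple-based predicate are hypothetical and not PyIceberg classes.

```python
from typing import Tuple, Union

ALWAYS_TRUE, ALWAYS_FALSE = "AlwaysTrue", "AlwaysFalse"
Expr = Union[str, Tuple]


def and_(left: Expr, right: Expr) -> Expr:
    if left == ALWAYS_FALSE or right == ALWAYS_FALSE:
        return ALWAYS_FALSE
    if left == ALWAYS_TRUE:
        return right
    if right == ALWAYS_TRUE:
        return left
    return ("and", left, right)


def or_(left: Expr, right: Expr) -> Expr:
    if left == ALWAYS_TRUE or right == ALWAYS_TRUE:
        return ALWAYS_TRUE
    if left == ALWAYS_FALSE:
        return right
    if right == ALWAYS_FALSE:
        return left
    return ("or", left, right)


# Predicate on a renamed column, already translated to the name used in the file.
renamed = ("eq", "name_in_file", 1)

# The other operand stands for a predicate on a missing column, resolved to a constant.
print(and_(ALWAYS_TRUE, renamed))   # ('eq', 'name_in_file', 1)
print(and_(ALWAYS_FALSE, renamed))  # AlwaysFalse
print(or_(ALWAYS_TRUE, renamed))    # AlwaysTrue
print(or_(ALWAYS_FALSE, renamed))   # ('eq', 'name_in_file', 1)
```

These four combinations are the ones a mixed missing/renamed test would need to cover.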

Contributor

@Fokko Fokko left a comment

Thanks @Erigara for your patience here, left some style suggestions, but I think this is good to go 👍

# In the order described by the "Column Projection" section of the Iceberg spec:
# https://iceberg.apache.org/spec/#column-projection
# Evaluate column projection first if it exists
if projected_field_value := self.projected_field_values.get(field.name):
Contributor

How about condensing this into:

# In the order described by the "Column Projection" section of the Iceberg spec:
# https://iceberg.apache.org/spec/#column-projection
# Evaluate column projection first if it exists, otherwise default to the initial-default-value
field_value = (
    self.projected_field_values[field.name] if field.name in self.projected_field_values else field.initial_default
)
return (
    AlwaysTrue()
    if expression_evaluator(Schema(field), pred, case_sensitive=self.case_sensitive)(Record(field_value))
    else AlwaysFalse()
)

@kevinjqliu
Contributor

kevinjqliu commented Aug 4, 2025

@Fokko i applied this refactor locally and test_translate_column_names_missing_column_projected_field_fallbacks_to_initial_default fails

I think we can remove this test since i cannot think of a scenario where the "projected field value doesn't match but initial_default does". If this happens, then the table is most likely corrupted. Right?

@Fokko
Contributor

Fokko commented Aug 5, 2025

Good one, after some thought, I think the test is incorrect. When we pass in the projected_field_values the field should not be considered missing anymore. If you import a table from Hive, where the partition field is not part of the DataFile, then we pass it in through the projected_field_values but it isn't considered missing.
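The disagreement with the failing test can be illustrated with two toy functions. This is a sketch only: `fall_through` is an assumed reconstruction of the pre-refactor behaviour, inferred from the test name, and `first_source_wins` mirrors the shape of the condensed suggestion. Neither is the actual PyIceberg code, and the `part` field is hypothetical.

```python
from typing import Any, Callable, Dict


def fall_through(
    pred: Callable[[Any], bool], name: str, projected: Dict[str, Any], initial_default: Any
) -> bool:
    # Assumed pre-refactor shape: try the projected value, then fall back to the default.
    if name in projected and pred(projected[name]):
        return True
    return pred(initial_default)


def first_source_wins(
    pred: Callable[[Any], bool], name: str, projected: Dict[str, Any], initial_default: Any
) -> bool:
    # Condensed shape: pick exactly one value, then evaluate the predicate once.
    value = projected[name] if name in projected else initial_default
    return pred(value)


is_one = lambda v: v == 1

# Projected value is 2, initial default is 1, predicate asks for 1.
print(fall_through(is_one, "part", {"part": 2}, 1))       # True
print(first_source_wins(is_one, "part", {"part": 2}, 1))  # False
```

Under the reading given above, once a projected value is supplied the field is not missing, so the second result is the intended one.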

@Fokko
Contributor

Fokko commented Aug 5, 2025

Let me try to add a test in a separate PR, I'm comfortable merging this in

@kevinjqliu kevinjqliu merged commit cd7d8c7 into apache:main Aug 5, 2025
10 checks passed
@kevinjqliu
Contributor

thanks for the PR @Erigara and thank you @amogh-jahagirdar @Fokko for the review!

gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025
Closes apache#2028 


---------

Co-authored-by: Roman Shanin <[email protected]>
Co-authored-by: Kevin Liu <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>

Development

Successfully merging this pull request may close these issues.

Scan with filtering on projected field rerurn empty table

5 participants