Fix projected fields predicate evaluation #2029

Erigara · 2025-05-20T12:15:24Z

Closes #2028

Rationale for this change

Provide the expected result, aligned with the Spark implementation.

This PR fixes a bug where predicate evaluation for a column that is missing from the parquet file schema returns no results. This is due to the _ColumnNameTranslator visitor returning AlwaysFalse when the column cannot be found in the file schema. The solution is to pass in the projected field value for evaluation. This follows the order of operations described in https://iceberg.apache.org/spec/#column-projection
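
To make the resolution order concrete, here is a minimal, self-contained sketch. It does not use PyIceberg: `resolve_equality`, `ALWAYS_TRUE`, and `ALWAYS_FALSE` are hypothetical stand-ins, and only equality predicates are modeled. It assumes the order described above: when the column is absent from the file, the projected value is used if one exists, otherwise the field's initial default.

```python
from typing import Any, Dict, Optional, Set

ALWAYS_TRUE = "AlwaysTrue"    # hypothetical stand-in for a constant-true expression
ALWAYS_FALSE = "AlwaysFalse"  # hypothetical stand-in for a constant-false expression


def resolve_equality(
    column: str,
    literal: Any,
    file_columns: Set[str],
    projected_values: Dict[str, Any],
    initial_default: Optional[Any] = None,
) -> str:
    """Toy model of translating `column == literal` for a single data file."""
    if column in file_columns:
        # Column is physically present: keep the predicate for row-level filtering.
        return f"{column} == {literal!r}"
    # Column is missing from the file, so every row shares one constant value
    # and the predicate collapses to a constant.
    if column in projected_values:
        value = projected_values[column]
    else:
        value = initial_default
    return ALWAYS_TRUE if value == literal else ALWAYS_FALSE


# "part" is not stored in the file but is projected as 1 (e.g. a partition value).
print(resolve_equality("part", 1, {"id"}, {"part": 1}))  # AlwaysTrue
print(resolve_equality("part", 2, {"id"}, {"part": 1}))  # AlwaysFalse
print(resolve_equality("id", 5, {"id"}, {"part": 1}))    # id == 5
```

In this model, returning a constant-false result for every missing column would drop all rows of a file whose projected value actually satisfies the filter, which is the symptom reported in #2028.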

Are these changes tested?

I've checked it against the script attached to the issue, and a new test was added.

Yes, added some unit tests for _ColumnNameTranslator/translate_column_names.
Added a test for predicate evaluation for projected columns.

Are there any user-facing changes?

Yes: the results of some scans are now different.

pyiceberg/expressions/visitors.py

kevinjqliu

Thanks for working on this @Erigara. I just reviewed #1644 as well to help move this forward.

I have a question about how we'd want to structure this feature in the codebase

pyiceberg/expressions/visitors.py

kevinjqliu · 2025-07-22T18:30:19Z

removed milestone tag since the referenced issue (#2028) is already tagged

kevinjqliu · 2025-07-25T16:34:17Z

@Erigara now that #1644 is merged, could you rebase this PR?

pyiceberg/expressions/visitors.py

kevinjqliu · 2025-07-25T19:22:18Z

i merged main and added the projection logic to _ColumnNameTranslator
I was able to use the code in #2028 to verify that the counts are now the same.

kevinjqliu · 2025-07-25T19:24:09Z

i just realized that i/we had concerns about the _ColumnNameTranslator approach in #2029 (comment)

hmmm

kevinjqliu · 2025-07-26T04:40:52Z

I opened #2254 to resolve this issue. Sorry for taking over this PR @Erigara. I want to move this along so we can cut a release for 0.10. Please take a look at the other PR if you get a chance.

Copilot

Pull Request Overview

This PR fixes a bug in predicate evaluation for projected fields that are missing from the parquet file schema. Previously, when a column used in a predicate was not found in the file schema, the _ColumnNameTranslator would return AlwaysFalse, resulting in no matching rows. The fix implements proper column projection evaluation by passing projected field values to the translator and evaluating them against predicates before falling back to initial default values.

Enhanced _ColumnNameTranslator to accept and use projected field values for missing columns
Modified predicate evaluation order to check projected values first, then initial defaults
Updated the PyArrow integration to pass projected field values during row filter translation

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
pyiceberg/expressions/visitors.py	Enhanced `_ColumnNameTranslator` class and `translate_column_names` function to handle projected field values
pyiceberg/io/pyarrow.py	Modified row filter translation to pass projected field values from column projection
tests/expressions/test_visitors.py	Added comprehensive test cases for `translate_column_names` with various projected field scenarios
tests/io/test_pyarrow.py	Added integration test for row filter with partition value projection

pyiceberg/expressions/visitors.py

tests/expressions/test_visitors.py

Co-authored-by: Copilot <[email protected]>

kevinjqliu · 2025-07-27T17:02:35Z

ptal @sungwy @gabeiglio

Erigara · 2025-07-28T13:56:35Z

Like why you disappeared for weeks to then take over the PR, this doesn't look right to be honest.

kevinjqliu · 2025-07-28T15:39:54Z

> Like why you disappeared for weeks to then take over the PR, this doesn't look right to be honest.

hey @Erigara, sorry about the chain of events. I started a new job recently and didn't have time to review this PR before.

I pushed to the PR because I wanted to get the 0.10 release out to the community. We started the discussion at the beginning of July and multiple community members have asked about the release date. This was one of the last remaining bugs that I think would be good to include in the release. Hope you can understand that I'm trying to balance community goals and individual PR contributions.

Regarding the implementation using _ColumnNameTranslator: originally I had concerns about this. #1644 was then merged with initial-defaults projection in the same place, so it makes sense to now colocate the two.
I still think it's not the best place for the column projection logic. Happy to iterate on this together. I'm trying to make sure we hit all the corner cases so we can fix the original bug and release 0.10. We can come back to refactor this code.

Erigara · 2025-07-28T16:12:43Z

@kevinjqliu ok, you can ping me if there is anything left to update PR with, but looks good to go

gabeiglio · 2025-07-28T17:08:31Z

I agree that we should consider an alternative approach to avoid including these changes in _ColumnNameTranslator, especially since projection rules would otherwise be in different places in the code base.

However, since this is an important bug fix that needs to be included in this release, I think it's fine to proceed as is IMO. But we should open an issue to refactor this code in the future.

Thanks for working on this @Erigara @kevinjqliu

amogh-jahagirdar

Thanks @Erigara @kevinjqliu. Logically the changes make sense to me; I just had a small structural comment and some possible additional test cases worth adding. I'd also recommend holding until @Fokko is back, to get his feedback on this as well.

amogh-jahagirdar · 2025-07-30T17:34:09Z

pyiceberg/expressions/visitors.py

+            # In the order described by the "Column Projection" section of the Iceberg spec:
+            # https://iceberg.apache.org/spec/#column-projection
+            # Evaluate column projection first if it exists
+            if projected_field_value := self.projected_field_values.get(field.name):


Curious, what's the rationale for not including default value handling inside the logic that produces projected_field_values? Based on its comment, the intent of _get_column_projection_values seems to be to apply all the projection rules, but it looks like we apply most of them there and then fall through to applying the initial default here on line 928. It may be better if all of that logic were self-contained in the function, so that if things move around there isn't a separate place where default values are propagated.

How about condensing this into:

    # In the order described by the "Column Projection" section of the Iceberg spec:
    # https://iceberg.apache.org/spec/#column-projection
    # Evaluate column projection first if it exists, otherwise default to the initial-default-value
    field_value = (
        self.projected_field_values[field.name]
        if field.name in self.projected_field_values
        else field.initial_default
    )
    return (
        AlwaysTrue()
        if expression_evaluator(Schema(field), pred, case_sensitive=self.case_sensitive)(Record(field_value))
        else AlwaysFalse()
    )
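
One behavioral difference between the two forms is worth spelling out, whatever the motivation for the suggestion was: the walrus form tests the truthiness of the looked-up value, while the condensed form tests key membership. The sketch below models only the lookup, not the predicate evaluation, and uses hypothetical helper names and plain dictionaries rather than PyIceberg objects.

```python
from typing import Any, Dict

INITIAL_DEFAULT = "initial-default"  # hypothetical stand-in for field.initial_default


def pick_with_walrus(values: Dict[str, Any], name: str) -> Any:
    # Truthiness check: falsy projected values (0, "", False, None) fall through.
    if projected := values.get(name):
        return projected
    return INITIAL_DEFAULT


def pick_with_membership(values: Dict[str, Any], name: str) -> Any:
    # Membership check: any value stored under the key is used, even a falsy one.
    return values[name] if name in values else INITIAL_DEFAULT


projected_values = {"bucket": 0}
print(pick_with_walrus(projected_values, "bucket"))      # initial-default
print(pick_with_membership(projected_values, "bucket"))  # 0
```

A projected value of 0 (or an empty string, or False) is therefore only honored by the membership form in this sketch.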

amogh-jahagirdar · 2025-07-30T19:26:17Z

pyiceberg/expressions/visitors.py


-    def __init__(self, file_schema: Schema, case_sensitive: bool) -> None:
+    def __init__(
+        self, file_schema: Schema, case_sensitive: bool, projected_field_values: Optional[Dict[str, Any]] = None


Not sure if it's idiomatic Python, so up to you @kevinjqliu @Fokko, but is it possible to just make the default value here an empty dictionary, and then set self.projected_field_values = projected_field_values?

the current way is preferred over

projected_field_values: Dict[str, Any] = {}

The current way avoids using mutable default (projected_field_values={}), which is considered bad practice because it can lead to unexpected shared state across multiple calls or instances.
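
The shared-state problem mentioned above can be reproduced in a few lines. This is a generic Python illustration with hypothetical function names, not code from the PR.

```python
from typing import Any, Dict, Optional


def record_bad(key: str, seen: Dict[str, Any] = {}) -> Dict[str, Any]:
    # The default dict is created once, when the function is defined,
    # and reused by every call that omits `seen`.
    seen[key] = True
    return seen


def record_good(key: str, seen: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # A fresh dict is created on each call that omits `seen`.
    seen = {} if seen is None else seen
    seen[key] = True
    return seen


print(record_bad("a"))   # {'a': True}
print(record_bad("b"))   # {'a': True, 'b': True}  (state leaked from the first call)
print(record_good("a"))  # {'a': True}
print(record_good("b"))  # {'b': True}
```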

This is where FrozenDict comes in:

iceberg-python/pyiceberg/typedef.py

Lines 47 to 58 in 14ee8da

    class FrozenDict(Dict[Any, Any]):
        def __setitem__(self, instance: Any, value: Any) -> None:
            """Assign a value to a FrozenDict."""
            raise AttributeError("FrozenDict does not support assignment")

        def update(self, *args: Any, **kwargs: Any) -> None:
            raise AttributeError("FrozenDict does not support .update()")


    UTF8 = "utf-8"

    EMPTY_DICT = FrozenDict()

This avoids null-checking :)

Suggested change

-        self, file_schema: Schema, case_sensitive: bool, projected_field_values: Optional[Dict[str, Any]] = None
+        self, file_schema: Schema, case_sensitive: bool, projected_field_values: Dict[str, Any] = EMPTY_DICT

applied! just to double check, we still want to use the

self.projected_field_values = projected_field_values or {}

right?

https://github.com/apache/iceberg-python/pull/2029/files#diff-aa0e94ae2c31c3e2e7cfb1a4cff7d83422a47dcf8715709febcd5ff4aa662908R877-R881
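
A small sketch of how a frozen empty default behaves, and what `or {}` does to it. The `FrozenDict` below reproduces only the two methods quoted in this thread; `Translator` is a hypothetical stand-in for the visitor, not the actual class.

```python
from typing import Any, Dict


class FrozenDict(Dict[Any, Any]):
    # Only the two overrides quoted in the thread are reproduced here.
    def __setitem__(self, instance: Any, value: Any) -> None:
        raise AttributeError("FrozenDict does not support assignment")

    def update(self, *args: Any, **kwargs: Any) -> None:
        raise AttributeError("FrozenDict does not support .update()")


EMPTY_DICT = FrozenDict()


class Translator:
    """Hypothetical stand-in for a visitor that takes projected values."""

    def __init__(self, projected_field_values: Dict[str, Any] = EMPTY_DICT) -> None:
        # No None check is needed: the default is already a usable, empty mapping.
        self.projected_field_values = projected_field_values


t = Translator()
print(t.projected_field_values.get("missing"))  # None
try:
    t.projected_field_values["x"] = 1
except AttributeError as exc:
    print(exc)  # FrozenDict does not support assignment

# An empty FrozenDict is falsy, so `or {}` swaps it for a new plain dict.
print(type(EMPTY_DICT or {}).__name__)  # dict
```

Under these assumptions, keeping `projected_field_values or {}` is harmless with the `EMPTY_DICT` default, but it is no longer needed to guard against `None` unless a caller passes `None` explicitly.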

amogh-jahagirdar · 2025-07-30T19:32:52Z

tests/expressions/test_visitors.py

    assert expression_evaluator(schema, NotStartsWith("a", 1), case_sensitive=True)(struct) is True
+
+
+def test_translate_column_names_simple_case(table_schema_simple: Schema) -> None:


Thanks for all the test cases, I've stepped through them and they cover the cases I'd expect.
Some other additions that may be worth it:

1.) Disjunctive/conjunctive cases (Or, And, etc.) where one field is missing from the file and one field is not. Maybe mix this in with the rename case, where the file on disk has the field but under a different name (I see that's already tested in the single-predicate case, but it would sanity-check the combined cases).

2.) Maybe a nested field case, though it's really no different.

Down the line, when, say, Spark supports the DDL for default values, we can have end-to-end verification tests as well.
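
As a rough illustration of the first suggestion, the sketch below models an Or/And over one field that is missing from the file and one that is present under a different name. It is a toy model: `Eq`, `And`, `Or`, and `translate` are hypothetical names, Python booleans stand in for constant-true/false expressions, and initial defaults are not modeled. It is not taken from the PyIceberg test suite.

```python
from dataclasses import dataclass
from typing import Any, Dict, Union


@dataclass(frozen=True)
class Eq:
    column: str
    literal: Any


@dataclass(frozen=True)
class And:
    left: Any
    right: Any


@dataclass(frozen=True)
class Or:
    left: Any
    right: Any


Expr = Union[bool, Eq, And, Or]


def translate(expr: Expr, file_names: Dict[str, str], projected: Dict[str, Any]) -> Expr:
    """Rename columns to their file names; fold missing columns to constants.

    `file_names` maps a table column name to the name used in the file.
    """
    if isinstance(expr, bool):
        return expr
    if isinstance(expr, Eq):
        if expr.column in file_names:
            return Eq(file_names[expr.column], expr.literal)
        # Missing from the file: compare against the projected constant, if any.
        return bool(expr.column in projected and projected[expr.column] == expr.literal)
    left = translate(expr.left, file_names, projected)
    right = translate(expr.right, file_names, projected)
    if isinstance(expr, And):
        if left is False or right is False:
            return False
        if left is True:
            return right
        return left if right is True else And(left, right)
    # Remaining case: Or
    if left is True or right is True:
        return True
    if left is False:
        return right
    return left if right is False else Or(left, right)


file_names = {"id": "identifier"}  # "id" was renamed; "part" is not in the file
projected = {"part": 1}
print(translate(Or(Eq("part", 2), Eq("id", 5)), file_names, projected))
# Eq(column='identifier', literal=5)
print(translate(And(Eq("part", 2), Eq("id", 5)), file_names, projected))
# False
print(translate(And(Eq("part", 1), Eq("id", 5)), file_names, projected))
# Eq(column='identifier', literal=5)
```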

Fokko

Thanks @Erigara for your patience here, left some style suggestions, but I think this is good to go 👍

Fokko · 2025-08-04T21:16:57Z

pyiceberg/expressions/visitors.py

+            # In the order described by the "Column Projection" section of the Iceberg spec:
+            # https://iceberg.apache.org/spec/#column-projection
+            # Evaluate column projection first if it exists
+            if projected_field_value := self.projected_field_values.get(field.name):


How about condensing this into:

    # In the order described by the "Column Projection" section of the Iceberg spec:
    # https://iceberg.apache.org/spec/#column-projection
    # Evaluate column projection first if it exists, otherwise default to the initial-default-value
    field_value = (
        self.projected_field_values[field.name]
        if field.name in self.projected_field_values
        else field.initial_default
    )
    return (
        AlwaysTrue()
        if expression_evaluator(Schema(field), pred, case_sensitive=self.case_sensitive)(Record(field_value))
        else AlwaysFalse()
    )

Co-authored-by: Fokko Driesprong <[email protected]>

kevinjqliu · 2025-08-04T22:32:22Z

@Fokko i applied this refactor locally and test_translate_column_names_missing_column_projected_field_fallbacks_to_initial_default fails

I think we can remove this test since i cannot think of a scenario where the "projected field value doesn't match but initial_default does". If this happens, then the table is most likely corrupted. Right?

Fokko · 2025-08-05T08:52:43Z

Good one, after some thought, I think the test is incorrect. When we pass in the projected_field_values the field should not be considered missing anymore. If you import a table from Hive, where the partition field is not part of the DataFile, then we pass it in through the projected_field_values but it isn't considered missing.

Fokko · 2025-08-05T12:02:25Z

Let me try to add a test in a separate PR, I'm comfortable merging this in

kevinjqliu · 2025-08-05T15:04:29Z

thanks for the PR @Erigara and thank you @amogh-jahagirdar @Fokko for the review!

Closes apache#2028

# Rationale for this change

Provide expected result aligned with `spark` implementation.

This PR fixes a bug where predicate evaluation for a column that is missing from the parquet file schema will return no result. This is due to `_ColumnNameTranslator` visitor returning `AlwaysFalse` when the column cannot be found in the file schema. The solution is to pass in the projected field value for evaluation. This follows the order of operation described in https://iceberg.apache.org/spec/#column-projection

# Are these changes tested?

I've checked it on script attached to issue + new test was added.

Yes, added some unit tests for `_ColumnNameTranslator`/`translate_column_names`
Added a test for predicate evaluation for projected columns.

# Are there any user-facing changes?

Kinda yes, because results of some scans now different.

---------

Co-authored-by: Roman Shanin <[email protected]>
Co-authored-by: Kevin Liu <[email protected]>
Co-authored-by: Kevin Liu <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>

Erigara mentioned this pull request May 20, 2025

Scan with filtering on projected field rerurn empty table #2028

Closed

3 tasks

Erigara commented May 20, 2025

Erigara requested a review from Fokko May 21, 2025 17:57

Erigara force-pushed the projected_field_predicate branch 2 times, most recently from 5aa9940 to 3bb325f Compare June 12, 2025 15:26

Fokko added this to the PyIceberg 0.10.0 milestone Jun 24, 2025

kevinjqliu reviewed Jun 29, 2025

Roman Shanin added 4 commits July 1, 2025 11:18

extend signatue of translate_column_names

a8dbf6b

add test to check projected field predicate evaluator

e88a6a2

evaluate projected fields in predicate

befa05c

extract projected columns evaluator into separate visitor

a93b4eb

Erigara force-pushed the projected_field_predicate branch from 3bb325f to a93b4eb Compare July 1, 2025 08:45

Erigara requested a review from kevinjqliu July 1, 2025 08:46

kevinjqliu removed this from the PyIceberg 0.10.0 milestone Jul 22, 2025

Merge remote-tracking branch 'origin/main' into projected_field_predicate

10afbb8

kevinjqliu reviewed Jul 25, 2025

project value in _ColumnNameTranslator

1ce4889

kevinjqliu closed this Jul 25, 2025

kevinjqliu reopened this Jul 25, 2025

kevinjqliu mentioned this pull request Jul 26, 2025

Fix column projection predicate evaluation #2254

Closed

kevinjqliu added 5 commits July 26, 2025 11:43

remove other changes

7b2ecbb

fix

682afc5

fix

f9b53e0

fix logic

bc8d5c9

comments

7de4744

kevinjqliu requested a review from Copilot July 27, 2025 16:41

Copilot AI reviewed Jul 27, 2025

kevinjqliu and others added 2 commits July 27, 2025 09:45

Update tests/expressions/test_visitors.py

3d58180

Co-authored-by: Copilot <[email protected]>

Update tests/expressions/test_visitors.py

4757d6f

Co-authored-by: Copilot <[email protected]>

amogh-jahagirdar approved these changes Jul 30, 2025

Fokko approved these changes Aug 4, 2025

kevinjqliu and others added 2 commits August 4, 2025 15:21

Update pyiceberg/expressions/visitors.py

69a850e

Co-authored-by: Fokko Driesprong <[email protected]>

use EMPTY_DICT

fa78926

kevinjqliu merged commit cd7d8c7 into apache:main Aug 5, 2025
10 checks passed

kevinjqliu mentioned this pull request Aug 6, 2025

Convert _get_column_projection_values to use Field-IDs #2293

Merged
