Skip to content

Conversation

@kevinjqliu
Copy link
Contributor

Rationale for this change

Closes #2028
Heavily inspired by #2029 (thanks @Erigara)

This PR fixes a bug where predicate evaluation for a column that is not in the parquet file schema will return no result. This is due to _ColumnNameTranslator visitor returning AlwaysFalse when the column cannot be found in the file schema.

This PR follows the change in #1644 to evaluate predicates based on a field's initial_default. When the evaluated field does not have initial_default, _ColumnNameTranslator will now return AlwaysTrue to scan the entire parquet file and pass the result along for further evaluation.

Are these changes tested?

Yes, added some unit tests for _ColumnNameTranslator/translate_column_names
Added a test for predicate evaluation for projected columns.

Are there any user-facing changes?

This comment was marked as outdated.

@kevinjqliu kevinjqliu requested a review from sungwy July 26, 2025 04:45
@kevinjqliu
Copy link
Contributor Author

kevinjqliu commented Jul 26, 2025

cc @gabeiglio since you implemented #1443 :) could you also take a look?

@kevinjqliu kevinjqliu requested a review from Copilot July 26, 2025 04:48
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a bug in predicate evaluation for column projection where predicates on columns not present in the Parquet file schema would incorrectly return no results. The fix changes the behavior to return AlwaysTrue when a field has no initial_default, allowing for proper row-level filtering.

Key changes:

  • Modified _ColumnNameTranslator to return AlwaysTrue instead of AlwaysFalse for missing fields without initial_default
  • Added comprehensive unit tests for the translate_column_names function
  • Added integration tests for partition value projection with predicates

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
pyiceberg/expressions/visitors.py Updated _ColumnNameTranslator to handle missing fields without initial_default by returning AlwaysTrue
tests/expressions/test_visitors.py Added comprehensive unit tests for translate_column_names function covering various scenarios
tests/io/test_pyarrow.py Added integration tests for partition value projection with predicates
Comments suppressed due to low confidence (1)

tests/expressions/test_visitors.py:1684

  • The test creates a field with initial_default="default" but never tests the behavior when this field is missing from the file schema. Consider adding a test case to verify that fields with string defaults are handled correctly.
        NestedField(field_id=1, name="existing_col", field_type=StringType(), required=False, initial_default="default"),

@kevinjqliu kevinjqliu force-pushed the kevinjqliu/translate_column_names branch from 06c4a4d to 50b7b39 Compare July 26, 2025 17:06
@kevinjqliu
Copy link
Contributor Author

this isnt right. we should evaluate the column projection predicates

@kevinjqliu
Copy link
Contributor Author

Closing in favor of #2029

@kevinjqliu kevinjqliu closed this Jul 27, 2025
@kevinjqliu kevinjqliu deleted the kevinjqliu/translate_column_names branch July 27, 2025 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Scan with filtering on projected field rerurn empty table

1 participant