Add ResidualVisitor to compute residuals #1388

tusharchou · 2024-11-28T18:18:57Z

closes issue: Count rows as a metadata-only operation #1223

…y-op add count in data scan and test in catalog sql

pyiceberg/table/__init__.py

jayceslesar · 2024-12-02T20:06:04Z

Question: Does it make sense to also expose this as the `__len__` dunder method, since this is Python? It would just return `self.count()`.
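To make the suggestion concrete, here is a minimal sketch of `__len__` delegating to `count()`. `ScanSketch` is a hypothetical stand-in, not the real `DataScan` class; its constructor and the body of `count()` are assumptions made only for illustration.

```python
class ScanSketch:
    """Hypothetical stand-in for a table scan; not the pyiceberg DataScan class."""

    def __init__(self, record_counts: list[int]) -> None:
        # Assumption: each entry is the record count of one planned data file.
        self._record_counts = record_counts

    def count(self) -> int:
        # Sum the per-file record counts.
        return sum(self._record_counts)

    def __len__(self) -> int:
        # Delegate to count() so that len(scan) works.
        return self.count()


scan = ScanSketch([3, 4, 5])
total = len(scan)
```

One trade-off to weigh: Python also falls back to `__len__` for truthiness checks when `__bool__` is not defined, so `if scan:` would trigger a full count.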

Residual Evaluator with test

* added residual evaluator in plan files
* tested counts with positional deletes
* merged main

pyiceberg/table/__init__.py

* added residual evaluator in plan files
* tested counts with positional deletes
* merged main
* implemented batch reader in count
* breaking integration test
* fixed integration test
* git pull main
* revert
* revert
* revert test_partitioning_key.py
* revert test_parser.py
* added residual evaluator in visitor
* deleted residual_evaluator.py
* removed test count from test_sql.py
* ignored lint type
* fixed lint
* working on plan_files
* type ignored
* make lint

tusharchou · 2025-01-06T10:20:42Z

Hi @Fokko @kevinjqliu @gli-chris-hao,

I have implemented these suggestions to the best of my understanding:

* residual evaluator
* positional deletes
* batch processing of files larger than 512 MB

It would be helpful to get a fresh review.
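The three points above suggest a count that uses file metadata where it can and reads data only where it must. The following is a rough sketch of that idea, not the code in this PR: `TaskSketch`, `count_rows`, and the `read_and_count` callback are hypothetical names, and the residual is modeled as a plain string.

```python
from dataclasses import dataclass, field
from typing import Callable, List

ALWAYS_TRUE = "AlwaysTrue"  # placeholder: residual that needs no row-level check


@dataclass
class TaskSketch:
    """Hypothetical stand-in for one planned file scan task."""

    record_count: int  # row count recorded in the file's metadata
    residual: str = ALWAYS_TRUE  # filter still to be applied to the file's rows
    delete_files: List[str] = field(default_factory=list)  # positional deletes


def count_rows(tasks: List[TaskSketch], read_and_count: Callable[[TaskSketch], int]) -> int:
    total = 0
    for task in tasks:
        if task.residual == ALWAYS_TRUE and len(task.delete_files) == 0:
            # Metadata-only path: every row matches and none are deleted.
            total += task.record_count
        else:
            # Fallback: rows must be read (ideally in batches) and filtered.
            total += read_and_count(task)
    return total


read_log: List[TaskSketch] = []


def fake_reader(task: TaskSketch) -> int:
    # Stand-in for an actual file read; records which tasks needed one.
    read_log.append(task)
    return 7


tasks = [
    TaskSketch(record_count=100),
    TaskSketch(record_count=50, residual="x < 10"),
    TaskSketch(record_count=20, delete_files=["deletes.parquet"]),
]
total = count_rows(tasks, fake_reader)
```

Only the second and third tasks reach the reader here; the first is counted from metadata alone.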

pyiceberg/expressions/visitors.py

pyiceberg/table/__init__.py

* added residual evaluator in plan files
* tested counts with positional deletes
* merged main
* implemented batch reader in count
* breaking integration test
* fixed integration test
* git pull main
* revert
* revert
* revert test_partitioning_key.py
* revert test_parser.py
* added residual evaluator in visitor
* deleted residual_evaluator.py
* removed test count from test_sql.py
* ignored lint type
* fixed lint
* working on plan_files
* type ignored
* make lint
* explicit delete files len is zero
* residual eval only if manifest is true
* default residual is always true
* used projection schema
* refactored residual in plan files
* fixed lint issue with isnan
* simplified count if else conditions
* implemented refactoring comments on residual visitor

Sometimes I'm seeing this:

```
ImportError while loading conftest '/home/runner/work/iceberg-python/iceberg-python/tests/conftest.py'.
tests/conftest.py:52: in <module>
    from pyiceberg.catalog import Catalog, load_catalog
pyiceberg/catalog/__init__.py:51: in <module>
    from pyiceberg.serializers import ToOutputFile
pyiceberg/serializers.py:25: in <module>
    from pyiceberg.table.metadata import TableMetadata, TableMetadataUtil
pyiceberg/table/__init__.py:65: in <module>
    from pyiceberg.io.pyarrow import ArrowScan, schema_to_pyarrow
pyiceberg/io/pyarrow.py:141: in <module>
    from pyiceberg.table.locations import load_location_provider
pyiceberg/table/locations.py:25: in <module>
    from pyiceberg.table import TableProperties
E   ImportError: cannot import name 'TableProperties' from partially initialized module 'pyiceberg.table' (most likely due to a circular import) (/home/runner/work/iceberg-python/iceberg-python/pyiceberg/table/__init__.py)
```

Also observed in: apache#1388

I prefer the imports at the top, but I think this is a small price to pay to avoid having circular imports.
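For readers unfamiliar with the workaround, the sketch below shows the general deferred-import technique on a throwaway package. Everything in it is hypothetical: `circ_demo_pkg`, its two modules, and the `EXAMPLE` attribute are invented for the demonstration, the cycle is shortened to two modules, and which pyiceberg module actually received the deferred import is not stated in this thread.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# table.py imports locations at module level, as in the traceback.
TABLE_SRC = '''\
from circ_demo_pkg.locations import load_location_provider


class TableProperties:
    EXAMPLE = "example.property"
'''

# locations.py defers its import of table into the function body.
LOCATIONS_SRC = '''\
def load_location_provider():
    # Deferred import: runs only after circ_demo_pkg.table is fully initialized,
    # so the table -> locations -> table cycle never sees a half-built module.
    from circ_demo_pkg.table import TableProperties

    return TableProperties.EXAMPLE
'''


def run_demo() -> str:
    """Write the demo package to a temp dir, import it, and call the function."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "circ_demo_pkg"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "table.py").write_text(TABLE_SRC)
        (pkg / "locations.py").write_text(LOCATIONS_SRC)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            table = importlib.import_module("circ_demo_pkg.table")
            return table.load_location_provider()
        finally:
            sys.path.remove(tmp)


result = run_demo()
```

If the import in `locations.py` were moved back to module level, importing `circ_demo_pkg.table` would fail with the same "partially initialized module" error shown in the traceback.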


tests/expressions/test_residual_evaluator.py

Fokko

def test_fokko(session_catalog):
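    import datetime

    import pyarrow.parquet as pq

    from pyiceberg.expressions import LessThan
    from pyiceberg.io.pyarrow import expression_to_pyarrow
    from pyiceberg.transforms import DayTransform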

    df = pq.read_table("/Users/fokko.driesprong/Downloads/yellow_tripdata_2024-01.parquet")

    session_catalog.drop_table("default.taxis")
    tbl = session_catalog.create_table("default.taxis", schema=df.schema)
    with tbl.update_spec() as spec:
        spec.add_field("tpep_pickup_datetime", DayTransform())
    tbl.append(df)

    # tbl = session_catalog.load_table("default.taxis")

    pred = LessThan("tpep_pickup_datetime", datetime.datetime(2024, 1, 3, 12, 0, 0))

    cnt = tbl.scan(row_filter=pred).count()
    assert cnt == len(df.filter(expression_to_pyarrow(pred.bind(tbl.schema()))))

Works very well.
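The test above filters a day-partitioned timestamp column with a cutoff that falls mid-day, which is the case where residuals matter. As a hedged illustration of the concept only (a hand-rolled function, not the `ResidualVisitor` added in this PR), the residual of `ts < cutoff` for a single day partition can be worked out like this; `residual_for_day` and the string constants are hypothetical names.

```python
import datetime

ALWAYS_TRUE = "AlwaysTrue"    # every row in the partition matches
ALWAYS_FALSE = "AlwaysFalse"  # no row in the partition can match


def residual_for_day(partition_day: datetime.date, cutoff: datetime.datetime) -> str:
    """Residual of `ts < cutoff` for a file whose rows all fall on partition_day."""
    day_start = datetime.datetime.combine(partition_day, datetime.time.min)
    next_day_start = day_start + datetime.timedelta(days=1)
    if next_day_start <= cutoff:
        # Every ts in this day is < next_day_start <= cutoff.
        return ALWAYS_TRUE
    if day_start >= cutoff:
        # Every ts in this day is >= day_start >= cutoff.
        return ALWAYS_FALSE
    # The cutoff falls inside this day: rows must still be checked.
    return f"ts < {cutoff.isoformat()}"


cutoff = datetime.datetime(2024, 1, 3, 12, 0, 0)
before = residual_for_day(datetime.date(2024, 1, 2), cutoff)
straddling = residual_for_day(datetime.date(2024, 1, 3), cutoff)
after = residual_for_day(datetime.date(2024, 1, 4), cutoff)
```

With the cutoff used in the test, only the 2024-01-03 partition keeps a non-trivial residual; earlier days can be counted from metadata and later days contribute nothing.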

pyiceberg/table/__init__.py

Fokko · 2025-02-11T12:08:23Z

Thanks @tusharchou for working on this 🚀

This doesn't seem to be used. Less is more! Noticed this while reviewing #1388

mrutunjay-kinagi · 2025-02-11T19:17:21Z

🚀

tusharchou added 7 commits November 28, 2024 19:02

Create test_scan_count.py

731542e

moved test_scan_count.py to tests

c6c971e

implemented count in data scan

da18837

tested table scan count in test_sql catalog

3104a2f

refactoring

c2740ea

make lint

90bca84

Merge pull request #1 from tusharchou/gh-1223-count-rows-metadata-onl…

f7202b9

…y-op add count in data scan and test in catalog sql

kevinjqliu reviewed Nov 28, 2024

pyiceberg/table/__init__.py

Merge branch 'apache:main' into main

c7205b3

tusharchou mentioned this pull request Dec 3, 2024

How do I find if there is residual in the table scan/plan files? #785

Closed

tusharchou added 10 commits December 11, 2024 11:39

Merge branch 'apache:main' into main

09f9c10

Merge branch 'apache:main' into main

1e9da22

Merge branch 'apache:main' into main

3ab20d4

implemeted residual_evaluator.py with tests

091c0af

added license

3cd797d

fixed lint

6b0924e

fixed lint errors

96cb4e9

Merge pull request #3 from tusharchou/gh-1223-metadata-only-row-count

212c83b

Residual Evaluator with test

Merge branch 'apache:main' into main

8bc65fa

Gh 1223 metadata only row count (#4)

8bb039f

* added residual evaluator in plan files
* tested counts with positional deletes
* merged main

gli-chris-hao reviewed Dec 31, 2024

pyiceberg/table/__init__.py

tusharchou added 3 commits January 4, 2025 11:32

Merge branch 'apache:main' into main

0019f92

Merge branch 'apache:main' into main

a372a93

tusharchou requested review from Fokko, gli-chris-hao and kevinjqliu January 6, 2025 10:20

Fokko changed the title ~~Count rows as a metadata only operation~~ Add ResidualVisitor to compute residuals Jan 13, 2025