Count rows as a metadata-only operation

### Feature Request / Improvement

Hello!

I'm using PyIceberg 0.7.1.

I have a use-case where I need to count rows given a certain filter, and I was expecting it to be doable with PyIceberg as a metadata-only operation, given that manifest files contain counts of rows in each data file.

I figured out this code to count rows:
```
query = "col1 = 'val_X' AND col2 = 'val_Y' AND ..."
scan = table.scan(row_filter=query)
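# to_duckdb() reads the filtered rows into an Arrow table, registers it in
# DuckDB as "data", and returns the DuckDB connection
con = scan.to_duckdb("data")
res = con.sql("SELECT count(*) FROM data")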
```
but this loads the filtered data (the rows matching the `query` expression) into memory first, and only then computes the count.
I couldn't figure out how to get the result without first converting to either a `duckdb` or a `pyarrow` dataframe.

Is there a way to do such an operation without loading data into memory, i.e. as a metadata-only operation?
If not, I believe this would be a good feature to have in PyIceberg.
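
To illustrate what I mean, here is a rough sketch of how such a count might look, assuming that `plan_files()` on a scan yields tasks whose `file.record_count` comes from the manifests. The helper name `estimate_row_count` is hypothetical and not part of PyIceberg. As far as I understand, the result is only exact when the filter lines up with partition boundaries (otherwise it is an upper bound, since pruning works per file) and when no delete files apply.

```python
def estimate_row_count(table, row_filter):
    """Sum per-file record counts for the files that survive scan planning.

    Hypothetical helper, not a PyIceberg API. Returns (row_count, has_deletes).
    row_count is an upper bound unless every planned file matches the filter
    in full; has_deletes signals that delete files would lower the true count.
    """
    total = 0
    has_deletes = False
    # Planning reads manifest metadata only; no data file is opened here.
    for task in table.scan(row_filter=row_filter).plan_files():
        total += task.file.record_count
        # Assumption: tasks expose attached delete files as `delete_files`.
        if getattr(task, "delete_files", None):
            has_deletes = True
    return total, has_deletes
```

Usage would be something like `rows, has_deletes = estimate_row_count(table, query)`, falling back to a real scan when `has_deletes` is true or the filter is not on partition columns.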

I have tried Daft, which is supposed to be a `fully lazily optimized query engine interface on top of PyIceberg tables`, but it still seems to need to load data into memory, even when I do `.limit(1)`.

Count rows as a metadata-only operation #1223
