
Count rows as a metadata-only operation #1223

@Visorgood

Feature Request / Improvement

Hello!

I'm using PyIceberg 0.7.1.

I have a use case where I need to count the rows that match a given filter. I expected this to be doable with PyIceberg as a metadata-only operation, given that manifest files contain the row count of each data file.

I came up with this code to count rows:

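query = "col1 = 'val_X' AND col2 = 'val_Y' AND ..."
scan = table.scan(row_filter=query)
# to_duckdb() returns a DuckDB connection with the filtered rows loaded into table "data"
con = scan.to_duckdb("data")
res = con.sql("SELECT count(*) FROM data").fetchone()[0]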

However, this first loads the filtered data (using the query expression) into memory and only then calculates the count.
I couldn't find a way to get the result without first converting to either a DuckDB table or a PyArrow table.

Is there a way to do such an operation without loading data into memory, i.e. as a metadata-only operation?
If not, I believe this would be a good feature to have in PyIceberg.

I have also tried Daft, which is supposed to be a fully lazy, optimized query engine interface on top of PyIceberg tables, but it still seems to need to load data into memory, even when I do .limit(1).
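
To make the request concrete, here is a rough sketch of the kind of metadata-only count I have in mind. It assumes that `scan.plan_files()` yields tasks exposing `file.record_count` and `delete_files`; the helper name `count_rows_from_metadata` is made up for illustration and is not part of PyIceberg. As far as I can tell, the sum is only an upper bound unless the filter selects whole data files (for example, a filter on partition columns only) and no delete files apply.

```python
from typing import Any, Iterable, Tuple


def count_rows_from_metadata(tasks: Iterable[Any]) -> Tuple[int, bool]:
    """Sum per-file record counts from planned scan tasks (hypothetical helper).

    Each task is assumed to expose `file.record_count` and `delete_files`,
    like the tasks produced by `table.scan(...).plan_files()`.

    Returns (row_count, has_deletes). The count is an upper bound: a data
    file is included whole even if only some of its rows match the filter,
    and rows removed by delete files are not subtracted.
    """
    total = 0
    has_deletes = False
    for task in tasks:
        # record_count comes from the manifest entry; no data file is opened
        total += task.file.record_count
        if task.delete_files:
            has_deletes = True
    return total, has_deletes


# Intended usage (assumes `table` is an already loaded PyIceberg table):
# tasks = table.scan(row_filter="col1 = 'val_X'").plan_files()
# upper_bound, has_deletes = count_rows_from_metadata(tasks)
```

Planning the scan only reads manifests, so nothing is pulled into DuckDB or PyArrow. An exact count for arbitrary filters would still need to read the files that are only partially matched.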
