Description
Feature Request / Improvement
Hello!
I'm using PyIceberg 0.7.1.
I have a use case where I need to count the rows that match a given filter. I expected this to be doable with PyIceberg as a metadata-only operation, since manifest files contain the row count of each data file.
I figured out this code to count rows:
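query = "col1 = 'val_X' AND col2 = 'val_Y' AND ..."
scan = table.scan(row_filter=query)
# to_duckdb() registers the scan result as the DuckDB table "data"
# and returns a DuckDB connection, not a dataframe.
con = scan.to_duckdb("data")
res = con.sql("SELECT count(*) FROM data")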
but this loads the filtered data (using the query expression) into memory first, and only then computes the count.
I couldn't find a way to get the result without first converting to either DuckDB or a PyArrow table.
Is there a way to do such an operation without loading data into memory, as a metadata-only operation?
If not, I believe this would be a good feature to have in PyIceberg.
I have tried Daft, which is supposed to be a fully lazily optimized query engine interface on top of PyIceberg tables, but it still seems to need to load data into memory, even when I do .limit(1).
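For illustration, here is a rough sketch of the kind of metadata-only count I have in mind. It assumes that `table.scan(...).plan_files()` yields tasks exposing `file.record_count` and `delete_files`. The helper name `upper_bound_row_count` is hypothetical and not part of PyIceberg. Because file pruning only skips files that cannot match the filter, the sum is an upper bound: it is exact only when every planned file matches the filter entirely (for example, a filter on partition columns only) and no delete files apply.

```python
from typing import Iterable, Tuple


def upper_bound_row_count(tasks: Iterable) -> Tuple[int, bool]:
    """Sum per-file record counts from planned scan tasks.

    Hypothetical helper, not a PyIceberg API. Returns (row_count, has_deletes).
    The count is an upper bound: a file that only partially matches the
    filter still contributes all of its rows, and rows removed by delete
    files are not subtracted.
    """
    total = 0
    has_deletes = False
    for task in tasks:
        # record_count comes from manifest metadata; no data file is read.
        total += task.file.record_count
        if task.delete_files:
            has_deletes = True
    return total, has_deletes


# Intended usage (assumes `table` is an already loaded PyIceberg table):
# tasks = table.scan(row_filter="col1 = 'val_X'").plan_files()
# count, has_deletes = upper_bound_row_count(tasks)
```

If `has_deletes` is true, or the filter touches non-partition columns, the number should be treated as an estimate rather than an exact count.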