Conversation

@nuno-faria
Contributor

Which issue does this PR close?

Rationale for this change

When max_predicate_cache_size is set to 0, the predicate cache is disabled, so there is no need to keep selecting data pages until batch_size is reached.

What changes are included in this PR?

  • Make ReaderFactory::compute_cache_projection return None if the cache is disabled, so the reader no longer fetches multiple pages unnecessarily.
  • Added a unit test to confirm the new behavior.
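
The guard described above can be sketched as follows. This is a simplified, hypothetical illustration, not the parquet crate's actual code: `CacheProjection` and the function signature are stand-ins for the real `ReaderFactory::compute_cache_projection`.

```rust
// Hypothetical sketch of the change: a zero-sized predicate cache means
// caching is disabled, so no cache projection is computed and the reader
// falls back to plain page-at-a-time reads.

#[derive(Debug, PartialEq)]
struct CacheProjection {
    /// Column indices whose pages would be cached for predicate evaluation.
    columns: Vec<usize>,
}

fn compute_cache_projection(
    max_predicate_cache_size: usize,
    predicate_columns: &[usize],
) -> Option<CacheProjection> {
    // The new early return: cache size 0 disables caching entirely,
    // so callers never prefetch multiple pages on its behalf.
    if max_predicate_cache_size == 0 {
        return None;
    }
    Some(CacheProjection {
        columns: predicate_columns.to_vec(),
    })
}

fn main() {
    // Cache disabled: no projection, no extra page fetches.
    assert_eq!(compute_cache_projection(0, &[1, 2]), None);
    // Cache enabled: projection covers the predicate columns.
    assert!(compute_cache_projection(1024, &[1, 2]).is_some());
    println!("ok");
}
```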

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

Contributor

@alamb alamb left a comment


Thank you @nuno-faria -- this looks great to me

.unwrap();
let parquet_schema = metadata.file_metadata().schema_descr_ptr();

// the filter is not clone-able, so we use a lambda to simplify

yeah, this is something that makes the filters very tricky to handle internally. Nothing to change for this PR, I am just observing
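
The difficulty being observed can be sketched like this. The `RowFilter` below is an illustrative stand-in, not the parquet crate's actual type: because it holds boxed predicate functions, it cannot derive `Clone`, so a factory closure is used to rebuild a fresh filter whenever one is needed.

```rust
// Illustrative sketch: a filter holding boxed closures is not Clone,
// which is what makes reusing filters across reader builds tricky.
struct RowFilter {
    // Box<dyn Fn> does not implement Clone, so neither can RowFilter.
    predicates: Vec<Box<dyn Fn(i64) -> bool>>,
}

fn main() {
    // The workaround from the comment: wrap construction in a closure
    // so each use gets its own freshly built filter.
    let make_filter = || RowFilter {
        predicates: vec![Box::new(|v| v > 10)],
    };

    let f1 = make_filter();
    let f2 = make_filter();
    assert!(f1.predicates[0](11));
    assert!(!f2.predicates[0](5));
    println!("ok");
}
```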

@alamb
Contributor

alamb commented Oct 6, 2025

FYI @XiangpengHao

Contributor

@XiangpengHao XiangpengHao left a comment


Looks good to me, thank you @nuno-faria

@alamb alamb merged commit 84a7e35 into apache:main Oct 7, 2025
16 checks passed
@alamb
Contributor

alamb commented Oct 7, 2025

Thanks again @nuno-faria and @XiangpengHao


Labels

parquet Changes to the parquet crate


Development

Successfully merging this pull request may close these issues.

[Parquet] Avoid fetching multiple pages when max_predicate_cache_size is 0

3 participants