Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Oct 31, 2025

Which issue does this PR close?

Rationale for this change

I am reworking how the parquet decoder's state machine works in #8159

One of the unit tests, test_cache_projection_excludes_nested_columns uses non-public APIs that I am changing

Rather than rewrite them into other non public APIs I think it would be better if this test is in terms of public APIs

What changes are included in this PR?

  1. refactor test_cache_projection_excludes_nested_columns to use high level APIs

Are these changes tested?

They are run in CI

I also verified this test covers the intended functionality by commenting it out:

--- a/parquet/src/arrow/async_reader/mod.rs
+++ b/parquet/src/arrow/async_reader/mod.rs
@@ -724,7 +724,9 @@ where
             cache_projection.union(predicate.projection());
         }
         cache_projection.intersect(projection);
-        self.exclude_nested_columns_from_cache(&cache_projection)
+        // TEMP don't exclude nested columns
+        //self.exclude_nested_columns_from_cache(&cache_projection)
+        Some(cache_projection)
     }

     /// Exclude leaves belonging to roots that span multiple parquet leaves (i.e. nested columns)

And then running the test:

cargo test --all-features --test arrow_reader

And the test fails (as expected)

---- predicate_cache::test_cache_projection_excludes_nested_columns stdout ----

thread 'predicate_cache::test_cache_projection_excludes_nested_columns' panicked at parquet/tests/arrow_reader/predicate_cache.rs:244:9:
assertion `left == right` failed: Expected 0 records read from cache, but got 100
  left: 100
 right: 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    predicate_cache::test_cache_projection_excludes_nested_columns

test result: FAILED. 88 passed; 1 failed; 1 ignored; 0 measured; 0 filtered out; finished in 0.20s

Are there any user-facing changes?

No, this is only test changes

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 31, 2025
@alamb alamb marked this pull request as ready for review October 31, 2025 12:40
@alamb
Copy link
Contributor Author

alamb commented Oct 31, 2025

@XiangpengHao can I trouble you for a review of this PR (as you wrote the original version of the test)?

Copy link
Contributor

@XiangpengHao XiangpengHao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! This is much more readable than before 💯

@alamb alamb merged commit 06f7f07 into apache:main Oct 31, 2025
32 checks passed
@alamb
Copy link
Contributor Author

alamb commented Oct 31, 2025

Tahnk you @XiangpengHao

@alamb alamb deleted the alamb/rewrite_tests branch October 31, 2025 21:29
alamb added a commit that referenced this pull request Nov 10, 2025
# Which issue does this PR close?

- Part of #7983
- Part of #8000
- closes #8677

I am also working on a blog post about this
- #8035

# TODOs
- [x] Rewrite `test_cache_projection_excludes_nested_columns` in terms
of higher level APIs (#8754)
- [x] Benchmarks
- [x] Benchmarks with DataFusion:
apache/datafusion#18385

# Rationale for this change

A new ParquetPushDecoder was implemented here
- #7997

I need to refactor the async and sync readers to use the new push
decoder in order to:

1. avoid the [xkcd standards effect](https://xkcd.com/927/) (aka there
are now three control loops)
3. Prove that the push decoder works (by passing all the tests of the
other two)
4. Set the stage for improving filter pushdown more with a single
control loop

<img width="400" alt="image"
src="https://github.com/user-attachments/assets/e6886ee9-58b3-4a1e-8e88-9d2d03132b19"
/>


# What changes are included in this PR?

1. Refactor the `ParquetRecordBatchStream` to use `ParquetPushDecoder`

# Are these changes tested?
Yes, by the existing CI tests

I also ran several benchmarks, both in arrow-rs and in DataFusion and I
do not see any substantial performance difference (as expected):
- apache/datafusion#18385

# Are there any user-facing changes?

No

---------

Co-authored-by: Vukasin Stefanovic <[email protected]>
alamb added a commit that referenced this pull request Nov 10, 2025
# Which issue does this PR close?

- Follow on to #8754


# Rationale for this change


While working on #8754 I found
the current formulation a bit akward so let's fix that


# What changes are included in this PR?

Move the builder configuration to a trait so the tests read more
fluently.

This is totally unecessary, it just makes me feel better about the tests

# Are these changes tested?

Yes by CI

# Are there any user-facing changes?
No this is test only
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants