Contributor

@alamb alamb commented Nov 18, 2025

Note: this PR contains a one-line fix; the rest is tests, comments, and reorganization to support the tests.

Which issue does this PR close?

Rationale for this change

#8657 is a regression

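The check for "is this column nested" did not work correctly for Lists in Parquet, due to the somewhat wacky way Lists are encoded: a list of primitives is a nested group that still has only a single leaf column under its root, so counting leaves alone does not identify it as nested.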
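To make the encoding concrete, here is a minimal sketch using a toy schema tree. The `Node` type and the helper functions are hypothetical and are not the parquet crate's types; the sketch only assumes the standard three-level LIST layout (`optional group my_list (LIST) { repeated group list { optional int32 element; } }`).

```rust
// Toy model of a Parquet schema tree. These types are illustrative only
// and are NOT the parquet crate's schema types.
enum Node {
    /// A primitive (leaf) column.
    Leaf { name: &'static str },
    /// A group of child fields; `repeated` marks a REPEATED group.
    Group {
        name: &'static str,
        repeated: bool,
        children: Vec<Node>,
    },
}

impl Node {
    fn name(&self) -> &'static str {
        match self {
            Node::Leaf { name } | Node::Group { name, .. } => *name,
        }
    }

    /// Number of leaf (primitive) columns at or below this node.
    fn leaf_count(&self) -> usize {
        match self {
            Node::Leaf { .. } => 1,
            Node::Group { children, .. } => children.iter().map(Node::leaf_count).sum(),
        }
    }

    /// True if any group at or below this node is repeated.
    fn has_repetition(&self) -> bool {
        match self {
            Node::Leaf { .. } => false,
            Node::Group { repeated, children, .. } => {
                *repeated || children.iter().any(Node::has_repetition)
            }
        }
    }
}

/// Three-level list encoding of `list<int32>`.
fn list_of_int32() -> Node {
    Node::Group {
        name: "my_list",
        repeated: false,
        children: vec![Node::Group {
            name: "list",
            repeated: true,
            children: vec![Node::Leaf { name: "element" }],
        }],
    }
}

/// A plain, non-nested int32 column.
fn plain_int32() -> Node {
    Node::Leaf { name: "id" }
}

fn main() {
    for root in [list_of_int32(), plain_int32()] {
        println!(
            "{}: leaves={}, has_repetition={}",
            root.name(),
            root.leaf_count(),
            root.has_repetition()
        );
    }
}
```

Both roots report exactly one leaf, so a check based only on the leaf count cannot tell the list apart from the plain column; only the repeated group in the middle reveals the nesting.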

What changes are included in this PR?

  1. Fix the bug
  2. Move the code into ProjectionMask::without_nested_types, mostly so I could write better tests for it
  3. Write a lot of tests

Are these changes tested?

Yes. Both the reproducer from #8657 and a number of new tests are added.

Are there any user-facing changes?

There is a new API: ProjectionMask::without_nested_types

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 18, 2025
@alamb alamb force-pushed the alamb/fix_parquet_nesting_issue branch from c5f8afe to 0956202 Compare November 18, 2025 17:58
}

#[tokio::test]
async fn test_nested_lists() -> Result<()> {
Contributor Author

} else {
Some(ProjectionMask::leaves(schema, included_leaves))
}
mask.without_nested_types(self.metadata.file_metadata().schema_descr())
Contributor Author

I moved the logic into ProjectionMask

if self.leaf_included(leaf_idx) {
    let root = schema.get_column_root(leaf_idx);
    let root_idx = schema.get_column_root_idx(leaf_idx);
    if root_leaf_counts[root_idx] == 1 && !root.is_list() {
Contributor Author

The only code difference here is to add the check for !root.is_list()

Contributor

makes sense to me!
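For readers following along, the effect of the added `!root.is_list()` check can be sketched with a simplified, self-contained model. The `Root` struct, the `without_nested` function, and the `check_list` flag are hypothetical names for illustration, not the crate's implementation; the sketch assumes the filter keeps a leaf only when its root column is not nested.

```rust
/// Hypothetical per-root summary. The real code derives this information
/// from the schema descriptor.
struct Root {
    leaf_count: usize,
    is_list: bool,
}

/// For each leaf, report whether it survives the "no nested types" filter.
/// `leaf_to_root[i]` is the index of the root column that leaf `i` belongs to.
/// `check_list = false` mimics the old check; `true` mimics the fixed one.
fn without_nested(
    included: &[bool],
    leaf_to_root: &[usize],
    roots: &[Root],
    check_list: bool,
) -> Vec<bool> {
    included
        .iter()
        .zip(leaf_to_root)
        .map(|(&inc, &root_idx)| {
            let root = &roots[root_idx];
            // Old check: a root with exactly one leaf counts as non-nested.
            // Fixed check: additionally require that the root is not a list.
            inc && root.leaf_count == 1 && (!check_list || !root.is_list)
        })
        .collect()
}

fn main() {
    // Assumed schema: a: int32, b: list<int32>, c: struct { x: int32, y: int32 }
    let roots = [
        Root { leaf_count: 1, is_list: false }, // a
        Root { leaf_count: 1, is_list: true },  // b
        Root { leaf_count: 2, is_list: false }, // c
    ];
    let leaf_to_root = [0, 1, 2, 2];
    let included = [true; 4];

    let before = without_nested(&included, &leaf_to_root, &roots, false);
    let after = without_nested(&included, &leaf_to_root, &roots, true);
    println!("before fix: {:?}", before); // [true, true, false, false]
    println!("after fix:  {:?}", after); // [true, false, false, false]
}
```

In this model the old check lets the list's single leaf through as if it were a flat column, while the fixed check excludes it along with the struct's leaves.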

#[test]
fn test_mask_from_column_names() {
let message_type = "
let schema = parse_schema(
Contributor Author

I just reduced some duplication

}

#[test]
fn test_projection_mask_without_nested_list() {
Contributor Author

this test fails without the code change

Contributor Author

alamb commented Nov 18, 2025

FYI @XiangpengHao

Contributor

@etseidl etseidl left a comment

Wow, nice sleuthing @alamb!


writer.close().await?;

println!("Parquet file written successfully!");
Contributor

Remove?

Contributor Author

Yes, good call! That was (also) left over from the test 🤦

Contributor Author

alamb commented Nov 18, 2025

I plan to merge this tomorrow in case @lewiszlw would also like a chance to review

@lewiszlw
Member

Thank you for the fix! I applied this patch to parquet 56 and verified it in my project; everything works fine.

Contributor Author

alamb commented Nov 19, 2025

Thank you @etseidl @XiangpengHao and @lewiszlw -- sorry this one took so long

@alamb alamb merged commit 389f404 into apache:main Nov 19, 2025
16 checks passed
@alamb alamb deleted the alamb/fix_parquet_nesting_issue branch November 19, 2025 11:32
Development

Successfully merging this pull request may close these issues.

Parquet 56: encounter error: item_reader def levels are None when reading nested field with row filter

4 participants