Skip to content

Conversation

@vustef
Copy link
Contributor

@vustef vustef commented Nov 18, 2025

Which issue does this PR close?

Closes #8864.

Rationale for this change

#8715 introduced row numbers feature last week. However, it had a bug, which luckily for us @scovich pointed out soon after the merge. The issue is that the row numbers are produced in ordinal-based order of the row groups, instead of user-requested order of row groups. The former is wrong, and is being fixed here by switching to user-requested order.

What changes are included in this PR?

Just fixing the bug as explained above, and adding test. Also addressing two small comments from post-merge review: #8715 (review)

Are these changes tested?

Yes.

Are there any user-facing changes?

No, this wasn't released yet.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 18, 2025
@vustef
Copy link
Contributor Author

vustef commented Nov 18, 2025

@scovich @alamb this is a bug fix on top of previous PR. I hope we can review it in time for 57.1.0 release candidate that was supposed to be carved out yesterday.

/// # Ok(())
/// # }
/// ```
pub fn with_virtual_columns(self, virtual_columns: Vec<FieldRef>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also made this comment in the original PR, but maybe we could change this to

pub fn set_virtual_columns(&mut self, virtual_columns: Vec<FieldRef>) -> Result<()> {

to avoid the panic. If we squeak this in before 57.1.0 ships it wouldn't be an API change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attaching a link for easier discussion: #8715 (comment)

Not having an ability to chain seems a bit degrading, though users can just use ? operator for that (in which case it just panics, but that's fine, it's a choice).
Having a proper builder would be nice, but it's then definitely an API change, and a bit less quick fix for the main thing that this PR is intended to solve.

I'm fine with changing to Result(). Would you prefer that? Are others on board?

Copy link
Contributor

@scovich scovich Nov 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is Result<Self> bad, sorry? Call it try_with_xxx if naming conventions are the problem?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was upthumbing

Not having an ability to chain seems a bit degrading, though users can just use ? operator for that

Meaning I'd be on board with Result<Self>.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with changing to Result(). Would you prefer that? Are others on board?

I agree that sounds better to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

)
})?;

let offset = ordinal_to_offset.get(&ordinal).ok_or_else(|| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tiny nit, to simplify L68 below?

Suggested change
let offset = ordinal_to_offset.get(&ordinal).ok_or_else(|| {
let offset = *ordinal_to_offset.get(&ordinal).ok_or_else(|| {

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @vustef -- I think this is a nice fix. I also wrote another "end to end" style test (passing in row numbers to the reader options) which I will push directly to this branch

Thank you @scovich for finding this bug before release 🙏

Given the suggestions from @etseidl and @scovich about avoiding panics, which seems reasonable, I will try and make a follow on PR to do that, as I think this one stands alone as is (no reason to also include the API change there too)

Update: PR is

@alamb alamb changed the title Fix RowNumberReader Fix RowNumberReader when not all row groups are selected Nov 18, 2025
@alamb alamb merged commit c45d238 into apache:main Nov 18, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RowNumber reader has wrong row group ordering

4 participants