Skip to content

Conversation

lilianm
Copy link
Contributor

@lilianm lilianm commented Sep 1, 2025

Which issue does this PR close?

Rationale for this change

Use ArrowRowGroupWriter helper class for write row group when you use API get_column_writers / append_row_group in ArrowWriter implemented in issue

What changes are included in this PR?

Set public ArrowRowGroupWriter and move memory_size, get_estimated_total_bytes and rows_count from ArrowWriter

Are these changes tested?

Yes

Are there any user-facing changes?

Yes add function in ArrowRowGroupWriter and expose it

@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 1, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @lilianm -- I think this idea seems reasonable to me

If we want to make this a public API, I think we should add some more documentation -- specifically, can we please add a doc test that shows how a user will use the ArrowRowGroupWriter?

Specifically, I am thinking about something like this https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html#example-encoding-two-arrow-arrays-in-parallel

@alamb alamb marked this pull request as draft September 6, 2025 10:19
@alamb
Copy link
Contributor

alamb commented Sep 6, 2025

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

@adamreeve
Copy link
Contributor

I came across this PR via #8162. We'd initially wanted to expose ArrowRowGroupWriter in #7818 to allow access to column writers that can use the internal FileEncryptor to support finer-grained control and multi-threading when writing encrypted Parquet. But we went with a different approach (#8029 and then #8162) to avoid exposing too many details that might be refactored soon, based on https://github.com/apache/arrow-rs/pull/7818/files#r2201592759. Is that still a concern?

And is exposing ArrowRowGroupWriter still necessary? #8162 exposes ArrowRowGroupWriterFactory instead and adds a method to get a vector of ArrowColumnWriter, so this might not be needed.

@alamb
Copy link
Contributor

alamb commented Sep 18, 2025

And is exposing ArrowRowGroupWriter still necessary? #8162 exposes ArrowRowGroupWriterFactory instead and adds a method to get a vector of ArrowColumnWriter, so this might not be needed.

In my opinion the best thing we can do is to write up some examples showing how to write parquet using multiple cores / threads, which will help guide the API design

This is what I was getting at above with asking for doc examples.

Basically, if we are going to add new APIs, we should also have examples showing how they are used which will have the double benefit of

  1. Make sure the new APIs are sufficient for the new usecase
  2. Make it easier for people to find and use them

@alamb
Copy link
Contributor

alamb commented Sep 18, 2025

After more careful review of #8162, I think it should enable all the necessary APIs for multi-threaded writing of parquet with encryption

@lilianm
Copy link
Contributor Author

lilianm commented Sep 18, 2025

@alamb @adamreeve thanks for return i think ticket #8162 it's better approch for me. I will review it and add feedback on it.

@alamb I agree to improve document about multi core/thread writing.

I everybody it's agree to close this ticket and to concentrate effort on ticket #8162

@lilianm
Copy link
Contributor Author

lilianm commented Sep 18, 2025

Sorry i read to fast #8162
@adamreeve I already needed to expose 'ArrowRowGroupWriter' actual change allow to parallelize column encoding but not row group encoding. This can allow to encoding different row group in different thread or dispatch different kind of data (after partitioning) in different row group for improve push down filters.

I think it's better way to expose 'ArrowRowGroupWriter' and add function into_writers
ArrowRowGroupWriter it's helpers structure for use Vec<ArrowColumnWriter>

And for ticket #8162 expose create_row_group_writer in 'ArrowRowGroupWriterFactory' instead create new function create_column_writers

@alamb
Copy link
Contributor

alamb commented Sep 18, 2025

I think it's better way to expose 'ArrowRowGroupWriter' and add function into_writers
ArrowRowGroupWriter it's helpers structure for use Vec<ArrowColumnWriter>

Sorry I put a response to this on a different PR: #8162 (comment)

Basically, I am not sure that ArrowRowGroupWriter adds a lot of value so I am not sure it needs to be pub. Maybe we can make an documentation example showing how parallel writing to row groups / column chunks would look if we made the ArrowRowGroupWriter pub 🤔.

Given how much effort we go through in arrow to keep the API stable, I am hesitant to add anything more to the API than necessary.

I think we can make ArrowRowGroupWriter as a follow on PR / in a follow on release too. The code in #8162 doesn't prevent us from doing so

@rok
Copy link
Member

rok commented Sep 18, 2025

For completeness, here's a parallelized (over columns and row groups) encrypted parquet writer in data fusion PR. It uses ArrowRowGroupWriterFactory in multiple threads to create column writers etc. Getting correct encryptors to multiple threads took us a couple of iterations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Parquet] Expose ArrowRowGroupWriter
4 participants