[Parquet] expose ArrowRowGroupWriter #8260
Thank you for this @lilianm -- I think this idea seems reasonable to me
If we want to make this a public API, I think we should add some more documentation -- specifically, can we please add a doc test that shows how a user will use the ArrowRowGroupWriter?
Specifically, I am thinking about something like this https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html#example-encoding-two-arrow-arrays-in-parallel
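For readers who don't follow the link: the pattern in that example is roughly one worker thread per column, each fed through a channel, with the finished chunks appended to the row group in column order. Below is a minimal std-only sketch of that shape. `ColumnEncoder`, `EncodedChunk`, and `encode_row_group` are hypothetical stand-ins invented for illustration, not types or functions from the parquet crate, and the "encoding" is just little-endian bytes.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for an encoded column chunk (not a parquet type).
#[derive(Debug, PartialEq)]
struct EncodedChunk {
    column: usize,
    bytes: Vec<u8>,
}

/// Hypothetical per-column encoder. It only serializes i32 values as
/// little-endian bytes, standing in for real column encoding.
struct ColumnEncoder {
    column: usize,
    bytes: Vec<u8>,
}

impl ColumnEncoder {
    fn write(&mut self, values: &[i32]) {
        for v in values {
            self.bytes.extend_from_slice(&v.to_le_bytes());
        }
    }

    fn close(self) -> EncodedChunk {
        EncodedChunk { column: self.column, bytes: self.bytes }
    }
}

/// Encode each column on its own thread, then collect the chunks in column order.
fn encode_row_group(columns: Vec<Vec<i32>>) -> Vec<EncodedChunk> {
    // One worker (thread + channel) per column.
    let workers: Vec<_> = (0..columns.len())
        .map(|column| {
            let (send, recv) = mpsc::channel::<Vec<i32>>();
            let handle = thread::spawn(move || {
                let mut encoder = ColumnEncoder { column, bytes: Vec::new() };
                // Receive batches until the sending side is dropped.
                for batch in recv {
                    encoder.write(&batch);
                }
                encoder.close()
            });
            (handle, send)
        })
        .collect();

    // Send each column's data to its worker.
    for ((_, send), data) in workers.iter().zip(columns) {
        send.send(data).unwrap();
    }

    // Dropping the sender ends the worker loop; joining in order keeps
    // the chunks in column order.
    workers
        .into_iter()
        .map(|(handle, send)| {
            drop(send);
            handle.join().unwrap()
        })
        .collect()
}
```

Calling `encode_row_group(vec![vec![1, 2], vec![3]])` returns two chunks, one per input column, in input order. A doc test for ArrowRowGroupWriter would presumably follow the same outline, with the real column writers in place of the stand-in encoder.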
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look.
I came across this PR via #8162. We'd initially wanted to expose And is exposing |
In my opinion the best thing we can do is to write up some examples showing how to write parquet using multiple cores / threads, which will help guide the API design. This is what I was getting at above with asking for doc examples. Basically, if we are going to add new APIs, we should also have examples showing how they are used, which will have the double benefit of
|
After more careful review of #8162, I think it should enable all the necessary APIs for multi-threaded writing of parquet with encryption.
@alamb @adamreeve thanks for the feedback. I think ticket #8162 is the better approach. I will review it and add feedback there. @alamb I agree we should improve the documentation on multi-core/multi-thread writing. If everybody agrees, let's close this ticket and concentrate effort on ticket #8162.
Sorry, I read #8162 too fast. I think it's better to expose 'ArrowRowGroupWriter' and add function And for ticket #8162 expose |
Sorry I put a response to this on a different PR: #8162 (comment) Basically, I am not sure that Given how much effort we go through in arrow to keep the API stable, I am hesitant to add anything more to the API than necessary. I think we can make |
For completeness, here's a parallelized (over columns and row groups) encrypted parquet writer in data fusion PR. It uses |
Which issue does this PR close?
Rationale for this change
Use the ArrowRowGroupWriter helper class to write a row group when using the get_column_writers / append_row_group API in ArrowWriter implemented in issue
What changes are included in this PR?
Make ArrowRowGroupWriter public and move memory_size, get_estimated_total_bytes and rows_count from ArrowWriter
Are these changes tested?
Yes
Are there any user-facing changes?
Yes: functions are added to ArrowRowGroupWriter and it is exposed publicly.