-
Notifications
You must be signed in to change notification settings - Fork 1k
Support multi-threaded writing of Parquet files with modular encryption #7818
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
c2ac2d9
68275fa
2d42ca0
b57a5d4
b9396cc
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -457,7 +457,7 @@ pub struct WriterPropertiesBuilder { | |
|
||
impl WriterPropertiesBuilder { | ||
/// Returns default state of the builder. | ||
fn with_defaults() -> Self { | ||
pub fn with_defaults() -> Self { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We don't need to make this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Or |
||
Self { | ||
data_page_size_limit: DEFAULT_PAGE_SIZE, | ||
data_page_row_count_limit: DEFAULT_DATA_PAGE_ROW_COUNT_LIMIT, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we can't use the existing low level APIs documented here: https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html#example-encoding-two-arrow-arrays-in-parallel
In other words, do we really need to make what has previously been an implementation detail part of the public API?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick response @alamb ! :)
Looking at the example - we have to pass a single private
FileEncryptor
object (it has random state) to allArrowColumnWriter
s,SerializedRowGroupWriter
s andSerializedFileWriter
. Perhaps we could hide everything intoArrowWriter
. I'll have to do a quick check. If you have a cleaner idea we'd be most interested!Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think avoiding exposing ArrowRowGroupWriter and the associated machinery that would be good
The rationale is that @XiangpengHao and @zhuqi-lucas and myself are likely to be reworking it in the next few releases and once it is part of the public API changing it becomes harder as it requires more coordination / backwards compatibility concerns