
Conversation

@CTTY (Contributor) commented Jul 23, 2025

Which issue does this PR close?

What changes are included in this PR?

  • Added RollingFileWriter
  • Fixed some minor typos in the writer mod

Are these changes tested?

Added unit tests.

@liurenjie1024 (Contributor) left a comment

Thanks @CTTY for this PR, just finished the first round of review.

if self.should_roll(input_size) {
    if let Some(inner) = self.inner.take() {
        // close the current writer, roll to a new file
        let handle = spawn(async move { inner.close().await });
Contributor

This is an interesting optimization, but I would suggest not doing it for now. A writer usually consumes resources like memory, connections, etc. Closing writers asynchronously makes behavior hard to reason about in production; for example, we may accumulate many unclosed writers that consume a lot of memory and lead to an OOM.
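
For illustration, a minimal sketch of the inline alternative being suggested here, assuming hypothetical inner_builder and data_file_builders fields and that close() returns the finished DataFileBuilders (not the PR's actual code):

if self.should_roll(input_size) {
    if let Some(inner) = self.inner.take() {
        // await close() directly instead of spawning a background task,
        // so at most one writer holds resources at a time
        self.data_file_builders.extend(inner.close().await?);
    }
    // roll to a fresh inner writer for the incoming batch
    self.inner = Some(self.inner_builder.clone().build().await?);
}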

@CTTY (Contributor Author) commented Jul 24, 2025

Good point!

I have a rough idea to further improve this: we could use a config to control the maximum parallelism here:

struct RollingWriter {
    // ...
    close_handles: Vec<JoinHandle<Result<Vec<DataFileBuilder>>>>,
    buffer: Vec<DataFileBuilder>,
}

// ...
while self.close_handles.len() >= self.max_parallelism() {
    // wait until one of the closers completes, and store its data files in the buffer
    let (closed, _idx, remaining) = future::select_all(std::mem::take(&mut self.close_handles)).await;
    self.buffer.extend(closed??);
    self.close_handles = remaining;
}
self.close_handles.push(new_handle);

I haven't thought through how to prevent the buffer from eating up memory yet, or whether we even need it.

Either way, I agree this can be completed as a follow-up.

Contributor Author

I've removed the close_handles from this PR and created an issue to track this potential optimization: #1551

Contributor

Memory control is a complex topic, and from what I've learned, simply capping the number of in-flight closers doesn't work well when integrated into other systems. I would prefer not to spend too much time on this for now.


impl<B: FileWriterBuilder> FileWriter for RollingFileWriter<B> {
    async fn write(&mut self, input: &RecordBatch) -> Result<()> {
        let input_size = input.get_array_memory_size();
Contributor

This is incorrect: the written size of a parquet file is usually much smaller than Arrow's in-memory array size, since parquet does a lot of compression. The target_size is not meant for exact control, so it's fine for a written file to be a little larger than this size.

Contributor Author

You are right, I added a comment to explain this.

Contributor

No, I mean we should not use this input_size to determine whether we should roll, only the writer's current_written_size.
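
For reference, a rough sketch of the check being suggested, assuming the inner writer exposes a current_written_size() accessor and the target size comes from configuration (names are illustrative):

fn should_roll(&self) -> bool {
    // decide based on bytes the current file has actually accumulated,
    // not on the in-memory size of the incoming batch
    self.inner
        .as_ref()
        .map(|writer| writer.current_written_size() >= self.target_file_size)
        .unwrap_or(false)
}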

@CTTY (Contributor Author) commented Jul 25, 2025

Hi @liurenjie1024, after testing with the suggested changes, I found an interesting issue:

tl;dr: the existing ParquetWriter only reports the correct current_written_size when it's closing and flushing data, not while writing data.

This can cause the following case to fail:

let mut writer: RollingWriter = ...;

// should create 1 file, but won't update current_written_size
// because write() never closes the inner writer
writer.write(&batch1).await?;

// this write should roll over to a new file, but since inner.current_written_size was never updated,
// it will write the data to the same file as the previous batch
writer.write(&batch2).await?;

A more detailed analysis:

  • ParquetWriter uses AsyncArrowWriter as its inner writer
  • AsyncArrowWriter has an async_writer (ArrowRowGroupWriter) and a sync_writer (TrackWriter in this case)
  • AsyncArrowWriter's sync_writer buffers rows based on the config value max_row_group_size (default is 1024 x 1024), so TrackWriter can't see the buffered data until the writer is closed

Basically this issue can happen a lot when the max_row_group_size is large and the target_file_size is small.

To fix this, I think we'll need to change ParquetWriter's implementation of current_file_size() and use AsyncArrowWriter's in_progress_size to take buffered data into account. But again, in_progress_size is the in-memory size, not the physical size.

Contributor

To fix this, I think we'll need to change ParquetWriter's implementation of current_file_size() and use AsyncArrowWriter's in_progress_size to take buffered data into account. But again, in_progress_size is the in-memory size, not the physical size.

This sounds reasonable to me. According to the doc, in_progress_size + bytes_written seems like a better estimate of the current file size. Due to parquet's complex encoding, it's hard to get an accurate file size before finishing a row group, so an estimate is good enough.
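
For reference, a minimal sketch of that estimate, assuming ParquetWriter keeps the parquet crate's AsyncArrowWriter in a hypothetical inner field (illustrative, not the actual fix):

fn current_file_size(&self) -> usize {
    // bytes already flushed to the file plus the estimated size of the
    // in-progress (unflushed) row group; an estimate, not the exact physical size
    self.inner.bytes_written() + self.inner.in_progress_size()
}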

Contributor Author

I've created an issue and will fix the parquet writer behavior in a separate PR.

@CTTY (Contributor Author) commented Jul 28, 2025

I've manually tested the rolling writer using some small data with the ParquetWriter::current_written_size fix. Now the generated file size is much closer to the configured target_file_size.

The difference between the configured target_file_size and the actual written file size can vary depending on the batch size. Generally speaking, the rolling file writer rolls more precisely when the target file size is much larger than the size of each batch, which is expected.

| target_file_size | batch_rows | batch_size_in_memory (Bytes) | Written File Size (Bytes) |
| --- | --- | --- | --- |
| 10 MB | 500K | 8194584 | 12968410 |
| 1 MB | 50K | 662424 | 1077332 |
| 300 KB | 10K | 145816 | 312345 |
| 30 KB | 1K | 16440 | 29595 |
| 30 KB | 500 | 8360 | 28800 |
| 30 KB | 100 | 1576 | 28218 |

@CTTY force-pushed the ctty/rolling-writer branch from ed6b0eb to ac22f27 on July 28, 2025 22:46
@yingjianwu98 (Contributor) commented

Wondering what your plan is to make RollingFileWriter partition-aware.
It looks like the parquet writer right now is also not partition-aware?

@CTTY (Contributor Author) commented Jul 29, 2025

Hi @stevie9868, I hope my reply in a different thread can answer your question.

@liurenjie1024 (Contributor) commented

Thanks @CTTY for the tests. The differences are acceptable to me.

@liurenjie1024 (Contributor) left a comment

Thanks @CTTY for this PR!

@liurenjie1024 merged commit 3ab45ee into apache:main on Jul 29, 2025
17 checks passed
@CTTY deleted the ctty/rolling-writer branch on July 29, 2025 17:26
Successfully merging this pull request may close these issues:

Implement RollingFileWriter: Helps split incoming data into multiple files