refactor(writer): Refactor writers for the future partitioning writers #1657

CTTY · 2025-09-07T01:58:36Z

Which issue does this PR close?

Closes Decouple ParquetWriter and LocationGenerator #1650

What changes are included in this PR?

Refactored the writer layers; from a bird’s-eye view, the structure now looks like this:

flowchart TD
    subgraph PartitioningWriter
        PW[PartitioningWriter]

        subgraph DataFileWriter
            RW[DataFileWriter]

            subgraph RollingWriter
                DFW[RollingWriter]

                subgraph FileWriter
                    FW[FileWriter]
                end

                DFW --> FW
            end

            RW --> DFW
        end

        PW --> RW
    end

Key Changes

Modified RollingFileWriter to handle location generator, file name generator, and partition keys directly
Simplified ParquetWriterBuilder interface to accept output files during build
Restructured DataFileWriterBuilder to use RollingFileWriter with partition keys
Updated DataFusion integration to work with the new writer architecture
NOTE: Technically DataFusion or any engine should use TaskWriter -> PartitioningWriter -> RollingWriter -> ..., but TaskWriter and PartitioningWriter are not included in this draft so far
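
To make the first key change more concrete, here is a minimal sketch of a rolling layer that owns location and file name generation itself and starts a new file once a size target is reached. Every type and name below (`InMemoryFileWriter`, `SequentialNames`, `RollingWriter`, the path layout) is a hypothetical stand-in for illustration and is not the crate's actual API.

```rust
/// Hypothetical stand-in for a file format writer such as a Parquet writer.
struct InMemoryFileWriter {
    path: String,
    bytes_written: usize,
}

/// Hypothetical file name generator: part-00000.parquet, part-00001.parquet, ...
struct SequentialNames {
    next: usize,
}

impl SequentialNames {
    fn generate(&mut self) -> String {
        let name = format!("part-{:05}.parquet", self.next);
        self.next += 1;
        name
    }
}

/// Rolling layer: owns the location prefix, the name generator and the
/// partition path, and opens a new file once the current one is full.
struct RollingWriter {
    table_location: String,
    partition_path: Option<String>,
    names: SequentialNames,
    target_file_size: usize,
    current: Option<InMemoryFileWriter>,
    closed_files: Vec<(String, usize)>,
}

impl RollingWriter {
    fn new(table_location: &str, partition_path: Option<&str>, target_file_size: usize) -> Self {
        RollingWriter {
            table_location: table_location.to_string(),
            partition_path: partition_path.map(str::to_string),
            names: SequentialNames { next: 0 },
            target_file_size,
            current: None,
            closed_files: Vec::new(),
        }
    }

    /// Builds the next output path (assumed layout: <table>/data/<partition>/<file>).
    fn new_file_path(&mut self) -> String {
        let name = self.names.generate();
        match &self.partition_path {
            Some(p) => format!("{}/data/{}/{}", self.table_location, p, name),
            None => format!("{}/data/{}", self.table_location, name),
        }
    }

    fn write(&mut self, batch: &[u8]) {
        // Roll over first if the current file has already reached the target size.
        let full = match &self.current {
            Some(w) => w.bytes_written >= self.target_file_size,
            None => false,
        };
        if full {
            self.roll();
        }
        if self.current.is_none() {
            let path = self.new_file_path();
            self.current = Some(InMemoryFileWriter { path, bytes_written: 0 });
        }
        if let Some(w) = self.current.as_mut() {
            w.bytes_written += batch.len();
        }
    }

    /// Closes the current file, if any, and records its path and size.
    fn roll(&mut self) {
        if let Some(w) = self.current.take() {
            self.closed_files.push((w.path, w.bytes_written));
        }
    }

    fn close(mut self) -> Vec<(String, usize)> {
        self.roll();
        self.closed_files
    }
}

/// Writes three 4-byte batches with an 8-byte target, so the third batch rolls over.
fn roll_three_batches() -> Vec<(String, usize)> {
    let mut writer = RollingWriter::new("s3://bucket/tbl", Some("day=2025-09-07"), 8);
    for batch in ["aaaa", "bbbb", "cccc"].iter() {
        writer.write(batch.as_bytes());
    }
    writer.close()
}
```

Calling `roll_three_batches()` yields two closed files: the first holds the first two batches, the second holds the third. The point of the sketch is only that the format writer no longer needs to know where files live; the rolling layer decides that.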

Are these changes tested?

Not yet, but changing the existing tests accordingly should be enough

crates/iceberg/src/writer/file_writer/rolling_writer.rs

crates/iceberg/src/writer/mod.rs

liurenjie1024 · 2025-09-16T09:58:40Z

Hi @CTTY, it seems this has not been updated following our discussion?

CTTY · 2025-09-17T00:08:48Z

Hi @liurenjie1024, do you mean that we should also include TaskWriter and have TaskWriter split batches by partition? This draft mainly focuses on refactoring the existing layers and making RollingWriter the top-level writer for now; I haven't incorporated it with an actual partitioning writer or task writer yet. Or do you think it's better to have everything in one draft?

crates/iceberg/src/writer/mod.rs

liurenjie1024 · 2025-09-18T09:22:44Z

Hi @CTTY, I'm not saying we should include TaskWriter. Per our discussion, we should have the following dependency:

PartitionedWriter
        |
DataFileWriter (EqDeleteWriter, PositionDeleteWriter)  -> this layer is the IcebergWriter
        |
RollingFileWriter
        |
FileWriter (Parquet, ORC)  -> this layer is the file format writer
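
As a rough illustration of the lower three layers in this dependency chain, the sketch below wires a format writer into a rolling writer and the rolling writer into a data file writer. It is a simplified, synchronous approximation under stated assumptions: the trait, the `CountingFileWriter` stand-in, the row-count-based rolling and the `DataFile` struct are all hypothetical, and the top `PartitionedWriter` layer is left out.

```rust
/// File format layer (Parquet, ORC): encodes rows into a single file.
trait FileWriter {
    fn write(&mut self, row: &str);
    fn rows_written(&self) -> usize;
    /// Finishes the file and returns its record count.
    fn close(self) -> usize;
}

/// Hypothetical stand-in for a real format writer; it only counts rows.
struct CountingFileWriter {
    rows: usize,
}

impl FileWriter for CountingFileWriter {
    fn write(&mut self, _row: &str) {
        self.rows += 1;
    }
    fn rows_written(&self) -> usize {
        self.rows
    }
    fn close(self) -> usize {
        self.rows
    }
}

/// Rolling layer: decides when to finish one file and start the next.
struct RollingFileWriter<W: FileWriter> {
    new_file: fn() -> W,
    max_rows_per_file: usize,
    current: Option<W>,
    finished: Vec<usize>,
}

impl<W: FileWriter> RollingFileWriter<W> {
    fn write(&mut self, row: &str) {
        let full = match &self.current {
            Some(w) => w.rows_written() >= self.max_rows_per_file,
            None => false,
        };
        if full {
            self.roll();
        }
        // Copy the fn pointer so the closure does not borrow `self`.
        let new_file = self.new_file;
        self.current.get_or_insert_with(new_file).write(row);
    }

    fn roll(&mut self) {
        if let Some(w) = self.current.take() {
            self.finished.push(w.close());
        }
    }

    fn close(mut self) -> Vec<usize> {
        self.roll();
        self.finished
    }
}

/// Table-level description of one finished file (hypothetical, heavily reduced).
#[derive(Debug, PartialEq)]
struct DataFile {
    partition: Option<String>,
    record_count: usize,
}

/// Iceberg writer layer: turns finished files into table-level metadata.
struct DataFileWriter<W: FileWriter> {
    inner: RollingFileWriter<W>,
    partition: Option<String>,
}

impl<W: FileWriter> DataFileWriter<W> {
    fn write(&mut self, row: &str) {
        self.inner.write(row);
    }

    fn close(self) -> Vec<DataFile> {
        let partition = self.partition;
        self.inner
            .close()
            .into_iter()
            .map(|record_count| DataFile { partition: partition.clone(), record_count })
            .collect()
    }
}

/// Writes five rows with at most two rows per file.
fn write_five_rows() -> Vec<DataFile> {
    let rolling: RollingFileWriter<CountingFileWriter> = RollingFileWriter {
        new_file: || CountingFileWriter { rows: 0 },
        max_rows_per_file: 2,
        current: None,
        finished: Vec::new(),
    };
    let mut writer = DataFileWriter {
        inner: rolling,
        partition: Some("day=2025-09-18".to_string()),
    };
    for row in ["r1", "r2", "r3", "r4", "r5"].iter() {
        writer.write(row);
    }
    writer.close()
}
```

Each layer only talks to the one directly below it, which is the property the diagram asks for: the format writer knows nothing about rolling, and the rolling writer knows nothing about partitions or data file metadata.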

CTTY · 2025-09-21T03:35:20Z

crates/iceberg/src/writer/partitioning/clustered.rs

+// ///
+// /// Once a partition has been written to and closed, any further attempts
+// /// to write to that partition will result in an error.
+// pub struct ClusteredWriter<B: IcebergWriterBuilder, I: Default + Send = DefaultInput, O: Default + Send = DefaultOutput>


Please ignore this for now. I think it's better to keep this draft/round of changes focused on the interface changes to the existing writers.

crates/iceberg/src/writer/file_writer/rolling_writer.rs

liurenjie1024

Thanks @CTTY for this pr. I think we are on the right track.

crates/iceberg/src/writer/base_writer/data_file_writer.rs

crates/iceberg/src/writer/file_writer/rolling_writer.rs

crates/iceberg/src/writer/partitioning/mod.rs

liurenjie1024

Thanks @CTTY for this pr! Generally LGTM, left some comments.

crates/iceberg/src/writer/base_writer/data_file_writer.rs

crates/iceberg/src/writer/base_writer/equality_delete_writer.rs

liurenjie1024 · 2025-09-26T08:56:26Z

crates/iceberg/src/writer/file_writer/rolling_writer.rs

+    file_name_generator: F,
+}
+
+impl<B: FileWriterBuilder, L: LocationGenerator, F: FileNameGenerator> Clone


Why do we need to implement Clone? This is a stateful struct, since it contains the generated data files, so the semantics of Clone are unclear.

This is because the IcebergWriterBuilder trait requires Clone, so DataFileWriterBuilder and the inner RollingFileWriter have to implement Clone. The Clone here does not clone the writer together with its state (such as data files); it is a helper that makes building new data file writers easier.

I'm thinking maybe it's better to bring back the RollingFileWriterBuilder, so we can have something like:

pub struct DataFileWriterBuilder<B: FileWriterBuilder, L: LocationGenerator, F: FileNameGenerator> {
    // RollingFileWriter won't need to implement Clone because it's Option
    inner: Option<RollingFileWriter>,
    // builder derives Clone and only contains stateless objs like
    // target_file_size, file_io, location_gen, etc.
    inner_builder: RollingFileWriterBuilder<B, L, F>,
    partition_key: Option<PartitionKey>,
}

wdyt?

Or another option is to remove Clone from IcebergWriterBuilder? I don't see why we need to clone IcebergWriterBuilder.

Having Clone can be useful at the upcoming PartitioningWriter level, where we need to spawn new writers with the same configuration. For example, a fanout partitioning writer needs to create a new writer whenever a new partition comes in.

With Clone, it would be simple:

let new_writer = self.iceberg_writer_builder.clone().build()?; // iceberg_writer can be generic type

Without Clone, we will need to re-populate the IcebergWriterBuilder and the inner writers all over again:

let parquet_writer = ...;
let new_rolling_writer = RollingFileWriter::new(parquet_writer, /* pass down objs like file_io, target_file_size... */);
let iceberg_writer_builder = DataFileWriterBuilder::new(...); // this has to be concrete type
let new_writer = iceberg_writer_builder.build()?;

This is incorrect. Clone means creating a new value with exactly the same data, but here you are creating a new IcebergWriterBuilder with the following changes:

The partition key changes

The rolling file writer changes

I don't think it's reasonable to put that in Clone.

Yes, I agree. The point is that we still have to be able to clone all the writers via their builders, and I think bringing RollingFileWriterBuilder back and removing this Clone implementation is the correct way to go. I can make the change tomorrow and we can review that.

I have updated the code to bring back RollingFileWriterBuilder and remove the Clone impl; it should make much more sense now :D
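
One way to picture where the thread ended up: builders hold only stateless configuration and derive `Clone`, writers hold state and do not, and a fanout-style writer clones the builder to spawn one writer per partition. The sketch below is an assumption-laden illustration, not the merged code: `FanoutWriter`, the string partition key, the `location_prefix` field and the `build(partition_key)` signature are all hypothetical.

```rust
use std::collections::HashMap;

/// Stateless configuration only, so deriving `Clone` has its usual meaning.
#[derive(Clone)]
struct RollingFileWriterBuilder {
    location_prefix: String,
}

/// Stateful writer: deliberately does not implement `Clone`.
struct RollingFileWriter {
    location: String,
    rows: Vec<String>,
}

impl RollingFileWriterBuilder {
    fn build(self, partition_key: &str) -> RollingFileWriter {
        RollingFileWriter {
            location: format!("{}/{}", self.location_prefix, partition_key),
            rows: Vec::new(),
        }
    }
}

/// Also stateless: it wraps the inner builder rather than a live writer.
#[derive(Clone)]
struct DataFileWriterBuilder {
    inner_builder: RollingFileWriterBuilder,
}

struct DataFileWriter {
    inner: RollingFileWriter,
}

impl DataFileWriterBuilder {
    /// The partition key is supplied at build time instead of being stored
    /// in (and mutated on) a cloned writer.
    fn build(self, partition_key: &str) -> DataFileWriter {
        DataFileWriter {
            inner: self.inner_builder.build(partition_key),
        }
    }
}

/// Hypothetical fanout writer: one data file writer per partition seen so far.
struct FanoutWriter {
    builder: DataFileWriterBuilder,
    writers: HashMap<String, DataFileWriter>,
}

impl FanoutWriter {
    fn new(builder: DataFileWriterBuilder) -> Self {
        FanoutWriter { builder, writers: HashMap::new() }
    }

    fn write(&mut self, partition_key: &str, row: &str) {
        let builder = &self.builder;
        let writer = self
            .writers
            .entry(partition_key.to_string())
            // Same configuration, fresh state: clone the builder, then build.
            .or_insert_with(|| builder.clone().build(partition_key));
        writer.inner.rows.push(row.to_string());
    }

    /// Returns (location, row count) per partition, sorted by location.
    fn close(self) -> Vec<(String, usize)> {
        let mut files: Vec<(String, usize)> = self
            .writers
            .into_values()
            .map(|w| (w.inner.location, w.inner.rows.len()))
            .collect();
        files.sort();
        files
    }
}

/// Routes three rows across two partitions.
fn fanout_demo() -> Vec<(String, usize)> {
    let builder = DataFileWriterBuilder {
        inner_builder: RollingFileWriterBuilder {
            location_prefix: "tbl/data".to_string(),
        },
    };
    let mut fanout = FanoutWriter::new(builder);
    fanout.write("a", "r1");
    fanout.write("b", "r2");
    fanout.write("a", "r3");
    fanout.close()
}
```

This keeps the convenience of `builder.clone().build(...)` from the earlier comment while avoiding a `Clone` impl on a type that carries generated data files.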

crates/iceberg/src/writer/base_writer/data_file_writer.rs

crates/iceberg/src/writer/base_writer/equality_delete_writer.rs

liurenjie1024

Thanks @CTTY for this pr, LGTM!

ZENOTME reviewed Sep 7, 2025

crates/iceberg/src/writer/file_writer/rolling_writer.rs (outdated)

CTTY force-pushed the ctty/idk-partition branch from ad66fa5 to ac264fc on September 9, 2025 18:23

CTTY mentioned this pull request Sep 9, 2025

Decouple ParquetWriter and LocationGenerator #1650

Closed

CTTY commented Sep 9, 2025

crates/iceberg/src/writer/mod.rs

liurenjie1024 reviewed Sep 18, 2025

crates/iceberg/src/writer/mod.rs (outdated)

crates/iceberg/src/writer/mod.rs (outdated)

CTTY force-pushed the ctty/idk-partition branch from ac264fc to 2ac588f on September 21, 2025 03:26

CTTY commented Sep 21, 2025

crates/iceberg/src/writer/file_writer/rolling_writer.rs (outdated)

CTTY requested a review from liurenjie1024 September 21, 2025 03:37

liurenjie1024 reviewed Sep 22, 2025

crates/iceberg/src/writer/base_writer/data_file_writer.rs (outdated)

crates/iceberg/src/writer/file_writer/rolling_writer.rs (outdated)

crates/iceberg/src/writer/partitioning/mod.rs (outdated)

CTTY added 5 commits September 24, 2025 16:06

partitionhead

ac01a34

little clean up and add partitioning writer traits

0349d64

some cleanup

91242bd

fix compile issues

4d6e48c

fix test compilation, rebase

4532f1e

CTTY force-pushed the ctty/idk-partition branch from d887733 to 4532f1e on September 24, 2025 23:56

CTTY added 2 commits September 24, 2025 17:36

CI

917c259

Merge branch 'main' into ctty/idk-partition

cdf0ea4

CTTY marked this pull request as ready for review September 25, 2025 15:38

CTTY added 2 commits September 25, 2025 08:56

fix doc

ebe92b5

ci flaky?

9cef787

liurenjie1024 reviewed Sep 26, 2025

CTTY added 3 commits September 26, 2025 16:12

handSome(writer)

ef0aafb

give default values to datafile partition and spec_id

80c34b5

daily clippy fix

02ed7f6

liurenjie1024 reviewed Sep 29, 2025

crates/iceberg/src/writer/base_writer/data_file_writer.rs (outdated)

crates/iceberg/src/writer/base_writer/equality_delete_writer.rs (outdated)

Merge branch 'main' into ctty/idk-partition

b37d37a

expose datafile builder error

b87354e

mnpw mentioned this pull request Sep 30, 2025

Implement fanout partitioned data writer. #1572

Closed

Add rolling writer builder back, remove clone for rolling writer

99ea41b

liurenjie1024 approved these changes Oct 9, 2025

liurenjie1024 merged commit 42191e9 into apache:main Oct 9, 2025
16 checks passed

CTTY deleted the ctty/idk-partition branch October 10, 2025 16:37
