Conversation

@CTTY CTTY commented Sep 7, 2025

Which issue does this PR close?

What changes are included in this PR?

Refactored the writer layers; from a bird’s-eye view, the structure now looks like this:

flowchart TD
    subgraph PartitioningWriter
        PW[PartitioningWriter]

        subgraph DataFileWriter
            RW[DataFileWriter]

            subgraph RollingWriter
                DFW[RollingWriter]

                subgraph FileWriter
                    FW[FileWriter]
                end

                DFW --> FW
            end

            RW --> DFW
        end

        PW --> RW
    end


Key Changes

  • Modified RollingFileWriter to handle location generator, file name generator, and partition keys directly
  • Simplified ParquetWriterBuilder interface to accept output files during build
  • Restructured DataFileWriterBuilder to use RollingFileWriter with partition keys
  • Updated DataFusion integration to work with the new writer architecture
  • NOTE: Technically, DataFusion (or any engine) should use TaskWriter -> PartitioningWriter -> RollingWriter -> ..., but TaskWriter and PartitioningWriter are not included in this draft yet
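
To make the first two bullets concrete, here is a minimal sketch of a rolling writer that owns the location generator, file name generator, and partition, and hands a finished output path to the file writer builder at build time. Every trait, struct, and method name below is a simplified, hypothetical stand-in (a `String` replaces the real output file and partition key types), so this should be read as an illustration of the idea rather than the code in this PR.

```rust
// All names below are simplified, hypothetical stand-ins, not the iceberg-rust API.

/// Decides where a data file lives, optionally under a partition directory.
trait LocationGenerator {
    fn generate_location(&self, partition: Option<&str>, file_name: &str) -> String;
}

/// Hands out a fresh file name on every call.
trait FileNameGenerator {
    fn generate_file_name(&mut self) -> String;
}

/// Format-level writer bound to exactly one output path.
trait FileWriter {
    fn write(&mut self, rows: usize);
    fn bytes_written(&self) -> usize;
    fn close(self) -> String; // path of the finished file
}

/// The builder receives the output path at build time instead of owning
/// the location and file name generators itself.
trait FileWriterBuilder {
    type W: FileWriter;
    fn build(&self, output_path: String) -> Self::W;
}

struct TableLocation { base: String }
impl LocationGenerator for TableLocation {
    fn generate_location(&self, partition: Option<&str>, file_name: &str) -> String {
        match partition {
            Some(p) => format!("{}/{}/{}", self.base, p, file_name),
            None => format!("{}/{}", self.base, file_name),
        }
    }
}

struct SequentialNames { next: u32 }
impl FileNameGenerator for SequentialNames {
    fn generate_file_name(&mut self) -> String {
        let name = format!("part-{:05}.parquet", self.next);
        self.next += 1;
        name
    }
}

struct FakeFileWriter { path: String, bytes: usize }
impl FileWriter for FakeFileWriter {
    fn write(&mut self, rows: usize) { self.bytes += rows * 8 } // pretend 8 bytes per row
    fn bytes_written(&self) -> usize { self.bytes }
    fn close(self) -> String { self.path }
}

struct FakeFileWriterBuilder;
impl FileWriterBuilder for FakeFileWriterBuilder {
    type W = FakeFileWriter;
    fn build(&self, output_path: String) -> FakeFileWriter {
        FakeFileWriter { path: output_path, bytes: 0 }
    }
}

/// Owns the generators and the partition, and rolls to a new file once the
/// current one reaches `target_file_size`.
struct RollingFileWriter<B: FileWriterBuilder, L: LocationGenerator, F: FileNameGenerator> {
    builder: B,
    location_gen: L,
    name_gen: F,
    partition: Option<String>,
    target_file_size: usize,
    current: Option<B::W>,
    finished: Vec<String>,
}

impl<B: FileWriterBuilder, L: LocationGenerator, F: FileNameGenerator> RollingFileWriter<B, L, F> {
    fn write(&mut self, rows: usize) {
        let full = self.current.as_ref().map_or(false, |w| w.bytes_written() >= self.target_file_size);
        if full {
            if let Some(done) = self.current.take() {
                self.finished.push(done.close());
            }
        }
        if self.current.is_none() {
            // The rolling writer, not the format writer, decides the output path.
            let name = self.name_gen.generate_file_name();
            let path = self.location_gen.generate_location(self.partition.as_deref(), &name);
            self.current = Some(self.builder.build(path));
        }
        if let Some(w) = self.current.as_mut() { w.write(rows); }
    }

    fn close(mut self) -> Vec<String> {
        if let Some(w) = self.current.take() { self.finished.push(w.close()); }
        self.finished
    }
}
```

In this sketch the format writer never sees a generator; it only receives a path, which is the decoupling the linked issue asks for.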

Are these changes tested?

Not yet, but updating the existing tests accordingly should be enough.

@liurenjie1024
Contributor

Hi @CTTY, it seems this has not been updated following our discussion?

@CTTY
Contributor Author

CTTY commented Sep 17, 2025

Hi @liurenjie1024, do you mean that we should also include TaskWriter and have TaskWriter split batches by partition? This draft mainly focuses on refactoring the existing layers and making RollingWriter the top-level writer for now; I haven't incorporated it with an actual partitioning writer or task writer yet. Or do you think it's better to have everything in one draft?

@liurenjie1024
Contributor

> Hi @liurenjie1024, do you mean that we should also include TaskWriter and have TaskWriter split batches by partition? This draft mainly focuses on refactoring the existing layers and making RollingWriter the top-level writer for now; I haven't incorporated it with an actual partitioning writer or task writer yet. Or do you think it's better to have everything in one draft?

Hi @CTTY, I'm not saying we should include TaskWriter. Per our discussion, we should have the following dependency:

PartitionedWriter
        |
DataFileWriter (EqDeleteWriter, PositionDeleteWriter)   -> this layer is IcebergWriter
        |
RollingFileWriter
        |
FileWriter (Parquet, ORC)                               -> this layer is the file format writer
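
As a rough illustration of this dependency chain, the sketch below wires the four layers together with each layer only talking to the one directly beneath it. The traits and structs are hypothetical stand-ins invented for this example (row counts replace real record batches, and `ParquetStandIn` is not a real Parquet writer), so treat it as a picture of the layering under those assumptions.

```rust
// Hypothetical, simplified stand-ins for the layers in the diagram above.

/// File-format layer (Parquet, ORC, ...).
trait FileWriter: Default {
    fn format(&self) -> &'static str;
    fn write(&mut self, rows: usize);
    fn rows(&self) -> usize;
}

#[derive(Default)]
struct ParquetStandIn { rows: usize }
impl FileWriter for ParquetStandIn {
    fn format(&self) -> &'static str { "parquet" }
    fn write(&mut self, rows: usize) { self.rows += rows }
    fn rows(&self) -> usize { self.rows }
}

/// Rolling layer: starts a new file once the current one holds `max_rows`.
struct RollingFileWriter<W: FileWriter> {
    max_rows: usize,
    current: W,
    closed: Vec<W>,
}

impl<W: FileWriter> RollingFileWriter<W> {
    fn new(max_rows: usize) -> Self {
        Self { max_rows, current: W::default(), closed: Vec::new() }
    }
    fn write(&mut self, rows: usize) {
        if self.current.rows() >= self.max_rows {
            // Swap in a fresh file and keep the full one as closed.
            let full = std::mem::take(&mut self.current);
            self.closed.push(full);
        }
        self.current.write(rows);
    }
    fn file_count(&self) -> usize { self.closed.len() + 1 }
}

/// Iceberg-writer layer: knows what kind of content it produces.
trait IcebergWriter {
    fn content(&self) -> &'static str;
    fn write(&mut self, rows: usize);
    fn file_count(&self) -> usize;
}

struct DataFileWriter<W: FileWriter> { inner: RollingFileWriter<W> }
impl<W: FileWriter> IcebergWriter for DataFileWriter<W> {
    fn content(&self) -> &'static str { "data" }
    fn write(&mut self, rows: usize) { self.inner.write(rows) }
    fn file_count(&self) -> usize { self.inner.file_count() }
}

/// Top layer: owns one Iceberg writer here (a real one would keep one per partition).
struct PartitionedWriter<I: IcebergWriter> { writer: I }
impl<I: IcebergWriter> PartitionedWriter<I> {
    fn write(&mut self, rows: usize) { self.writer.write(rows) }
}
```

An equality-delete or position-delete writer would slot in beside `DataFileWriter` by implementing the same `IcebergWriter` trait over the same rolling layer.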

@CTTY CTTY force-pushed the ctty/idk-partition branch from ac264fc to 2ac588f Compare September 21, 2025 03:26
// ///
// /// Once a partition has been written to and closed, any further attempts
// /// to write to that partition will result in an error.
// pub struct ClusteredWriter<B: IcebergWriterBuilder, I: Default + Send = DefaultInput, O: Default + Send = DefaultOutput>
Contributor Author

Please ignore this for now. I think it's better to keep this draft/round of changes focused on the interface changes to the existing writers.

@CTTY CTTY requested a review from liurenjie1024 September 21, 2025 03:37
Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @CTTY for this pr. I think we are on the right track.

@CTTY CTTY force-pushed the ctty/idk-partition branch from d887733 to 4532f1e Compare September 24, 2025 23:56
@CTTY CTTY marked this pull request as ready for review September 25, 2025 15:38
Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @CTTY for this pr! Generally LGTM, left some comments.

file_name_generator: F,
}

impl<B: FileWriterBuilder, L: LocationGenerator, F: FileNameGenerator> Clone
Contributor

Why do we need to implement Clone? This is a stateful struct, since it contains generated data files, so the semantics of Clone are unclear.

Contributor Author

This is because the IcebergWriterBuilder trait requires Clone, so DataFileWriterBuilder and the inner RollingFileWriter have to implement Clone. The Clone here does not clone the writer along with its state (such as data files); it is a helper that makes building new data file writers easier.

I'm thinking maybe it's better to bring back the RollingFileWriterBuilder, so we can have something like:

pub struct DataFileWriterBuilder<B: FileWriterBuilder, L: LocationGenerator, F: FileNameGenerator> {
    // RollingFileWriter won't need to implement Clone because it's an Option
    inner: Option<RollingFileWriter<B, L, F>>,
    // The builder derives Clone and only contains stateless objects like target_file_size, file_io, location_gen, etc.
    inner_builder: RollingFileWriterBuilder<B, L, F>,
    partition_key: Option<PartitionKey>,
}

wdyt?
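
A minimal, self-contained sketch of that split, assuming simplified hypothetical types: the builders carry only configuration and derive `Clone`, while the writers accumulate state and deliberately do not. `String` stands in for `PartitionKey` and for data files, and passing the partition key to `build` is an assumption made for this example, not something the proposal above specifies.

```rust
// Hypothetical, simplified types; String stands in for PartitionKey and data files.

/// Stateless configuration only, so deriving Clone has clear semantics.
#[derive(Clone, Debug, PartialEq)]
struct RollingFileWriterBuilder {
    target_file_size: usize,
    location_prefix: String,
}

impl RollingFileWriterBuilder {
    fn build(self) -> RollingFileWriter {
        RollingFileWriter { config: self, data_files: Vec::new() }
    }
}

/// Stateful: accumulates data files, and deliberately does not implement Clone.
struct RollingFileWriter {
    config: RollingFileWriterBuilder,
    data_files: Vec<String>,
}

#[derive(Clone, Debug, PartialEq)]
struct DataFileWriterBuilder {
    inner_builder: RollingFileWriterBuilder,
}

impl DataFileWriterBuilder {
    /// Assumption: the partition key is supplied per build rather than stored.
    fn build(self, partition_key: Option<String>) -> DataFileWriter {
        DataFileWriter { inner: Some(self.inner_builder.build()), partition_key }
    }
}

struct DataFileWriter {
    inner: Option<RollingFileWriter>, // None once the writer has been closed
    partition_key: Option<String>,
}

impl DataFileWriter {
    /// Records a data file; fails if the writer has already been closed.
    fn write(&mut self, file_name: &str) -> Result<(), String> {
        let inner = self.inner.as_mut().ok_or_else(|| "writer is closed".to_string())?;
        let dir = self.partition_key.as_deref().unwrap_or("unpartitioned");
        inner.data_files.push(format!("{}/{}/{}", inner.config.location_prefix, dir, file_name));
        Ok(())
    }

    /// Takes the inner writer out and returns the files it produced.
    fn close(&mut self) -> Vec<String> {
        self.inner.take().map(|w| w.data_files).unwrap_or_default()
    }
}
```

With this shape, cloning a builder never copies generated data files, because those only exist on the writer side.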

Contributor

Another option is to remove Clone from IcebergWriterBuilder. I don't see why we need to clone IcebergWriterBuilder.

Contributor Author

Having Clone can be useful at the upcoming PartitioningWriter level, where we need to spawn new writers with the same configuration. For example, a fanout partitioning writer needs to create a new writer whenever a new partition comes in.

With Clone, it would be simple:

let new_writer = self.iceberg_writer_builder.clone().build()?; // iceberg_writer can be generic type

Without Clone, we will need to re-populate the IcebergWriterBuilder and the inner writers all over again:

let parquet_writer = ...;
let new_rolling_writer = RollingFileWriter::new(parquet_writer, /* pass down objs like file_io, target_file_size... */);
let iceberg_writer_builder = DataFileWriterBuilder::new(...); // this has to be concrete type
let new_writer = iceberg_writer_builder.build()?;
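
For reference, the fanout case described above could look roughly like the following sketch, where a generic writer clones its builder the first time it sees a partition. The `Writer` and `WriterBuilder` traits, the `CountingBuilder`, and the choice to pass the partition to `build` are all hypothetical and exist only for this example; PartitioningWriter is not part of this PR.

```rust
use std::collections::HashMap;

// Hypothetical traits for illustration; not the iceberg-rust definitions.
trait Writer {
    fn write(&mut self, rows: usize);
    fn rows(&self) -> usize;
}

/// Clone is required so the fanout writer can stamp out one writer per partition.
trait WriterBuilder: Clone {
    type W: Writer;
    fn build(self, partition: &str) -> Self::W;
}

/// Holds configuration only, so cloning it copies no writer state.
#[derive(Clone)]
struct CountingBuilder { target_file_size: usize }

struct CountingWriter { partition: String, target_file_size: usize, rows: usize }

impl Writer for CountingWriter {
    fn write(&mut self, rows: usize) { self.rows += rows }
    fn rows(&self) -> usize { self.rows }
}

impl WriterBuilder for CountingBuilder {
    type W = CountingWriter;
    fn build(self, partition: &str) -> CountingWriter {
        CountingWriter {
            partition: partition.to_string(),
            target_file_size: self.target_file_size,
            rows: 0,
        }
    }
}

/// Keeps one open writer per partition and creates writers lazily.
struct FanoutWriter<B: WriterBuilder> {
    builder: B,
    writers: HashMap<String, B::W>,
}

impl<B: WriterBuilder> FanoutWriter<B> {
    fn new(builder: B) -> Self {
        Self { builder, writers: HashMap::new() }
    }

    fn write(&mut self, partition: &str, rows: usize) {
        let builder = &self.builder;
        self.writers
            .entry(partition.to_string())
            // First batch for this partition: clone the stateless builder and build.
            .or_insert_with(|| builder.clone().build(partition))
            .write(rows);
    }

    fn rows_in(&self, partition: &str) -> Option<usize> {
        self.writers.get(partition).map(|w| w.rows())
    }
}
```

Because `FanoutWriter` is generic over `WriterBuilder`, it never needs to know the concrete builder type, which is the convenience the comment above is pointing at.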

Contributor

This is incorrect: Clone means creating a new value with exactly the same data. But you are creating a new IcebergWriterBuilder with the following changes:

  1. The partition key changes
  2. The rolling file writer changes

I don't think it's reasonable to put that in Clone.

Contributor Author

@CTTY CTTY Sep 30, 2025

Yes, I agree. The point is that we still have to be able to clone all the writers via builders, and I think bringing RollingFileWriterBuilder back and removing this Clone implementation is the right way to go. I can make the change tomorrow and we can review that.

Contributor Author

I have updated the code to bring back RollingFileWriterBuilder and remove the Clone impl; it should make much more sense now :D

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @CTTY for this pr, LGTM!

@liurenjie1024 liurenjie1024 merged commit 42191e9 into apache:main Oct 9, 2025
16 checks passed
@CTTY CTTY deleted the ctty/idk-partition branch October 10, 2025 16:37
Development

Successfully merging this pull request may close these issues.

Decouple ParquetWriter and LocationGenerator

3 participants