-
Notifications
You must be signed in to change notification settings - Fork 344
refactor(writer): Refactor writers for the future partitioning writers #1657
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ad66fa5 to
ac264fc
Compare
|
Hi, @CTTY Seems this is not updated following our discussion? |
|
Hi @liurenjie1024 , do you mean that we should also include |
Hi, @CTTY I'm not saying we should include |
ac264fc to
2ac588f
Compare
| // /// | ||
| // /// Once a partition has been written to and closed, any further attempts | ||
| // /// to write to that partition will result in an error. | ||
| // pub struct ClusteredWriter<B: IcebergWriterBuilder, I: Default + Send = DefaultInput, O: Default + Send = DefaultOutput> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please ignore this for now, I think it's better to keep this draft/round of changes focused on the interfaces changes with existing writer
liurenjie1024
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @CTTY for this pr. I think we are on the right track.
d887733 to
4532f1e
Compare
liurenjie1024
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @CTTY for this pr! Generally LGTM, left some comments.
crates/iceberg/src/writer/base_writer/equality_delete_writer.rs
Outdated
Show resolved
Hide resolved
| file_name_generator: F, | ||
| } | ||
|
|
||
| impl<B: FileWriterBuilder, L: LocationGenerator, F: FileNameGenerator> Clone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need to implement clone? This is a stateful struct since it contains generated data files, the semantics of clone is unclear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because IcebergWriterBuilder trait requires Clone, so DataFileWriterBuilder and the inner RollingFileWriter have to derive Clone. The Clone here is not cloning the writer with its states like data files, but rather a helper to make building new data file writers easier.
I'm thinking maybe it's better to bring back the RollingFileWriterBuilder, so we can have something like:
pub struct DataFileWriterBuilder<B: FileWriterBuilder, L: LocationGenerator, F: FileNameGenerator> {
// RollingFileWriter won't need to implement Clone because it's Option
inner: Option<RollingFileWriter>,
// builder derives Clone and only contains stateless objs like target_file_size, file_io, location_gen, etc.
inner_builder: RollingFileWriterBuilder<B, L, F>,
partition_key: Option<PartitionKey>,
}
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or another option is to remove Clone in IcebergWriterBuilder? I don't see why we need to clone IcebergWriterBuilder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having Clone can be useful in the upcoming PartitioningWriter level when we need to spawn new writers with the same configuration. for example, if we have a fanout partitioning writer, and we will need to create a new writer whenever there is a new partition coming in.
With Clone, it would be simple:
let new_writer = self.iceberg_writer_builder.clone().build()?; // iceberg_writer can be generic typeWithout Clone, we will need to re-populate the IcebergWriterBuilder and the inner writers all over again:
let parquet_writer = ...;
let new_rolling_writer = RollingFileWriter::new(parquet_writer, /* pass down objs like file_io, target_file_size... */);
let iceberg_writer_builder = DataFileWriterBuilder::new(...); // this has to be concrete type
let new_writer = iceberg_writer_builder.build()?;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect, clone means creating a new one with exactly same data. But you are creating a new IcebergWriterBuilder with following changes:
- Partition key change
- Rolling file writer changed
I don't think it's reasonable to put it in Clone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree, the point is we still have to be able to clone all the writers via builders and I think bringing the RollingFileWriterBuilder back and removing this Clone implementation is the correct way to go. I can make the change tomorrow and we can review that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the code to bring back the RollingFileWriterBuilder and removed the Clone impl, it should make much more sense now :D
crates/iceberg/src/writer/base_writer/equality_delete_writer.rs
Outdated
Show resolved
Hide resolved
liurenjie1024
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @CTTY for this pr, LGTM!
Which issue does this PR close?
What changes are included in this PR?
Refactored the writer layers; from a bird’s-eye view, the structure now looks like this:
flowchart TD subgraph PartitioningWriter PW[PartitioningWriter] subgraph DataFileWriter RW[DataFileWriter] subgraph RollingWriter DFW[RollingWriter] subgraph FileWriter FW[FileWriter] end DFW --> FW end RW --> DFW end PW --> RW endKey Changes
RollingFileWriterto handle location generator, file name generator, and partition keys directlyParquetWriterBuilderinterface to accept output files during buildDataFileWriterBuilderto useRollingFileWriterwith partition keysTaskWriter->PartitioningWriter->RollingWriter-> ..., butTaskWriterandPartitioningWriterare not included in this draft so farAre these changes tested?
Not yet, but changing the existing tests accordingly should be enough