-
Notifications
You must be signed in to change notification settings - Fork 344
Description
Is your feature request related to a problem or challenge?
Hi icy crustaceans,
While working on adding the [ClusteredWriter](https://github.com/apache/iceberg-rust/pull/1646), I noticed that the existing FileWriterBuilder trait is quite limiting for partition-aware writers. The main issue is that the location generator is coupled with the low-level file writer (such as ParquetWriter), which effectively requires ParquetWriter itself to be partition-aware.
In addition, the current FileWriterBuilder trait cannot pass partition information from outer writers to inner writers, making it very difficult to implement partitioning writers:
fn build(self) -> impl Future<Output = Result<Self::R>> + Send;A hacky workaround would be to have partitioning writers use the concrete type ParquetWriter as the inner writer, but I think this will lead to extensibility issues fairly quickly.
Describe the solution you'd like
I see a few aspects of this issue:
- Decouple
LocationGeneratorandParquetWriterby changing theFileWriterBuildertrait
By changing the FileWriterBuilder interface to the following, we can make ParquetWriter focus solely on writing a file without needing to know which partition it’s writing to:
fn build(self, output_file: OutputFile) -> impl Future<Output = Result<Self::R>> + Send;We will also need to adjust IcebergWriterBuilder accordingly:
async fn build(self, output_file: OutputFile) -> Result<Self::R>;- Make the functional writer the outermost layer
After modifying the builder traits, we still need to decide which layer of writers should hold the LocationGenerator. I propose moving functional writers such as RollingFileWriter and the upcoming ClusteredWriter to the outermost layer, and having them own the LocationGenerator.
Currently, the layering looks like this:
// DataFileWriter/IcebergWriter is the outermost layer
DataFileWriter(
RollingFileWriter( // currently implements FileWriter
ParquetWriter()
)
)
My proposal:
SomeFunctionalWriter( // no longer implements FileWriter
DataFileWriter(
ParquetWriter()
)
)
This new design provides several benefits:
-
Clear layering: Functional writers hold the
LocationGeneratorand generate newOutputFiles for the inner writers, while the inner writers focus only on writing a single file without worrying about path or location. -
Flexibility: Functional writers no longer need to implement the
FileWritertrait, making them easier to use and configure. They don’t necessarily need to implement any trait, though we could define something like:FunctionalWriterBuilder { async fn build( self, location_gen: LocationGenerator, filename_gen: FileNameGenerator ) -> Result<Self::R>; }
I’m still exploring and refining this idea, and would appreciate any feedback!
Willingness to contribute
I would be willing to contribute to this feature with guidance from the Iceberg Rust community