Description
Context
Making DataFile serializable and deserializable is useful. For example, a distributed compute engine creates writers on multiple machines, writes the data in parallel, and gets DataFiles back as the results. These DataFiles are then sent to a coordinator, which appends them using a transaction. For this to work, DataFile must be serializable and deserializable.
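To make the workflow concrete, here is a minimal sketch that uses threads in place of machines and a channel in place of the network. Everything in it is an assumption for illustration: the DataFile struct is a two-field stand-in rather than the real iceberg type, to_wire/from_wire are a toy text encoding standing in for a real serde format, and run_job is a hypothetical helper.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for iceberg's DataFile (only two fields).
#[derive(Debug, Clone, PartialEq)]
struct DataFile {
    file_path: String,
    record_count: u64,
}

impl DataFile {
    /// Toy wire format "path|count"; a real implementation would use serde.
    fn to_wire(&self) -> String {
        format!("{}|{}", self.file_path, self.record_count)
    }

    /// Parses the toy wire format, splitting on the last '|'.
    fn from_wire(s: &str) -> Option<DataFile> {
        let (path, count) = s.rsplit_once('|')?;
        Some(DataFile {
            file_path: path.to_string(),
            record_count: count.parse().ok()?,
        })
    }
}

/// Spawns `writers` threads that each "write" one file and send the
/// serialized DataFile to the coordinator, which deserializes them.
fn run_job(writers: u64) -> Vec<DataFile> {
    let (tx, rx) = mpsc::channel::<String>();
    let handles: Vec<_> = (0..writers)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                let file = DataFile {
                    file_path: format!("data/part-{id}.parquet"),
                    record_count: 100 * (id + 1),
                };
                tx.send(file.to_wire()).expect("coordinator hung up");
            })
        })
        .collect();
    drop(tx); // close the channel once all writer clones are dropped
    for handle in handles {
        handle.join().expect("writer panicked");
    }
    let mut files: Vec<DataFile> = rx
        .iter()
        .filter_map(|msg| DataFile::from_wire(&msg))
        .collect();
    files.sort_by(|a, b| a.file_path.cmp(&b.file_path));
    files
}

fn main() {
    // The coordinator would append these files in a single transaction.
    for file in run_job(3) {
        println!("{file:?}");
    }
}
```

The point of the sketch is the boundary between writer and coordinator: only the serialized form crosses it, so whatever DataFile needs in order to be rebuilt must travel with it.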
Solution
For now, serializing a DataFile is supported only through the _serde module, and the DataFile must first be converted to _serde::DataFile. The interface looks like: pub fn try_from(value: super::DataFile, partition_type: &StructType, is_version_1: bool) -> _serde::DataFile. More detail:
iceberg-rust/crates/iceberg/src/spec/manifest.rs
Line 1361 in 98cd34d
pub fn try_into(
Two things need to be resolved to support a serializable and deserializable DataFile:
- The related interface needs to be made public.
- The interface is not user-friendly. If DataFile were self-contained, things would be easier: DataFile itself could implement Serialize and Deserialize, and the user would not need to convert it through an interface like
pub fn try_from(value: super::DataFile, partition_type: &StructType, is_version_1: bool) -> _serde::DataFile
To solve the above, I think there are two solutions:
- Make DataFile self-contained: store the partition type and version in DataFile itself, so that it can be converted into _serde::DataFile directly and can implement Serialize and Deserialize.
- Provide something like
struct SerializableDataFile {
    version: i32,
    partition_type: StructType,
    data_file: DataFile,
}
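The two options can be compared side by side in a rough sketch. All types here are simplified stand-ins, not the real iceberg definitions: StructType is reduced to a list of field names, SerdeDataFile plays the role of _serde::DataFile, convert mimics the shape of the existing try_from, and SelfContainedDataFile and the to_serde methods are hypothetical names.

```rust
/// Hypothetical stand-in for the partition StructType.
#[derive(Debug, Clone, PartialEq)]
struct StructType {
    field_names: Vec<String>,
}

/// Stand-in for `_serde::DataFile`, the form that can be serialized.
#[derive(Debug, Clone, PartialEq)]
struct SerdeDataFile {
    file_path: String,
    partition_fields: Vec<String>,
    is_version_1: bool,
}

/// Mimics today's shape: the caller supplies partition type and version.
fn convert(file_path: &str, partition_type: &StructType, is_version_1: bool) -> SerdeDataFile {
    SerdeDataFile {
        file_path: file_path.to_string(),
        partition_fields: partition_type.field_names.clone(),
        is_version_1,
    }
}

/// Option 1: the data file carries everything needed for conversion.
struct SelfContainedDataFile {
    file_path: String,
    partition_type: StructType,
    format_version: i32,
}

impl SelfContainedDataFile {
    fn to_serde(&self) -> SerdeDataFile {
        convert(&self.file_path, &self.partition_type, self.format_version == 1)
    }
}

/// Option 2: the data file stays as it is; a wrapper adds the context.
struct DataFile {
    file_path: String,
}

struct SerializableDataFile {
    version: i32,
    partition_type: StructType,
    data_file: DataFile,
}

impl SerializableDataFile {
    fn to_serde(&self) -> SerdeDataFile {
        convert(&self.data_file.file_path, &self.partition_type, self.version == 1)
    }
}

fn main() {
    let partition_type = StructType { field_names: vec!["day".to_string()] };

    let option_1 = SelfContainedDataFile {
        file_path: "data/part-0.parquet".to_string(),
        partition_type: partition_type.clone(),
        format_version: 2,
    };
    let option_2 = SerializableDataFile {
        version: 2,
        partition_type,
        data_file: DataFile { file_path: "data/part-0.parquet".to_string() },
    };

    // Both routes produce the same serializable value without extra arguments.
    assert_eq!(option_1.to_serde(), option_2.to_serde());
}
```

In this sketch both options remove the extra arguments from the call site. The difference is where the context lives: option 1 grows every DataFile by a partition type and a version, while option 2 adds that cost only when a file is about to be sent.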
I prefer solution 1 because it looks more natural. Different opinions and alternative solutions are welcome. cc @liurenjie1024 @Fokko @Xuanwo @c-thiel