
Discussion: make DataFile Serializable && Deserializable #774

@ZENOTME


Context

Making DataFile serializable and deserializable is useful. For example, a distributed compute engine creates multiple writers on multiple machines, writes the data in parallel, and gets DataFiles back as the results. These DataFiles are then sent to a coordinator, which appends them in a transaction. For this to work, DataFile must be serializable and deserializable.
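To make the flow above concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: the two-field `DataFile`, the toy byte format in `to_bytes`/`from_bytes`, and `run_workers` are hypothetical stand-ins, not the iceberg-rust API. Threads and a channel play the role of machines and the network.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical, heavily simplified stand-in for the real DataFile.
#[derive(Debug, Clone, PartialEq)]
pub struct DataFile {
    pub file_path: String,
    pub record_count: u64,
}

impl DataFile {
    // Toy wire format: 8-byte little-endian record count, then the UTF-8 path.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut buf = self.record_count.to_le_bytes().to_vec();
        buf.extend_from_slice(self.file_path.as_bytes());
        buf
    }

    // Returns None if the payload is too short or the path is not valid UTF-8.
    pub fn from_bytes(bytes: &[u8]) -> Option<DataFile> {
        if bytes.len() < 8 {
            return None;
        }
        let mut count = [0u8; 8];
        count.copy_from_slice(&bytes[..8]);
        let file_path = String::from_utf8(bytes[8..].to_vec()).ok()?;
        Some(DataFile {
            file_path,
            record_count: u64::from_le_bytes(count),
        })
    }
}

// Each worker "writes" one file and ships the encoded DataFile to the coordinator.
pub fn run_workers(n: u64) -> Vec<DataFile> {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let handles: Vec<_> = (0..n)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                let file = DataFile {
                    file_path: format!("data/part-{}.parquet", id),
                    record_count: 100 * (id + 1),
                };
                tx.send(file.to_bytes()).expect("coordinator hung up");
            })
        })
        .collect();
    drop(tx); // only the workers' clones keep the channel open now
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    // Coordinator side: decode every payload; a real engine would then
    // append all files in a single transaction.
    let mut files: Vec<DataFile> = rx
        .iter()
        .map(|bytes| DataFile::from_bytes(&bytes).expect("malformed payload"))
        .collect();
    files.sort_by(|a, b| a.file_path.cmp(&b.file_path));
    files
}
```

Calling `run_workers(3)` returns three decoded `DataFile` values sorted by path. The point of the sketch is only that the value crossing the worker/coordinator boundary is bytes, so the type needs a round-trippable encoding.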

Solution

For now, serializing a DataFile is supported through the _serde module, and the DataFile has to be converted to _serde::DataFile first. The interface looks like: pub fn try_from(value: super::DataFile, partition_type: &StructType, is_version_1: bool) -> _serde::DataFile.
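As a rough illustration of why this shape is awkward for callers, the sketch below mimics the conversion with hypothetical stand-in types. `StructType`, `DataFile`, `SerdeDataFile`, `legacy_field`, and `to_serde` are all invented here and far simpler than the real ones; the only thing taken from the issue is the argument list (the value, a partition type, and a version flag).

```rust
// Hypothetical stand-ins; the real types carry many more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct StructType {
    pub field_names: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct DataFile {
    pub file_path: String,
    pub partition: Vec<i64>, // positional partition values
}

// Stand-in for the serializable form: partition values are paired with
// field names, which is why the partition type is needed to build it.
#[derive(Debug, Clone, PartialEq)]
pub struct SerdeDataFile {
    pub file_path: String,
    pub partition: Vec<(String, i64)>,
    pub legacy_field: Option<i64>, // illustrative: only populated for version 1
}

// The caller must supply context that the DataFile does not carry itself.
pub fn to_serde(
    value: DataFile,
    partition_type: &StructType,
    is_version_1: bool,
) -> Result<SerdeDataFile, String> {
    if value.partition.len() != partition_type.field_names.len() {
        return Err(format!(
            "partition has {} values but the type has {} fields",
            value.partition.len(),
            partition_type.field_names.len()
        ));
    }
    let partition = partition_type
        .field_names
        .iter()
        .cloned()
        .zip(value.partition)
        .collect();
    Ok(SerdeDataFile {
        file_path: value.file_path,
        partition,
        legacy_field: if is_version_1 { Some(0) } else { None },
    })
}
```

In this sketch a worker that only holds a `DataFile` cannot call `to_serde` unless the partition type and format version are passed along separately, and passing a mismatched partition type fails at conversion time.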

A few things need to be resolved to support a serializable and deserializable DataFile:

  1. The related interface needs to be made public.
  2. The interface is not friendly. If DataFile were self-contained, things would be easier: DataFile itself could implement Serialize and Deserialize, and the user would not need to convert it through an interface like pub fn try_from(value: super::DataFile, partition_type: &StructType, is_version_1: bool) -> _serde::DataFile.

To solve the above, I think there are two solutions:

  1. Make DataFile self-contained: store the partition type and version in DataFile directly, so that it can be converted into _serde::DataFile without extra arguments and can implement Serialize and Deserialize.
  2. Provide something like
struct SerializableDataFile {
  version: i32,
  partition_type: StructType,
  data_file: DataFile
}
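The two options can be compared side by side in a small sketch. All types here are simplified, hypothetical stand-ins; `SelfContainedDataFile`, `named_partition`, and `name_partition` are names invented for illustration, and only the field layout of `SerializableDataFile` comes from the proposal above.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct StructType {
    pub field_names: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct DataFile {
    pub file_path: String,
    pub partition: Vec<i64>,
}

// Pair positional partition values with their field names.
fn name_partition(values: &[i64], ty: &StructType) -> Vec<(String, i64)> {
    ty.field_names
        .iter()
        .cloned()
        .zip(values.iter().copied())
        .collect()
}

// Solution 1 (sketch): the data file itself remembers its partition type
// and format version, so no external context is needed to serialize it.
#[derive(Debug, Clone, PartialEq)]
pub struct SelfContainedDataFile {
    pub file_path: String,
    pub partition: Vec<i64>,
    pub partition_type: StructType,
    pub version: i32,
}

impl SelfContainedDataFile {
    pub fn named_partition(&self) -> Vec<(String, i64)> {
        name_partition(&self.partition, &self.partition_type)
    }
}

// Solution 2 (sketch): DataFile stays unchanged and a wrapper carries the context.
#[derive(Debug, Clone, PartialEq)]
pub struct SerializableDataFile {
    pub version: i32,
    pub partition_type: StructType,
    pub data_file: DataFile,
}

impl SerializableDataFile {
    pub fn named_partition(&self) -> Vec<(String, i64)> {
        name_partition(&self.data_file.partition, &self.partition_type)
    }

    // The coordinator unwraps the plain DataFile after receiving it.
    pub fn into_data_file(self) -> DataFile {
        self.data_file
    }
}
```

Both shapes carry the same information across the wire. In this sketch the difference is where it lives: solution 1 adds the fields to every data file value, while solution 2 keeps DataFile lean at the cost of an extra type and an unwrap step on the receiving side.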

I prefer solution 1 because it looks more natural. Different opinions and solutions are welcome. cc @liurenjie1024 @Fokko @Xuanwo @c-thiel
