Skip to content

Exploring Enhanced Compaction Support in Rust #657

@amitgilad3

Description

@amitgilad3

This discussion is related to issues #624 and #607. I have been investigating the compaction process in the Rust library, specifically comparing it to the Java implementation using Spark. During this investigation, I noticed a difference in how the FileScanTask class is handled between the two implementations.

In the Java version, the FileScanTask includes :

  1. DataFile object , which provides crucial information about partitions and specId, content. This information is necessary for the rewrite process in compaction. However, I am aware that @sdd previously raised a valid concern regarding the inclusion of this data in the FileScanTask(in this issue refactor: Store DataFile in FileScanTask instead #607 (comment))
  2. List - which is used to remove the necessary rows from existing files.

I would like to explore the preferred approach for adding the necessary data to facilitate the implementation of compaction in the Rust library. Here are a few potential options I am considering:

  1. Add the fields DataFile & List to FileScanTask.
  2. Propose a new API - that returns a more informative version (perhaps FileScanPlan?) of FileScanTask, which includes the required data but is not serializable.
  3. Other possible solutions? - I am open to suggestions on alternative approaches.

I Also tried to map the logic that is going on in the java + spark implementation to help us understand the flow in the hopes that we can do the same with rust and datafusion and maybe comet

Would love to get your input @sdd @Xuanwo & @ZENOTME

compaction-RewriteDataFilesSparkAction Diagram drawio (1)

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions