Closed as not planned
This discussion relates to issues #624 and #607. I have been investigating the compaction process in the Rust library, specifically comparing it to the Java implementation used with Spark. During this investigation, I noticed a difference in how the FileScanTask class is handled between the two implementations.
In the Java version, the FileScanTask includes:
- a `DataFile` object, which provides crucial information about the partition, `specId`, and `content`. This information is necessary for the rewrite process in compaction. However, I am aware that @sdd previously raised a valid concern regarding the inclusion of this data in the `FileScanTask` (in issue refactor: Store DataFile in FileScanTask instead #607 (comment)).
- a `List<DeleteFile>`, which is used to remove the necessary rows from existing files.
I would like to explore the preferred approach for adding the necessary data to facilitate the implementation of compaction in the Rust library. Here are a few potential options I am considering:
- Add the fields `DataFile` & `List<DeleteFile>` to `FileScanTask`.
- Propose a new API that returns a more informative version (perhaps `FileScanPlan`?) of `FileScanTask`, which includes the required data but is not serializable.
- Other possible solutions? I am open to suggestions on alternative approaches.
I also tried to map out the logic of the Java + Spark implementation to help us understand the flow, in the hope that we can do the same with Rust and DataFusion, and maybe Comet.
