Closed
Labels
epic
Description
What's the feature you are trying to implement?
Rationale
Some organizations want to use Iceberg as their data lake, but don't have the desire to run Spark
alongside every catalog deployment. Rust seems like a good target for "single-node" table
maintenance operations.
This issue lays out a plan for implementing some of the standard table maintenance tasks defined
in the Iceberg docs: https://iceberg.apache.org/docs/1.9.1/maintenance/
Design Principles
- Follow the API and implementation conventions set by Spark operations. Where possible, follow the
existing API conventions laid out by Spark procedures. Don't invent a new convention.
- Incrementalize work where possible. Each operation will run on a single node. Since single-node
memory and disk are limited, the Rust implementation will "incrementalize" work by breaking
operations down into smaller chunks that can be committed / completed. For example: the Spark
implementation of "DeleteOrphanFiles" first gathers all files to be deleted, and only once
all files have been gathered does it delete them concurrently. These steps run sequentially.
In a single-node situation, for large tables, this operation may fail due to memory availability,
potentially after running for a long time and gathering up files. The Rust implementation of the
same maintenance operation will provide options to delete files as they are identified rather
than at the end of the job. There is some precedent for this with the `partial-progress.max-commits`
configuration option in the "RewriteDataFiles" operation.
- Develop a low-level API that can be compiled into a binary. Unlike Spark, where FileIOs can
be configured completely through configuration options, this will provide lower-level primitives
in the form of traits. Configuring FileIOs via, for example, a configuration file is a separate
concern.
- Allow for extensibility. Use traits with `dyn X` arguments to ensure that these operations
work with current and future FileIOs.
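To make the "incrementalize" and `dyn X` principles concrete, here is a minimal sketch of orphan-file removal that deletes in bounded batches as files are identified, instead of gathering the full list first. Everything named here is hypothetical and for illustration only: `FileDeleter`, `remove_orphans_incrementally`, and `RecordingDeleter` are not part of iceberg-rust or its `FileIO` API. The sketch also assumes the set of referenced files fits in memory and omits the "older than" safety threshold that a real implementation would need.

```rust
use std::collections::HashSet;

/// Hypothetical storage abstraction (an assumption for this sketch,
/// not the iceberg-rust `FileIO` API).
pub trait FileDeleter {
    fn delete(&mut self, path: &str) -> Result<(), String>;
}

/// Walks `listed` lazily and deletes unreferenced files in batches of at
/// most `batch_size`, so memory use is bounded by the batch rather than by
/// the total number of orphans. Returns the number of files deleted.
pub fn remove_orphans_incrementally(
    listed: impl Iterator<Item = String>,
    referenced: &HashSet<String>,
    deleter: &mut dyn FileDeleter,
    batch_size: usize,
) -> Result<usize, String> {
    let batch_size = batch_size.max(1); // guard against a zero batch size
    let mut batch: Vec<String> = Vec::with_capacity(batch_size);
    let mut deleted = 0;

    for path in listed {
        if referenced.contains(&path) {
            continue; // still referenced by table metadata, keep it
        }
        batch.push(path);
        if batch.len() == batch_size {
            deleted += flush(&mut batch, deleter)?;
        }
    }
    // Delete whatever is left in the final, possibly partial, batch.
    deleted += flush(&mut batch, deleter)?;
    Ok(deleted)
}

/// Deletes every path in `batch`, leaving it empty for reuse.
fn flush(batch: &mut Vec<String>, deleter: &mut dyn FileDeleter) -> Result<usize, String> {
    let count = batch.len();
    for path in batch.drain(..) {
        deleter.delete(&path)?;
    }
    Ok(count)
}

/// In-memory deleter that only records paths; stands in for real storage.
#[derive(Default)]
pub struct RecordingDeleter {
    pub deleted: Vec<String>,
}

impl FileDeleter for RecordingDeleter {
    fn delete(&mut self, path: &str) -> Result<(), String> {
        self.deleted.push(path.to_string());
        Ok(())
    }
}
```

Because the function takes `&mut dyn FileDeleter`, any storage backend that implements the trait can be passed in without changing the maintenance logic. If a delete fails partway through, files removed in earlier batches stay removed, which is the partial-progress behavior described above.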
Implementation
This initial implementation will focus on three maintenance tasks:
- Expire Snapshots (tracked in #1454, "Maintenance: Expire Snapshots Action")
- Rewrite Manifests
- Remove Orphan Files
Other operations are well-suited to re-implementation in Rust, but these are (in my view) the
critical baseline operations that must run to keep the Iceberg metadata in a healthy state.
I will open separate issues for each operation and attach PRs.
Willingness to contribute
I can contribute to this feature independently
CTTY