Proposal: Implement table maintenance operations #1453

@cmcarthur

What's the feature you are trying to implement?

Rationale

Some organizations want to use Iceberg as their data lake but don't want to run Spark
alongside every catalog deployment. Rust seems like a good target for "single-node" table
maintenance operations.

This issue lays out a plan for implementing some of the standard table maintenance
tasks defined in the Iceberg docs: https://iceberg.apache.org/docs/1.9.1/maintenance/

Design Principles

  1. Follow the API and implementation conventions set by Spark operations. Where possible, reuse
    the conventions already laid out by the Spark procedures instead of inventing new ones.
  2. Incrementalize work where possible. Each operation will run on a single node. Since single-node
    memory and disk are limited, the Rust implementation will "incrementalize" work by breaking
    operations down into smaller chunks that can each be committed or completed on their own. For
    example, the Spark implementation of "DeleteOrphanFiles" first gathers all files to be deleted,
    and only once all files have been gathered does it delete them concurrently. These steps run
    sequentially. On a single node, for large tables, this operation may fail due to memory
    exhaustion, potentially after running for a long time gathering files. The Rust implementation of
    the same maintenance operation will provide options to delete files as they are identified rather
    than at the end of the job. There is some precedent for this in the partial-progress.max-commits
    configuration option of the "RewriteDataFiles" operation.
  3. Develop a low-level API that can be compiled into a binary. Unlike Spark, where FileIOs can
    be configured completely through configuration options, this will provide lower-level primitives
    in the form of traits. Configuring FileIOs (for example, via a configuration file) is a separate
    concern.
  4. Allow for extensibility. Use traits with dyn X arguments to ensure that these operations
    work with current and future FileIOs.
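
To make principles 2 and 4 concrete, here is a minimal sketch of what "delete as you go" behind a trait object could look like. Everything in it is hypothetical and for illustration only: the `FileDeleter` trait, `RecordingDeleter`, `delete_orphans_incrementally`, and the `batch_size` parameter are not part of iceberg-rust or of the Spark API. It also assumes the set of referenced files fits in memory, and it omits safeguards a real implementation would need, such as a minimum file age before deletion.

```rust
use std::collections::HashSet;

/// Hypothetical storage abstraction (not the iceberg-rust FileIO API).
/// Taking it as `&mut dyn FileDeleter` lets callers plug in any backend.
pub trait FileDeleter {
    fn delete(&mut self, path: &str) -> Result<(), String>;
}

/// In-memory stand-in that only records what it was asked to delete.
#[derive(Default)]
pub struct RecordingDeleter {
    pub deleted: Vec<String>,
}

impl FileDeleter for RecordingDeleter {
    fn delete(&mut self, path: &str) -> Result<(), String> {
        self.deleted.push(path.to_string());
        Ok(())
    }
}

/// Walks a file listing lazily and deletes unreferenced files in bounded
/// batches, so at most `batch_size` candidate paths are buffered at a time.
/// Returns the number of files deleted.
pub fn delete_orphans_incrementally(
    listing: impl Iterator<Item = String>,
    referenced: &HashSet<String>,
    deleter: &mut dyn FileDeleter,
    batch_size: usize,
) -> Result<usize, String> {
    let batch_size = batch_size.max(1); // guard against a zero batch size
    let mut batch: Vec<String> = Vec::with_capacity(batch_size);
    let mut deleted = 0;

    for path in listing {
        if referenced.contains(&path) {
            continue; // still referenced by table metadata, keep it
        }
        batch.push(path);
        if batch.len() == batch_size {
            deleted += flush(&mut batch, deleter)?;
        }
    }
    // Delete whatever is left in the final, partially filled batch.
    deleted += flush(&mut batch, deleter)?;
    Ok(deleted)
}

/// Deletes every path in the batch and leaves the buffer empty.
fn flush(batch: &mut Vec<String>, deleter: &mut dyn FileDeleter) -> Result<usize, String> {
    let count = batch.len();
    for path in batch.drain(..) {
        deleter.delete(&path)?;
    }
    Ok(count)
}
```

If the process dies partway through, the batches already flushed stay deleted and a rerun only has to handle what remains, which is the trade-off principle 2 argues for.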

Implementation

This initial implementation will focus on three maintenance tasks:

  1. Expire Snapshots (tracked in "Maintenance: Expire Snapshots Action", #1454)
  2. Rewrite Manifests
  3. Remove Orphan Files

Other operations are well-suited to re-implementation in Rust, but these are (in my view) the
critical baseline operations that must run to keep the Iceberg metadata in a healthy state.

I will open separate issues for each operation and attach PRs.

Willingness to contribute

I can contribute to this feature independently
