TableMetadataBuilder #587
Conversation
ToDos:

Fixes #232
Thanks @c-thiel for this pr, I've skimmed through it and it looks great to me. However, this pr is too huge to review (3k lines); would you mind splitting it into smaller ones? For example, we could add one pr for the methods involved in one
Thanks for your feedback @liurenjie1024. This isn't really a refactoring of the builder, it's more a complete rewrite. The old builder allowed creating corrupt metadata in various ways. Splitting it up by I would currently prefer to keep it as a larger block, mainly because:
We now have a vision of what it could look like in the end. Before putting any more effort in, we should answer the following questions:
Those points might change the overall design quite a bit and might require a re-write of After we answered those questions, and we still think splitting makes sense, I can try to find time to build stacked PRs. Maybe just splitting normalization / validation in
@liurenjie1024 I tried to cut a few things out - but not along the lines of
After they are all merged, I'll rebase this PR for the actual builder.
Hi, @c-thiel Sorry for the late reply.
I've gone through the new builder and I think your design is the right direction.
To be honest, I don't quite understand the use case. We can ask for background on this in the dev channel, but I think this is not a blocker for this pr; we can always add this later.
I've taken a look at the comments of these two prs: apache/iceberg#6701 apache/iceberg#7445 And I think the reason behind the behavior is the
I agree that this should be required, as I mentioned in #550
That sounds reasonable to me. If one pr per table update is too much of a burden, could we split them by components, for example sort order, partition spec, schema changes?
@liurenjie1024 thanks for the feedback!
The problem with changing it later is that it changes the semantics of the function. Right now we expect source_id to match the In my opinion ids are much cleaner than names (we might have dropped and re-added a column with the same name in the meantime), so I am OK with going forward. However, moving over to Java semantics will require new endpoints (i.e. Give me a thumbs up if that's OK for you. I'll also open a discussion in the dev channel to get some more opinions.
I don't think we should add the argument to be honest. My reasoning is as follows: Maybe @nastra or @Fokko could add some comments on the intention of that parameter?
I have reviewed most PRs that I am confident can be merged. The only one left is #615, for which I need more input. |
@Xuanwo, @liurenjie1024 this PR is ready for another round of review. It's now rebased on the 6 PRs we merged during the last months. The core logic is ~1100 lines of code, including quite a bit of comments.
Fokko
left a comment
@c-thiel I left a few comments, and suggestions for improvement, LMKWYT. Apart from that it looks great and good to go 👍
return Ok(self);
}

// ToDo Discuss: Java builds a fresh spec here:
We just do this once, after that the schema is evolved using the UpdateSchema, that's implemented by the SchemaUpdate. There you pass in the last-column-id, and new fields will increment from that.
That is actually a very important point, and I think we should fix rust.
Currently in rust we use the SchemaBuilder for everything.
To get to it, we use an existing Schema and use the into_builder(self) method.
I believe we should modify this to into_builder(self, last_column_id: Option<i64>) so that users are forced to think about the last_column_id. Most likely they will want to use the last_column_id of the metadata they got the Schema from.
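A minimal sketch of what this proposal could look like, using simplified hypothetical types (the real `Schema` and `SchemaBuilder` in iceberg-rust are far richer):

```rust
// Hypothetical simplified types, not the actual iceberg-rust API.
#[derive(Debug, Clone)]
struct Field {
    id: i64,
    name: String,
}

#[derive(Debug, Clone)]
struct Schema {
    fields: Vec<Field>,
}

struct SchemaBuilder {
    fields: Vec<Field>,
    last_column_id: i64,
}

impl Schema {
    // Forcing callers to pass `last_column_id` makes them think about it.
    // `None` falls back to the highest id in this schema, which can silently
    // reuse ids of columns that were dropped from the table earlier.
    fn into_builder(self, last_column_id: Option<i64>) -> SchemaBuilder {
        let max_id = self.fields.iter().map(|f| f.id).max().unwrap_or(0);
        SchemaBuilder {
            fields: self.fields,
            last_column_id: last_column_id.unwrap_or(max_id),
        }
    }
}

impl SchemaBuilder {
    fn add_field(mut self, name: &str) -> Self {
        // Fresh ids always increment past the tracked high-water mark.
        self.last_column_id += 1;
        self.fields.push(Field {
            id: self.last_column_id,
            name: name.to_string(),
        });
        self
    }

    fn build(self) -> Schema {
        Schema { fields: self.fields }
    }
}
```

With `Some(5)` passed in from the table metadata, a new field gets id 6 even if the schema itself only contains id 1, so ids of previously dropped columns are never reused.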
I'm leaning toward the other way, where the last-field-id is tracked by Iceberg-rust itself and not exposed to the user. The last-column-id is a monotonically increasing ID that keeps track of which field-IDs are already being used.
If we do it completely internally, then we should probably disallow, or at least discourage, into_builder on schema.
I am unsure how an alternative API should look. @liurenjie1024 some input would be very helpful.
My thoughts:
- Introduce `TableMetadata.update_schema` that returns the current `SchemaBuilder`, however with a fixed `last-column-id` (set to the `TableMetadata` value). We document that `add_schema` should only be used for schemas modified in this way to ensure that `last_column_id` is correct.
- If documentation is not enough, we could introduce a new type `SchemaUpdate` that can only be created by calling `TableMetadata.update_schema`. It would hold the `TableMetadataBuilder` internally and have methods like `add_field` or `remove_field`. Upon build, it would return the `TableMetadataBuilder`.
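A rough sketch of the second option with simplified hypothetical types (not the actual iceberg-rust API): because `SchemaUpdate` can only be obtained from the metadata, the correct `last_column_id` is threaded through automatically.

```rust
// Hypothetical simplified types, not the actual iceberg-rust API.
struct TableMetadata {
    last_column_id: i64,
    field_names: Vec<String>,
}

struct TableMetadataBuilder {
    metadata: TableMetadata,
}

// Can only be obtained via `TableMetadata::update_schema`, so callers can
// never supply (or forget) the last_column_id themselves.
struct SchemaUpdate {
    builder: TableMetadataBuilder,
    last_column_id: i64,
}

impl TableMetadata {
    fn update_schema(self) -> SchemaUpdate {
        let last_column_id = self.last_column_id;
        SchemaUpdate {
            builder: TableMetadataBuilder { metadata: self },
            last_column_id,
        }
    }
}

impl SchemaUpdate {
    fn add_field(mut self, name: &str) -> Self {
        // Ids keep increasing from the metadata's high-water mark, never reused.
        self.last_column_id += 1;
        self.builder.metadata.field_names.push(name.to_string());
        self
    }

    // Upon build, hand the builder back with the updated high-water mark.
    fn build(mut self) -> TableMetadataBuilder {
        self.builder.metadata.last_column_id = self.last_column_id;
        self.builder
    }
}
```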
When adding a word of caution to the method, I'm fine with keeping it public for now.
I know that this is quite late in the process here, but I think it might be complementary to the builder we're seeing here. For PyIceberg we've taken a different approach, where we apply the MetadataUpdates (AddSchema, SetCurrentSchema, etc.) to the metadata, rather than having a builder where you can modify the TableMetadata. Code can be found here: https://github.com/apache/iceberg-python/blob/60800d88c7e44fe2ed58a25f1b0fcd5927156adf/pyiceberg/table/update/__init__.py#L215-L494
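The PyIceberg apply-updates approach could look roughly like this in Rust. This is a sketch with made-up, heavily simplified update variants, not the real `TableUpdate` enum:

```rust
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    schema_id: i32,
    fields: Vec<String>,
}

#[derive(Debug, Default)]
struct TableMetadata {
    schemas: Vec<Schema>,
    current_schema_id: i32,
}

// Each update is a small, self-describing change that is applied to the
// metadata, instead of mutating it through builder methods.
enum MetadataUpdate {
    AddSchema(Schema),
    SetCurrentSchema(i32),
}

fn apply(mut metadata: TableMetadata, update: MetadataUpdate) -> Result<TableMetadata, String> {
    match update {
        MetadataUpdate::AddSchema(schema) => {
            metadata.schemas.push(schema);
            Ok(metadata)
        }
        MetadataUpdate::SetCurrentSchema(id) => {
            // Validation lives next to the update it guards.
            if !metadata.schemas.iter().any(|s| s.schema_id == id) {
                return Err(format!("schema {id} does not exist"));
            }
            metadata.current_schema_id = id;
            Ok(metadata)
        }
    }
}
```

One design consequence: validation is attached to each update variant, so every consumer of the updates gets the same checks, whereas a builder can expose methods that bypass them.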
The TableUpdate is currently only part of the rest catalog - not of the core spec.
I think we should keep them public unless we switch to the python approach.
I added a word of warning that we do not check compatibility. Let me know if this is sufficient.
We should extend it with a link to the new method that is implemented in #697.
> The TableUpdate is currently only part of the rest catalog - not of the core spec.

Just to mention that it's part of the core crate in rust:
pub enum TableUpdate {

> I think we should keep them public unless we switch to the python approach.

In fact, the original design was following the java/python approach, see
pub struct Transaction<'a> {
@liurenjie1024 you are right, I thought it was part of rest.
I am still a bit hesitant about making it private. Let me try to lay out my points. None of these are blockers, I just want to take them into consideration.
- I think it's quite useful in tests - good evidence is the rust crate itself, where we use it in a few places. With Lakekeeper we are also using it for tests. It's handy to not require a `TableUpdate` wrapper around the schema that we want to add.
- There are features that we would need to expose in other ways:
  a) Applying multiple `TableUpdate`s to a `TableMetadata` in a single shot and obtaining the `TableMetadataBuildResult`, including `changes` and `expired_metadata_logs`. A `Transaction` isn't suitable for this. A `Transaction` works on a `Table`, not `TableMetadata`. `TableMetadata` is much lighter, and we as a catalog never actually initialize `Table` - and we don't want to initialize it just to mutate some metadata. `Transaction` also calls the `Catalog` trait in the end - not applying the updates. So it's not exposing the metadata-mutating functionality for us.
  This could be solved by adding `pub struct TableUpdates(Vec<TableUpdate>)` or so instead and adding an `apply` method to it. There are other options as well of course. Each of them would require another public struct, so we might also stick with the builder.
  b) Initializing `TableMetadata` conveniently, including all fields. Currently the only way to do this is via the builder. I don't think it would be ergonomic to have a basic `new()` method on `TableMetadata` and require users to use a completely different mechanism, such as `TableUpdate.apply`, to go further (i.e. adding a snapshot or setting the Uuid). `TableCreation` doesn't offer this either in its current state.
My baseline is that mutating TableMetadata on its own is something that is valuable to external crates, such as our Catalog. So we should offer a public, ergonomic interface for it. I believe the builder offers this interface - Transaction and TableUpdate on their own do not.
Let me know what you think!
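The `TableUpdates(Vec<TableUpdate>)` idea mentioned in point a) could be sketched like this, with hypothetical simplified types (the real `TableMetadataBuildResult` carries more fields, such as `expired_metadata_logs`):

```rust
#[derive(Debug, Clone, Default)]
struct TableMetadata {
    properties: Vec<(String, String)>,
}

#[derive(Clone)]
enum TableUpdate {
    SetProperty { key: String, value: String },
}

// Result of applying a batch: the new metadata plus the changes that were
// actually applied (mirroring the `changes` of `TableMetadataBuildResult`).
struct TableMetadataBuildResult {
    metadata: TableMetadata,
    changes: Vec<TableUpdate>,
}

struct TableUpdates(Vec<TableUpdate>);

impl TableUpdates {
    // Apply all updates in a single shot, without needing a `Table`,
    // a `Transaction`, or a `Catalog`.
    fn apply(self, mut metadata: TableMetadata) -> TableMetadataBuildResult {
        let mut changes = Vec::new();
        for update in self.0 {
            match &update {
                TableUpdate::SetProperty { key, value } => {
                    metadata.properties.push((key.clone(), value.clone()));
                }
            }
            changes.push(update);
        }
        TableMetadataBuildResult { metadata, changes }
    }
}
```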
I think it's reasonable to keep them public for the reasons you mentioned. TableMetadataBuilder doesn't need to do all validity checks for inputs, which are the responsibility of the transaction api.
Okay, I've added this to the spec a while ago: apache#7445 But I think this was a mistake, and we should not expose this to the public APIs, as it is much better to track this internally. I noticed this while reviewing apache/iceberg-rust#587 Removing this as part of the APIs in Java, and the Open-API update makes it much more resilient, and don't require the clients to compute this value. For example. when there are two conflicting schema changes, the last-column-id must be recomputed correctly when doing the retry operation.
@liurenjie1024 for
Co-authored-by: Renjie Liu <[email protected]>
Slept over it and
/// Remove snapshots by their ids from the table metadata.
/// Does nothing if a snapshot id is not present.
/// Keeps as changes only the snapshots that were actually removed.
pub fn remove_snapshots(mut self, snapshot_ids: &[i64]) -> Self {
Currently when removing snapshots we might have parent_snapshot_ids of other snapshots pointing to non-existing snapshots. This follows the behavior in Java.
Is this desirable? Or would it be more correct to set them to None as there is no parent available anymore?
@Fokko
@c-thiel I think it is more correct to set it to None. I'll follow up on the Java side to get this fixed.
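A sketch of the `None`-parent variant discussed here, using a simplified hypothetical snapshot type (not the real iceberg-rust `Snapshot`):

```rust
#[derive(Debug, Clone)]
struct Snapshot {
    snapshot_id: i64,
    parent_snapshot_id: Option<i64>,
}

// Remove the given snapshots; afterwards, clear any parent pointer that now
// references a snapshot which no longer exists, instead of leaving it
// dangling (the Java behavior discussed above).
fn remove_snapshots(mut snapshots: Vec<Snapshot>, ids: &[i64]) -> Vec<Snapshot> {
    snapshots.retain(|s| !ids.contains(&s.snapshot_id));
    let existing: Vec<i64> = snapshots.iter().map(|s| s.snapshot_id).collect();
    for snapshot in &mut snapshots {
        if let Some(parent) = snapshot.parent_snapshot_id {
            if !existing.contains(&parent) {
                snapshot.parent_snapshot_id = None;
            }
        }
    }
    snapshots
}
```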
/// # Errors
/// - The ref is unknown.
/// - Any of the preconditions of `self.add_snapshot` are not met.
pub fn append_snapshot(self, snapshot: Snapshot, ref_name: Option<&str>) -> Result<Self> {
Sorry for the confusion, I meant deprecating the append_snapshot method in TableMetadata and planning to remove it in a future release. I think adding the #[deprecated] annotation would be enough since cargo will warn about this. Also, I think we should remove this method from TableMetadataBuilder?
crates/iceberg/src/catalog/mod.rs
fn test_check_last_assigned_partition_id() {
    let metadata = metadata();

    println!("{:?}", metadata.last_partition_id);
Remove this debug statement?
Done
}

/// Check if this schema is identical to another schema semantically - excluding schema id.
pub(crate) fn is_same_schema(&self, other: &SchemaRef) -> bool {
Why not implement the PartialEq and Eq traits?
I thought about this as well, but opted for a method with documentation.
This method excludes the schema_id, which I can describe here in the docstring. Using Eq I would expect the schema_id to be equal too - especially because it's used in tests.
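The trade-off can be illustrated with a simplified hypothetical schema type (not the real `Schema`): derived `PartialEq` compares `schema_id` too, while the dedicated method ignores it.

```rust
#[derive(Debug, PartialEq)]
struct Schema {
    schema_id: i32,
    fields: Vec<String>,
}

impl Schema {
    // Semantic equality: same fields, regardless of which schema_id
    // the metadata happened to assign.
    fn is_same_schema(&self, other: &Schema) -> bool {
        self.fields == other.fields
    }
}
```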
}
}

impl From<TableMetadataBuildResult> for TableMetadata {
Why do we need this? The metadata field is already exposed as a public field.
Just for convenience - should I remove it?
partition_spec,
sort_order.unwrap_or(SortOrder::unsorted_order()),
location,
FormatVersion::V1,
In Java, we create V2 tables by default.
We do now too :)
* Core,Open-API: Don't expose the `last-column-id`

  Okay, I've added this to the spec a while ago: #7445 But I think this was a mistake, and we should not expose this to the public APIs, as it is much better to track this internally. I noticed this while reviewing apache/iceberg-rust#587 Removing this as part of the APIs in Java, and the Open-API update makes it much more resilient, and doesn't require the clients to compute this value. For example, when there are two conflicting schema changes, the last-column-id must be recomputed correctly when doing the retry operation.
* Update the tests as well
* Add `deprecation` flag
* Wording

  Co-authored-by: Eduard Tudenhoefner <[email protected]>
* Wording

  Co-authored-by: Eduard Tudenhoefner <[email protected]>
* Wording
* Thanks Ryan!
* Remove `LOG`

Co-authored-by: Eduard Tudenhoefner <[email protected]>
liurenjie1024
left a comment
Thanks @c-thiel for this great pr, LGTM!
* Squash builder
* Address comments
* Address comments
* Match on FormatVersion to fail for V3
* Fix examples
* Fix tests
* Address comments
* Address comments
* Update crates/iceberg/src/spec/table_metadata_builder.rs

  Co-authored-by: Renjie Liu <[email protected]>
* Remove ReferenceType
* Fix import
* Remove current_schema and last_updated_ms accessors
* Ensure main branch is not removed
* Address comments
* Fix tests
* Do not ensure ensure_main_branch_not_removed
* set_branch_snapshot create branch if not exists

Co-authored-by: Renjie Liu <[email protected]>
use crate::error::{Error, ErrorKind, Result};
use crate::{TableCreation, TableUpdate};

const FIRST_FIELD_ID: u32 = 1;
Curious why we start field ids at 1, rather than 0?
This PR is now ready for first reviews.
Some Remarks:
- For `add_sort_order` and `add_partition_spec` the Java code re-builds the added sort-order against the current schema by matching column names. This implementation currently does not do this. Adding this feature would require `PartitionSpec` (bound) to store the schema it was bound against (probably a good idea anyway) and splitting `SortOrder` into bound and unbound, where the bound `SortOrder` also stores the schema it was bound against. Instead, this implementation assumes that provided sort-orders and partition-specs are valid for the current schema. Compatibility with the current schema is tested.
- The `add_schema` method does not require a `new_last_column_id` argument. In Java there is a todo to achieve the same. I put my reasoning in a comment in the code, feel free to comment on it.
- `new()` behaviour now re-assigns field-ids to start from 0. Some tests started from 1 before. Re-assigning the ids, just like in Java, ensures that fresh metadata always has fresh and correct ids even if they are created manually or re-used from another metadata.
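The fresh-id reassignment described in the last remark can be illustrated with a small sketch (hypothetical flat field type; real Iceberg schemas are nested):

```rust
#[derive(Debug, Clone, PartialEq)]
struct Field {
    id: i32,
    name: String,
}

// Re-assign sequential fresh ids, ignoring whatever ids the caller supplied,
// so manually built or re-used schemas always end up with correct, gap-free ids.
fn reassign_field_ids(fields: Vec<Field>, first_id: i32) -> Vec<Field> {
    fields
        .into_iter()
        .enumerate()
        .map(|(i, field)| Field {
            id: first_id + i as i32,
            ..field
        })
        .collect()
}
```

The `first_id` parameter is an assumption for illustration; whether fresh ids should start at 0 or 1 is exactly the open question raised above about `FIRST_FIELD_ID`.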