TableMetadataBuilder #587
Conversation
ToDos:

Fixes #232
Thanks @c-thiel for this pr, I've skimmed through it and it looks great to me. However, this pr is too huge to review (3k lines); would you mind splitting it into smaller ones? For example, we could add one pr for the methods involved in one
Thanks for your feedback @liurenjie1024. This isn't really a refactoring of the builder, it's more a complete rewrite. The old builder allowed creating corrupt metadata in various ways. Splitting it up by I would currently prefer to keep it as a larger block, mainly because:
We now have a vision of what it could look like in the end. Before putting any more effort in, we should answer the following questions:
Those points might change the overall design quite a bit and might require a re-write of After we answered those questions, and we still think splitting makes sense, I can try to find time to build stacked PRs. Maybe just splitting normalization / validation in
@liurenjie1024 I tried to cut a few things out - but not along the lines of
After they are all merged, I'll rebase this PR for the actual builder.
Hi, @c-thiel Sorry for the late reply.
I've gone through the new builder and I think your design is the right direction.
To be honest, I don't quite understand the use case. We can ask for background on this in the dev channel, but I think this is not a blocker for this pr; we can always add this later.
I've taken a look at the comments of these two prs: apache/iceberg#6701 apache/iceberg#7445 And I think the reason behind the behavior is the
I agree that this should be required, as I mentioned in #550
That sounds reasonable to me. If one pr per table update is too much of a burden, could we split them by components, for example sort order, partition spec, schema changes?
@liurenjie1024 thanks for the feedback!
The problem with changing it later is that it changes the semantics of the function. Right now we expect source_id to match the In my opinion ids are much cleaner than names (we might have dropped and re-added a column with the same name in the meantime), so I am OK with going forward. However, moving over to Java semantics will require new endpoints (i.e. Give me a thumbs up if that's OK for you. I'll also open a discussion in the dev channel to get some more opinions.
I don't think we should add the argument to be honest. My reasoning is as follows: Maybe @nastra or @Fokko could add some comments on the intention of that parameter?
I have reviewed most PRs that I am confident can be merged. The only one left is #615, for which I need more input. |
@Xuanwo, @liurenjie1024 this PR is ready for another round of review. It's now rebased on the 6 PRs we merged during the last months. The core logic is ~1100 lines of code, including quite a bit of comments.
Fokko
left a comment
@c-thiel I left a few comments, and suggestions for improvement, LMKWYT. Apart from that it looks great and good to go 👍
return Ok(self);
}

// ToDo Discuss: Java builds a fresh spec here:
We just do this once, after that the schema is evolved using the UpdateSchema, that's implemented by the SchemaUpdate. There you pass in the last-column-id, and new fields will increment from that.
That is actually a very important point, and I think we should fix rust.
Currently in rust we use the SchemaBuilder for everything.
To get to it, we use an existing Schema and use the into_builder(self) method.
I believe we should modify this to into_builder(self, last_column_id: Option<i64>) so that users are forced to think about the last_column_id. Most likely they will want to use the last_column_id of the metadata they got the Schema from.
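A minimal sketch of what this proposal could look like, using simplified hypothetical types (the real `Schema` and `SchemaBuilder` in iceberg-rust are far richer):

```rust
// Hypothetical simplified types, not the actual iceberg-rust API.
#[derive(Debug, Clone)]
struct Field {
    id: i64,
    name: String,
}

#[derive(Debug, Clone)]
struct Schema {
    fields: Vec<Field>,
}

struct SchemaBuilder {
    fields: Vec<Field>,
    last_column_id: i64,
}

impl Schema {
    // Forcing callers to pass `last_column_id` makes them think about it.
    // `None` falls back to the highest id in this schema, which can silently
    // reuse ids of columns that were dropped from the table earlier.
    fn into_builder(self, last_column_id: Option<i64>) -> SchemaBuilder {
        let max_id = self.fields.iter().map(|f| f.id).max().unwrap_or(0);
        SchemaBuilder {
            fields: self.fields,
            last_column_id: last_column_id.unwrap_or(max_id),
        }
    }
}

impl SchemaBuilder {
    fn add_field(mut self, name: &str) -> Self {
        // Fresh ids always increment past the tracked high-water mark.
        self.last_column_id += 1;
        self.fields.push(Field {
            id: self.last_column_id,
            name: name.to_string(),
        });
        self
    }

    fn build(self) -> Schema {
        Schema { fields: self.fields }
    }
}
```

With `Some(5)` passed in from the table metadata, a new field gets id 6 even if the schema itself only contains id 1, so ids of previously dropped columns are never reused.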
I'm leaning toward the other way, where the last-field-id is tracked by Iceberg-rust itself and not exposed to the user. The last-column-id is a monotonically increasing ID that keeps track of which field-IDs are already being used.
If we do it completely internally, then we should probably disallow, or at least discourage, into_builder on schema.
I am unsure how an alternative API should look. @liurenjie1024 some input would be very helpful.
My thoughts:
- Introduce `TableMetadata.update_schema` that returns the current `SchemaBuilder`, however with a fixed `last-column-id` (set to the `TableMetadata` value). We document that `add_schema` should only be used for schemas modified in this way to ensure that `last_column_id` is correct.
- If documentation is not enough, we could introduce a new type `SchemaUpdate` that can only be created by calling `TableMetadata.update_schema`. It would hold the `TableMetadataBuilder` internally and have methods like `add_field` or `remove_field`. Upon build, it would return the `TableMetadataBuilder`.
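A rough sketch of the second option with simplified hypothetical types (not the actual iceberg-rust API): because `SchemaUpdate` can only be obtained from the metadata, the correct `last_column_id` is threaded through automatically.

```rust
// Hypothetical simplified types, not the actual iceberg-rust API.
struct TableMetadata {
    last_column_id: i64,
    field_names: Vec<String>,
}

struct TableMetadataBuilder {
    metadata: TableMetadata,
}

// Can only be obtained via `TableMetadata::update_schema`, so callers can
// never supply (or forget) the last_column_id themselves.
struct SchemaUpdate {
    builder: TableMetadataBuilder,
    last_column_id: i64,
}

impl TableMetadata {
    fn update_schema(self) -> SchemaUpdate {
        let last_column_id = self.last_column_id;
        SchemaUpdate {
            builder: TableMetadataBuilder { metadata: self },
            last_column_id,
        }
    }
}

impl SchemaUpdate {
    fn add_field(mut self, name: &str) -> Self {
        // Ids keep increasing from the metadata's high-water mark, never reused.
        self.last_column_id += 1;
        self.builder.metadata.field_names.push(name.to_string());
        self
    }

    // Upon build, hand the builder back with the updated high-water mark.
    fn build(mut self) -> TableMetadataBuilder {
        self.builder.metadata.last_column_id = self.last_column_id;
        self.builder
    }
}
```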
When adding a word of caution to the method, I'm fine with keeping it public for now.
I know that this is quite late in the process here, but I think it might be complementary to the builder we're seeing here. For PyIceberg we've taken a different approach, where we apply the MetadataUpdates (AddSchema, SetCurrentSchema, etc.) to the metadata, rather than having a builder where you can modify the TableMetadata. Code can be found here: https://github.com/apache/iceberg-python/blob/60800d88c7e44fe2ed58a25f1b0fcd5927156adf/pyiceberg/table/update/__init__.py#L215-L494
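The PyIceberg apply-updates approach could look roughly like this in Rust. This is a sketch with made-up, heavily simplified update variants, not the real `TableUpdate` enum:

```rust
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    schema_id: i32,
    fields: Vec<String>,
}

#[derive(Debug, Default)]
struct TableMetadata {
    schemas: Vec<Schema>,
    current_schema_id: i32,
}

// Each update is a small, self-describing change that is applied to the
// metadata, instead of mutating it through builder methods.
enum MetadataUpdate {
    AddSchema(Schema),
    SetCurrentSchema(i32),
}

fn apply(mut metadata: TableMetadata, update: MetadataUpdate) -> Result<TableMetadata, String> {
    match update {
        MetadataUpdate::AddSchema(schema) => {
            metadata.schemas.push(schema);
            Ok(metadata)
        }
        MetadataUpdate::SetCurrentSchema(id) => {
            // Validation lives next to the update it guards.
            if !metadata.schemas.iter().any(|s| s.schema_id == id) {
                return Err(format!("schema {id} does not exist"));
            }
            metadata.current_schema_id = id;
            Ok(metadata)
        }
    }
}
```

One design consequence: validation is attached to each update variant, so every consumer of the updates gets the same checks, whereas a builder can expose methods that bypass them.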
The TableUpdate is currently only part of the rest catalog - not of the core spec.
I think we should keep them public unless we switch to the python approach.
I added a word of warning that we do not check compatibility. Let me know if this is sufficient.
We should extend it with a link to the new method that is implemented in #697.
> The TableUpdate is currently only part of the rest catalog - not of the core spec.

Just to mention that it's part of the core crate in rust:
pub enum TableUpdate {

> I think we should keep them public unless we switch to the python approach.

In fact, the original design was following the java/python approach, see
pub struct Transaction<'a> {
@liurenjie1024 you are right, I thought it was part of rest.
I am still a bit hesitant about making it private. Let me try to lay out my points. None of these are blockers, I just want to take them into consideration.
- I think it's quite useful in tests - good evidence is the rust crate itself, where we use it in a few places. With Lakekeeper we are also using it for tests. It's handy to not require a `TableUpdate` wrapper around the schema that we want to add.
- There are features that we would need to expose in other ways:
  a) Applying multiple `TableUpdate`s to a `TableMetadata` in a single shot and obtaining the `TableMetadataBuildResult`, including `changes` and `expired_metadata_logs`. A `Transaction` isn't suitable for this. A `Transaction` works on a `Table`, not `TableMetadata`. `TableMetadata` is much lighter, and we as a catalog never actually initialize `Table` - and we don't want to initialize it just to mutate some metadata. `Transaction` also calls the `Catalog` trait in the end - not applying the updates. So it's not exposing the metadata-mutating functionality for us.
  This could be solved by adding `pub struct TableUpdates(Vec<TableUpdate>)` or so instead and adding an `apply` method to it. There are other options as well of course. Each of them would require another public struct, so we might also stick with the builder.
  b) Initializing `TableMetadata` conveniently, including all fields. Currently the only way to do this is via the builder. I don't think it would be ergonomic to have a basic `new()` method on `TableMetadata` and require users to use a completely different mechanism, such as `TableUpdate.apply`, to go further (i.e. adding a snapshot or setting the Uuid). `TableCreation` doesn't offer this either in its current state.
My baseline is that mutating TableMetadata on its own is something that is valuable to external crates, such as our Catalog. So we should offer a public, ergonomic interface for it. I believe the builder offers this interface - Transaction and TableUpdate on their own do not.
Let me know what you think!
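The `TableUpdates(Vec<TableUpdate>)` idea mentioned in point a) could be sketched like this, with hypothetical simplified types (the real `TableMetadataBuildResult` carries more fields, such as `expired_metadata_logs`):

```rust
#[derive(Debug, Clone, Default)]
struct TableMetadata {
    properties: Vec<(String, String)>,
}

#[derive(Clone)]
enum TableUpdate {
    SetProperty { key: String, value: String },
}

// Result of applying a batch: the new metadata plus the changes that were
// actually applied (mirroring the `changes` of `TableMetadataBuildResult`).
struct TableMetadataBuildResult {
    metadata: TableMetadata,
    changes: Vec<TableUpdate>,
}

struct TableUpdates(Vec<TableUpdate>);

impl TableUpdates {
    // Apply all updates in a single shot, without needing a `Table`,
    // a `Transaction`, or a `Catalog`.
    fn apply(self, mut metadata: TableMetadata) -> TableMetadataBuildResult {
        let mut changes = Vec::new();
        for update in self.0 {
            match &update {
                TableUpdate::SetProperty { key, value } => {
                    metadata.properties.push((key.clone(), value.clone()));
                }
            }
            changes.push(update);
        }
        TableMetadataBuildResult { metadata, changes }
    }
}
```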
I think it's reasonable to keep them public for the reasons you mentioned. TableMetadataBuilder doesn't need to do all validity checks for inputs, which are the responsibility of the transaction api.
Okay, I've added this to the spec a while ago: apache#7445 But I think this was a mistake, and we should not expose this to the public APIs, as it is much better to track this internally. I noticed this while reviewing apache/iceberg-rust#587 Removing this as part of the APIs in Java, and the Open-API update makes it much more resilient, and don't require the clients to compute this value. For example. when there are two conflicting schema changes, the last-column-id must be recomputed correctly when doing the retry operation.
@liurenjie1024 for
Co-authored-by: Renjie Liu <[email protected]>
Slept over it and
/// Remove snapshots by their ids from the table metadata.
/// Does nothing if a snapshot id is not present.
/// Keeps as changes only the snapshots that were actually removed.
pub fn remove_snapshots(mut self, snapshot_ids: &[i64]) -> Self {
Currently when removing snapshots we might have parent_snapshot_ids of other snapshots pointing to non-existing snapshots. This follows the behavior in Java.
Is this desirable? Or would it be more correct to set them to None as there is no parent available anymore?
@Fokko
@c-thiel I think it is more correct to set it to None. I'll follow up on the Java side to get this fixed.
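A sketch of the `None`-parent variant discussed here, using a simplified hypothetical snapshot type (not the real iceberg-rust `Snapshot`):

```rust
#[derive(Debug, Clone)]
struct Snapshot {
    snapshot_id: i64,
    parent_snapshot_id: Option<i64>,
}

// Remove the given snapshots; afterwards, clear any parent pointer that now
// references a snapshot which no longer exists, instead of leaving it
// dangling (the Java behavior discussed above).
fn remove_snapshots(mut snapshots: Vec<Snapshot>, ids: &[i64]) -> Vec<Snapshot> {
    snapshots.retain(|s| !ids.contains(&s.snapshot_id));
    let existing: Vec<i64> = snapshots.iter().map(|s| s.snapshot_id).collect();
    for snapshot in &mut snapshots {
        if let Some(parent) = snapshot.parent_snapshot_id {
            if !existing.contains(&parent) {
                snapshot.parent_snapshot_id = None;
            }
        }
    }
    snapshots
}
```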
/// # Errors
/// - The ref is unknown.
/// - Any of the preconditions of `self.add_snapshot` are not met.
pub fn append_snapshot(self, snapshot: Snapshot, ref_name: Option<&str>) -> Result<Self> {
Sorry for the confusion, I meant deprecating the append_snapshot method in TableMetadata and planning to remove it in a future release. I think adding the #[deprecated] annotation would be enough since cargo will warn about this. Also, I think we should remove this method from TableMetadataBuilder?
crates/iceberg/src/catalog/mod.rs
fn test_check_last_assigned_partition_id() {
    let metadata = metadata();

    println!("{:?}", metadata.last_partition_id);
Remove this debug statement?
Done
}

/// Check if this schema is identical to another schema semantically - excluding schema id.
pub(crate) fn is_same_schema(&self, other: &SchemaRef) -> bool {
Why not implement the PartialEq and Eq traits?
I thought about this as well, but opted for a method with documentation.
This method excludes the schema_id, which I can describe here in the docstring. Using Eq I would expect the schema_id to be equal too - especially because it's used in tests.
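The trade-off can be illustrated with a simplified hypothetical schema type (not the real `Schema`): derived `PartialEq` compares `schema_id` too, while the dedicated method ignores it.

```rust
#[derive(Debug, PartialEq)]
struct Schema {
    schema_id: i32,
    fields: Vec<String>,
}

impl Schema {
    // Semantic equality: same fields, regardless of which schema_id
    // the metadata happened to assign.
    fn is_same_schema(&self, other: &Schema) -> bool {
        self.fields == other.fields
    }
}
```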
}
}

impl From<TableMetadataBuildResult> for TableMetadata {
Why do we need this? The metadata field is already exposed as a public field.
Just for convenience - should I remove it?
partition_spec,
sort_order.unwrap_or(SortOrder::unsorted_order()),
location,
FormatVersion::V1,
In Java, we create V2 tables by default.
We do now too :)
* Core,Open-API: Don't expose the `last-column-id`

  Okay, I've added this to the spec a while ago: #7445 But I think this was a mistake, and we should not expose this to the public APIs, as it is much better to track this internally. I noticed this while reviewing apache/iceberg-rust#587 Removing this as part of the APIs in Java, and the Open-API update makes it much more resilient, and doesn't require the clients to compute this value. For example, when there are two conflicting schema changes, the last-column-id must be recomputed correctly when doing the retry operation.
* Update the tests as well
* Add `deprecation` flag
* Wording

  Co-authored-by: Eduard Tudenhoefner <[email protected]>
* Wording

  Co-authored-by: Eduard Tudenhoefner <[email protected]>
* Wording
* Thanks Ryan!
* Remove `LOG`

Co-authored-by: Eduard Tudenhoefner <[email protected]>
liurenjie1024
left a comment
Thanks @c-thiel for this great pr, LGTM!
* Squash builder
* Address comments
* Address comments
* Match on FormatVersion to fail for V3
* Fix examples
* Fix tests
* Address comments
* Address comments
* Update crates/iceberg/src/spec/table_metadata_builder.rs

  Co-authored-by: Renjie Liu <[email protected]>
* Remove ReferenceType
* Fix import
* Remove current_schema and last_updated_ms accessors
* Ensure main branch is not removed
* Address comments
* Fix tests
* Do not ensure ensure_main_branch_not_removed
* set_branch_snapshot create branch if not exists

Co-authored-by: Renjie Liu <[email protected]>
use crate::error::{Error, ErrorKind, Result};
use crate::{TableCreation, TableUpdate};

const FIRST_FIELD_ID: u32 = 1;
Curious why we start field ids at 1, rather than 0?
This PR is now ready for first reviews.
Some Remarks:
- For `add_sort_order` and `add_partition_spec` the Java code re-builds the added sort-order against the current schema by matching column names. This implementation currently does not do this. Adding this feature would require `PartitionSpec` (bound) to store the schema it was bound against (probably a good idea anyway) and splitting `SortOrder` into bound and unbound, where the bound `SortOrder` also stores the schema it was bound against. Instead, this implementation assumes that provided sort-orders and partition-specs are valid for the current schema. Compatibility with the current schema is tested.
- The `add_schema` method does not require a `new_last_column_id` argument. In Java there is a todo to achieve the same. I put my reasoning in a comment in the code, feel free to comment on it.
- `new()` behaviour now re-assigns field-ids to start from 0. Some tests started from 1 before. Re-assigning the ids, just like in Java, ensures that fresh metadata always has fresh and correct ids even if they are created manually or re-used from another metadata.
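The fresh-id reassignment described in the last remark can be illustrated with a small sketch (hypothetical flat field type; real Iceberg schemas are nested):

```rust
#[derive(Debug, Clone, PartialEq)]
struct Field {
    id: i32,
    name: String,
}

// Re-assign sequential fresh ids, ignoring whatever ids the caller supplied,
// so manually built or re-used schemas always end up with correct, gap-free ids.
fn reassign_field_ids(fields: Vec<Field>, first_id: i32) -> Vec<Field> {
    fields
        .into_iter()
        .enumerate()
        .map(|(i, field)| Field {
            id: first_id + i as i32,
            ..field
        })
        .collect()
}
```

The `first_id` parameter is an assumption for illustration; whether fresh ids should start at 0 or 1 is exactly the open question raised above about `FIRST_FIELD_ID`.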