Skip to content

Conversation

@Fokko
Copy link
Contributor

@Fokko Fokko commented Dec 6, 2024

This PR removes SchemalessPartitionSpec and UnboundPartitionSpecField. We could also combine BoundPartitionSpec and UnboundPartitionSpec if we like, but this is already quite a big change.

From the spec:

The field-id property was added for each partition field in v2.
In v1, the reference implementation assigned field ids sequentially
in each spec starting at 1,000. See Partition Evolution for more details.

In v1, partition field IDs were not tracked, but were assigned sequentially
starting at 1000 in the reference implementation. This assignment caused
problems when reading metadata tables based on manifest files from multiple
specs because partition fields with the same ID may contain different data types.

For compatibility with old versions, the following rules are recommended for partition evolution in v1 tables:

  • Do not reorder partition fields
  • Do not drop partition fields; instead replace the field's transform with the void transform
  • Only add partition fields at the end of the previous partition spec

I think for simplicity, we should assign the field-IDs starting from 1000, and this will greatly simplify the objects that we need. For V1 the field-ID is missing, and we can just start assigning from 1000 onwards because the IDs are sequential, for V2 tables we deserialize the field-ID from the payload. While I also noticed that we write the field-id field for V1 tables in the reference implementation: apache/iceberg#11708

Next to that, I also believe that users shouldn't have to worry about the field-IDs and that it should be kept internal to Iceberg-Rust. For the evolution of the partition spec, we should have something similar as Java and PyIceberg, in particular for V1 tables, we have to take the rules above into account, otherwise, there is a serious issue of data-loss, or bricking a table. If we agree on this, I'm happy to implement that API.

@Fokko Fokko force-pushed the fd-simplify-partition-structures branch 7 times, most recently from e988ec3 to e7ca59c Compare December 6, 2024 15:14
This PR removes `SchemalessPartitionSpec` and `UnboundPartitionSpecField`.

From the spec:

> The field-id property was added for each partition field in v2.
> In v1, the reference implementation assigned field ids sequentially
> in each spec starting at 1,000. See Partition Evolution for more details.

> In v1, partition field IDs were not tracked, but were assigned sequentially
> starting at 1000 in the reference implementation. This assignment caused
> problems when reading metadata tables based on manifest files from multiple
> specs because partition fields with the same ID may contain different data types.

> For compatibility with old versions, the following rules are recommended for partition evolution in v1 tables:
> - Do not reorder partition fields
> - Do not drop partition fields; instead replace the field's transform with the void transform
> - Only add partition fields at the end of the previous partition spec

I think for simplicity, we should assign the field-IDs starting from 1000,
and this will greatly simplify the objects that we need.

Next to that, I also believe that users shouldn't have to worry about
the field-IDs and that it should be kept internal to Iceberg-Rust.
@Fokko Fokko force-pushed the fd-simplify-partition-structures branch from e7ca59c to 3eee6e3 Compare December 6, 2024 15:19
@Fokko
Copy link
Contributor Author

Fokko commented Dec 10, 2024

Closed in favor of #771

@Fokko Fokko closed this Dec 10, 2024
@Fokko Fokko deleted the fd-simplify-partition-structures branch December 10, 2024 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant