Feature Request / Improvement
Support partitioned writes
So I think we want to tackle the static overwrite first, and then we can compute the predicate for the dynamic overwrite to support that. We can come up with a separate API. I haven't really thought this through, and we can still change this.
I think the most important steps are the breakdown of the work. There is a lot involved, but luckily we already get the test suite from the full overwrite.
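As a rough illustration of the dynamic-overwrite idea mentioned above (the function name and predicate shape here are hypothetical, not the eventual PyIceberg API): collect the partition values present in the incoming data, then build a delete predicate that matches exactly those partitions.

```python
def dynamic_overwrite_predicate(rows, partition_col):
    """Build an OR-of-equalities predicate over the partition values
    present in the incoming rows (illustrative sketch only)."""
    values = sorted({row[partition_col] for row in rows})
    # Each partition value contributes one equality term; the overwrite
    # would delete every file whose partition matches any of these terms.
    return [("eq", partition_col, v) for v in values]
```

For example, incoming rows spanning two days would yield a two-term predicate, which is then used to pick the data files to replace.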
Steps I can see:
- Extend the summary generation to support partitioned writes (here in Java)
- Add support for appending files to partitioned tables.
- How are we going to fan out the writing of the data? We have an Arrow table; what is an efficient way to compute the partitions and scale out the work? For example, are we going to sort the table on the partition column and do a full pass through it? Or are we going to compute all the affected partitions, and then scale out?
- Add support for static overwrites
- Add support for dynamic overwrites
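On the fan-out question in the third step, a minimal sketch of the sort-then-single-pass strategy, over plain Python rows rather than an Arrow table (the same idea applies to sorting an Arrow table; the column name is hypothetical):

```python
from itertools import groupby
from operator import itemgetter

def fan_out_by_partition(rows, partition_col):
    """Sort on the partition column, then split the data in one pass
    into per-partition chunks that can each be written by a worker."""
    ordered = sorted(rows, key=itemgetter(partition_col))
    return {
        key: list(chunk)
        for key, chunk in groupby(ordered, key=itemgetter(partition_col))
    }
```

The alternative (compute the distinct partitions first, then filter per partition) avoids the sort but scans the data once per partition, so the trade-off depends on the number of partitions touched.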
Other things on my mind:
- In Iceberg it can be that some files are still on an older partitioning; we should make sure that we handle those correctly based on the partition spec that we provide.
- How to handle delete files; it might be that the delete files become unrelated because the affected data files are replaced. We could ignore this at first.
The good part:
- In PyIceberg we're first going to ignore the fast-appends (this is when you create a new manifest, and add it to the manifest list). Instead we'll just take the existing manifest(s) and rewrite them into a single new manifest, which makes it a bit easier to reason about the snapshot (and therefore the snapshot summaries). The reason is that this caused quite a few bugs in Java, and it can always be added at a later moment.