feat: Support Bucket and Truncate transforms on write #1345
Conversation
Force-pushed from 560ba20 to bd80f39 (compare)
kevinjqliu left a comment
LGTM! Great to have writes for all the different transformations!
```python
@pytest.mark.parametrize(
    "spec, expected_rows",
    [
        # none of non-identity is supported
```
Suggested change:
```diff
-        # none of non-identity is supported
```
```python
    source_type: PrimitiveType,
    input_arr: Union[pa.Array, pa.ChunkedArray],
    expected: Union[pa.Array, pa.ChunkedArray],
    num_buckets: int,
```
nit: what do you think of reordering these for readability? `num_buckets`, `source_type` and `input_arr` are configs of the `BucketTransform`; `expected` is the output.
Hmm, I'm fairly indifferent here - there's something nice about having the input and expected arrays side by side.
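For context on what `expected` holds in these cases: a bucket transform maps each value to `murmur3_hash(value) % num_buckets`. A minimal sketch using pyiceberg's plain-Python transform API (not the pyarrow path under test here; bucket ids are printed rather than asserted, since they come from the Iceberg hash spec):

```python
from pyiceberg.transforms import BucketTransform
from pyiceberg.types import IntegerType, StringType

bucket = BucketTransform(16)

# transform(source_type) returns a callable mapping a value to its bucket id in [0, 16)
int_bucket = bucket.transform(IntegerType())
str_bucket = bucket.transform(StringType())

print(int_bucket(34))         # bucket id for the int 34
print(str_bucket("iceberg"))  # bucket id for the string "iceberg"
```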
kevinjqliu left a comment
LGTM, thanks for rebasing the PR! I left a few questions on the tests.
We're currently blocked in CI since Spark 3.5.3 was removed from https://dlcdn.apache.org/spark/
```python
        (PartitionSpec(PartitionField(source_id=4, field_id=1001, transform=TruncateTransform(2), name="int_trunc"))),
        (PartitionSpec(PartitionField(source_id=5, field_id=1001, transform=TruncateTransform(2), name="long_trunc"))),
        (PartitionSpec(PartitionField(source_id=2, field_id=1001, transform=TruncateTransform(2), name="string_trunc"))),
        (PartitionSpec(PartitionField(source_id=11, field_id=1001, transform=TruncateTransform(2), name="binary_trunc"))),
```
should we include binary_trunc too?
Yeah, good question. Truncating binary isn't supported in iceberg-rust, so I've excluded this test case for now: https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/transform/truncate.rs#L132-L164
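For reference, a plain-Python sketch of what `truncate(W)` does for the types in this test, following the Iceberg spec (this is only the semantics, not the PR's implementation):

```python
def truncate_int(value: int, width: int) -> int:
    # Ints/longs truncate down to the nearest lower multiple of the width.
    return value - (value % width)


def truncate_str(value: str, width: int) -> str:
    # Strings truncate to at most `width` characters.
    return value[:width]


def truncate_bytes(value: bytes, width: int) -> bytes:
    # Binary truncates to at most `width` bytes (the case not yet supported by iceberg-rust).
    return value[:width]


assert truncate_int(7, 2) == 6
assert truncate_int(-1, 10) == -10
assert truncate_str("iceberg", 3) == "ice"
assert truncate_bytes(b"\x01\x02\x03", 2) == b"\x01\x02"
```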
```python
        # mixed with non-identity is not supported
        (
            PartitionSpec(
                PartitionField(source_id=4, field_id=1001, transform=BucketTransform(2), name="int_bucket"),
                PartitionField(source_id=1, field_id=1002, transform=IdentityTransform(), name="bool"),
            )
        ),
```
is this case supported now?
pyiceberg/transforms.py (Outdated)
```python
    def _pyiceberg_transform_wrapper(
        self, transform_func: Callable[["ArrayLike", Any], "ArrayLike"], *args: Any
    ) -> Callable[["ArrayLike"], "ArrayLike"]:
        import pyarrow as pa
```
Suggested change:
```diff
-        import pyarrow as pa
+        try:
+            import pyarrow as pa
+        except ModuleNotFoundError as e:
+            raise ModuleNotFoundError("For bucket/truncate transforms, PyArrow needs to be installed") from e
```
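A standalone sketch of how such a wrapper could combine the guarded import with per-chunk application (simplified from the signature quoted above; not the PR's exact code):

```python
from typing import Any, Callable


def pyiceberg_transform_wrapper(transform_func: Callable[..., Any], *args: Any) -> Callable[[Any], Any]:
    # Import lazily so pyarrow stays an optional dependency, failing with a clear message otherwise.
    try:
        import pyarrow as pa
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError("For bucket/truncate transforms, PyArrow needs to be installed") from e

    def _transform(array: Any) -> Any:
        # ChunkedArrays are transformed chunk by chunk and re-assembled.
        if isinstance(array, pa.ChunkedArray):
            return pa.chunked_array([transform_func(chunk, *args) for chunk in array.chunks])
        return transform_func(array, *args)

    return _transform
```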
Sorry for the long wait, I think this one got buried in my mailbox. I left one minor nit, this looks great @sungwy 🙌
No problem @Fokko - I'm just coming back from holidays myself. And thank you for taking another round of reviews, @kevinjqliu! I'll address the nits and retrigger the CI now that the Spark artifact issue has been fixed.

Thanks @sungwy, and I hope you had some great time off :)
Getting the PR ready for when `pyiceberg_core` is released from `iceberg-rust`. PR to introduce the python binding release: apache/iceberg-rust#705

Fixes: #1074

Consideration: we could replace the existing `pyarrow` dependency for the order-preserving transforms (`Month`, `Year`, `Date`) with `pyiceberg_core` for consistency.
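A rough sketch of the kind of write this PR enables (catalog name, namespace, and schema below are placeholders, not taken from the PR):

```python
import pyarrow as pa

from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import BucketTransform
from pyiceberg.types import LongType, NestedField, StringType

# Placeholder catalog; configure your own catalog name and properties.
catalog = load_catalog("default")

schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=False),
    NestedField(field_id=2, name="name", field_type=StringType(), required=False),
)

# Partition by bucket(16) on `id`; a TruncateTransform spec works the same way on write.
spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=BucketTransform(16), name="id_bucket"),
)

tbl = catalog.create_table("examples.bucketed_table", schema=schema, partition_spec=spec)

# Appending a pyarrow table now fans rows out into the bucket partitions.
tbl.append(pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]}))
```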