feat: Support Bucket and Truncate transforms on write #1345
Conversation
Force-pushed from 560ba20 to bd80f39 (compare)
kevinjqliu left a comment
LGTM! Great to have writes for all the different transformations!
```python
@pytest.mark.parametrize(
    "spec, expected_rows",
    [
        # none of non-identity is supported
```
Suggested change:
```diff
-        # none of non-identity is supported
```
```python
    source_type: PrimitiveType,
    input_arr: Union[pa.Array, pa.ChunkedArray],
    expected: Union[pa.Array, pa.ChunkedArray],
    num_buckets: int,
```
nit: what do you think of reordering these for readability? `num_buckets`, `source_type` and `input_arr` are configs of the `BucketTransform`; `expected` is the output.
Hmm, I'm fairly indifferent here - there's something nice about having the input and expected arrays side by side.
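For context on what `expected` holds in these cases: a bucket transform maps each value to `murmur3_hash(value) % num_buckets`. A minimal sketch using pyiceberg's plain-Python transform API (not the pyarrow path under test here; bucket ids are printed rather than asserted, since they come from the Iceberg hash spec):

```python
from pyiceberg.transforms import BucketTransform
from pyiceberg.types import IntegerType, StringType

bucket = BucketTransform(16)

# transform(source_type) returns a callable mapping a value to its bucket id in [0, 16)
int_bucket = bucket.transform(IntegerType())
str_bucket = bucket.transform(StringType())

print(int_bucket(34))         # bucket id for the int 34
print(str_bucket("iceberg"))  # bucket id for the string "iceberg"
```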
kevinjqliu left a comment
LGTM, thanks for rebasing the PR! I left a few questions on the tests.
We're currently blocked in CI since Spark 3.5.3 was removed from https://dlcdn.apache.org/spark/
```python
        (PartitionSpec(PartitionField(source_id=4, field_id=1001, transform=TruncateTransform(2), name="int_trunc"))),
        (PartitionSpec(PartitionField(source_id=5, field_id=1001, transform=TruncateTransform(2), name="long_trunc"))),
        (PartitionSpec(PartitionField(source_id=2, field_id=1001, transform=TruncateTransform(2), name="string_trunc"))),
        (PartitionSpec(PartitionField(source_id=11, field_id=1001, transform=TruncateTransform(2), name="binary_trunc"))),
```
should we include binary_trunc too?
Yeah, good question. Truncating binary isn't supported in iceberg-rust, so I've excluded this test case for now: https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/transform/truncate.rs#L132-L164
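For reference, a plain-Python sketch of what `truncate(W)` does for the types in this test, following the Iceberg spec (this is only the semantics, not the PR's implementation):

```python
def truncate_int(value: int, width: int) -> int:
    # Ints/longs truncate down to the nearest lower multiple of the width.
    return value - (value % width)


def truncate_str(value: str, width: int) -> str:
    # Strings truncate to at most `width` characters.
    return value[:width]


def truncate_bytes(value: bytes, width: int) -> bytes:
    # Binary truncates to at most `width` bytes (the case not yet supported by iceberg-rust).
    return value[:width]


assert truncate_int(7, 2) == 6
assert truncate_int(-1, 10) == -10
assert truncate_str("iceberg", 3) == "ice"
assert truncate_bytes(b"\x01\x02\x03", 2) == b"\x01\x02"
```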
```python
        # mixed with non-identity is not supported
        (
            PartitionSpec(
                PartitionField(source_id=4, field_id=1001, transform=BucketTransform(2), name="int_bucket"),
                PartitionField(source_id=1, field_id=1002, transform=IdentityTransform(), name="bool"),
            )
        ),
```
is this case supported now?
pyiceberg/transforms.py (Outdated)
```python
    def _pyiceberg_transform_wrapper(
        self, transform_func: Callable[["ArrayLike", Any], "ArrayLike"], *args: Any
    ) -> Callable[["ArrayLike"], "ArrayLike"]:
        import pyarrow as pa
```
Suggested change:
```diff
-        import pyarrow as pa
+        try:
+            import pyarrow as pa
+        except ModuleNotFoundError as e:
+            raise ModuleNotFoundError("For bucket/truncate transforms, PyArrow needs to be installed") from e
```
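A standalone sketch of how such a wrapper could combine the guarded import with per-chunk application (simplified from the signature quoted above; not the PR's exact code):

```python
from typing import Any, Callable


def pyiceberg_transform_wrapper(transform_func: Callable[..., Any], *args: Any) -> Callable[[Any], Any]:
    # Import lazily so pyarrow stays an optional dependency, failing with a clear message otherwise.
    try:
        import pyarrow as pa
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError("For bucket/truncate transforms, PyArrow needs to be installed") from e

    def _transform(array: Any) -> Any:
        # ChunkedArrays are transformed chunk by chunk and re-assembled.
        if isinstance(array, pa.ChunkedArray):
            return pa.chunked_array([transform_func(chunk, *args) for chunk in array.chunks])
        return transform_func(array, *args)

    return _transform
```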
Sorry for the long wait, I think this one got buried in my mailbox. I left one minor nit, this looks great @sungwy 🙌
No problem @Fokko - I'm just coming back from holidays myself. And thank you for taking another round of reviews, @kevinjqliu! I'll address the nits and retrigger the CI now that the Spark artifact issue has been fixed.

Thanks @sungwy, and I hope you had some great time off :)
Getting the PR ready for when `pyiceberg_core` is released from `iceberg-rust`. PR to introduce the python binding release: apache/iceberg-rust#705

Fixes: #1074

Consideration: we could replace the existing `pyarrow` dependency for the order-preserving transforms (`Month`, `Year`, `Date`) with `pyiceberg_core` for consistency.
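A rough sketch of the kind of write this PR enables (catalog name, namespace, and schema below are placeholders, not taken from the PR):

```python
import pyarrow as pa

from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import BucketTransform
from pyiceberg.types import LongType, NestedField, StringType

# Placeholder catalog; configure your own catalog name and properties.
catalog = load_catalog("default")

schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=False),
    NestedField(field_id=2, name="name", field_type=StringType(), required=False),
)

# Partition by bucket(16) on `id`; a TruncateTransform spec works the same way on write.
spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=BucketTransform(16), name="id_bucket"),
)

tbl = catalog.create_table("examples.bucketed_table", schema=schema, partition_spec=spec)

# Appending a pyarrow table now fans rows out into the bucket partitions.
tbl.append(pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]}))
```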