Expose Avro reader to PyIceberg #1328

Fokko · 2025-05-14T13:43:42Z

Which issue does this PR close?

I've been looking into exposing the Avro readers to PyIceberg. This would be a huge benefit to PyIceberg because we could drop the Cython Avro reader.

What changes are included in this PR?

Exposes methods and structures to read manifest lists and the manifests themselves.

Are these changes tested?

By using them in PyIceberg :)

sdd · 2025-05-14T18:27:12Z

bindings/python/src/manifest.rs

+        // I don't fully comprehend the deserializer here,
+        // it works for a Type, but not for a StructType
+        // So I had to do some awkward stuff to make it work
+        let res: Result<Type, _> = serde_json::from_str(json);


Do you have an example of the JSON input that fails deserialization into a StructType? If so I'll see what I can do

Thanks @sdd for jumping in here 👍

I would expect the following to work:

Suggested change

let res: Result<Type, _> = serde_json::from_str(json);

let res = serde_json::from_str::<StructType>(json);

I was also able to reproduce this in a unit test:

#[test]
fn empty_struct_type() {
    let json = r#"{"type": "struct", "fields": []}"#;
    let expected = StructType {
        fields: vec![],
        id_lookup: OnceLock::default(),
        name_lookup: OnceLock::default(),
    };
    let res = serde_json::from_str::<StructType>(json).unwrap();
    assert_eq!(res, expected);
}

But it looks like we need to wrap it in the Type enum.
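The workaround described here, deserializing into the Type enum and then pulling out the struct variant, can be sketched without serde. The types below are simplified stand-ins rather than the actual iceberg-rust definitions, and `into_struct` is a hypothetical helper name used only for illustration.

```rust
// Simplified stand-ins for iceberg-rust's `Type` and `StructType`
// (hypothetical shapes, for illustration only).
#[derive(Debug, PartialEq)]
pub struct StructType {
    pub field_names: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum Type {
    Primitive(String),
    Struct(StructType),
}

/// Unwrap the struct variant after deserializing into the `Type` enum,
/// returning an error for any other variant.
pub fn into_struct(ty: Type) -> Result<StructType, String> {
    match ty {
        Type::Struct(s) => Ok(s),
        other => Err(format!("expected a struct type, got {:?}", other)),
    }
}
```

In the binding, the value passed to such a helper would come from `serde_json::from_str::<Type>(json)`; the sketch only shows the unwrapping step.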

Xuanwo · 2025-05-15T09:04:40Z

Hi @Fokko, I experimented a bit with this PR. One possible approach is to allow Python to access our structs in _serde, which map directly to the on-disk representation without any type transformation or parsing.

We could have something like this:

#[pyfunction]
pub fn read_manifest_list_v2(bs: &[u8]) -> PyManifestList {
    let reader = apache_avro::Reader::new(bs).unwrap();
    let values = apache_avro::types::Value::Array(
        reader
            .collect::<std::result::Result<Vec<apache_avro::types::Value>, _>>()
            .unwrap(),
    );
    let manifest_list = apache_avro::from_value::<_serde::ManifestListV2>(&values).unwrap();

    PyManifestList {
        inner: manifest_list,
    }
}

Or, much better, we could expose such an API directly:

#[pyfunction]
pub fn read_manifest_list_v2(bs: &[u8]) -> PyManifestList {
    PyManifestList {
        inner: ManifestList::parse_as_is(bs),
    }
}

Our current design focuses solely on Rust users, but some users may simply want to parse the file themselves and don’t want iceberg-rust to handle any transformation (such as parsing into Datum).

We could reconsider this. Perhaps we can expose these as a public API, but hide them behind a feature gate.

cc @liurenjie1024 and @sdd for ideas.

Fokko · 2025-05-15T10:08:27Z

Our current design focuses solely on Rust users, but some users may simply want to parse the file themselves and don’t want iceberg-rust to handle any transformation (such as parsing into Datum).

Yes, that makes sense to me. I think we still want Iceberg-Rust to handle some things, like setting the default values for V2 (e.g., setting 134: content to data when reading V1 metadata).

Apart from that, I think your approach is great. Curious to learn what others think.
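A minimal sketch of the kind of defaulting mentioned above, assuming a raw entry whose `content` field (id 134) is absent in V1 files. `RawDataFile` and `resolve_content` are hypothetical names, not the iceberg-rust API; the integer codes follow the Iceberg spec (0 data, 1 position deletes, 2 equality deletes).

```rust
/// Content type of a data file (field id 134 in the Iceberg spec).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataContentType {
    Data,
    PositionDeletes,
    EqualityDeletes,
}

/// Hypothetical raw entry as read from disk; V1 files carry no `content` field.
pub struct RawDataFile {
    pub content: Option<i32>,
    pub file_path: String,
}

/// Resolve the content type, defaulting to `Data` when the field is absent.
pub fn resolve_content(raw: &RawDataFile) -> Result<DataContentType, String> {
    match raw.content {
        None | Some(0) => Ok(DataContentType::Data),
        Some(1) => Ok(DataContentType::PositionDeletes),
        Some(2) => Ok(DataContentType::EqualityDeletes),
        Some(other) => Err(format!("unknown content type: {}", other)),
    }
}
```

Keeping this step on the Rust side means Python callers would see the same value for V1 and V2 files, even if the rest of the entry is passed through untouched.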

liurenjie1024 · 2025-05-19T07:07:21Z

bindings/python/src/manifest.rs

+pub struct PyLiteral {
+    inner: Literal,
+}
+
+
+#[pyclass]
+pub struct PyPrimitiveLiteral {
+    inner: PrimitiveLiteral,
+}


Should we consider having a values.rs module like we did in the core crate?

liurenjie1024 · 2025-05-19T07:48:18Z

Our current design focuses solely on Rust users, but some users may simply want to parse the file themselves and don’t want iceberg-rust to handle any transformation (such as parsing into Datum).

I'm leaning toward this approach. It also makes the API more aligned with the Python and Java implementations.

Fokko · 2025-05-20T07:32:30Z

Thanks everyone for chiming in here. Let me summarize the discussion. I think there is consensus that the callback is not ideal.

1. Supply the required information to construct the summaries:
   a. Instead of having the Fn(i32) -> Result<Option<StructType>> provider, we could pass in a HashMap<i32, StructType>. We would bind all the PartitionSpecs in PyIceberg. This is relatively straightforward, but comes at a cost when there are many PartitionSpecs (which should be okay for the majority of tables).
   b. What @kevinjqliu suggested in Expose Avro reader to PyIceberg #1328 (comment): pass in the current Schema and PartitionSpecs to Iceberg-Rust, where we can do the lazy binding on the Iceberg-Rust side.
   c. Go all the way and convert the TableMetadata to Iceberg-Rust. This is probably where we end up at some point, but it requires a lot of scaffolding.
2. Deserialize into Vec<u8> instead of a Datum, and convert the values into the actual type later. This removes the dependency on the Schema and the PartitionSpecs.

I'm leaning towards 2 since that aligns best with PyIceberg, where we can deserialize the manifest list without having to know about the schema. I would like to make sure that we have consensus before moving in a certain direction, and I'm happy to follow up on that.
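The idea of keeping bounds as raw bytes and decoding them later can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: `RawFieldSummary` and `decode_int_bound` are hypothetical names, and only the `int` case is shown, relying on Iceberg's single-value serialization storing ints as 4 little-endian bytes.

```rust
/// Hypothetical field summary that keeps bounds as raw bytes, so reading a
/// manifest list needs neither the Schema nor the PartitionSpec.
#[derive(Debug, Clone, PartialEq)]
pub struct RawFieldSummary {
    pub contains_null: bool,
    pub lower_bound: Option<Vec<u8>>,
    pub upper_bound: Option<Vec<u8>>,
}

/// Decode a bound once the caller knows the partition field is an `int`.
/// Iceberg's single-value serialization stores ints as 4 bytes, little-endian.
pub fn decode_int_bound(bytes: &[u8]) -> Result<i32, String> {
    if bytes.len() != 4 {
        return Err(format!("expected 4 bytes for an int, got {}", bytes.len()));
    }
    Ok(i32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}
```

With this shape, the reader never needs a partition type up front; the caller that already holds the PartitionSpec picks the decoder when it actually needs the typed value.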

liurenjie1024 · 2025-05-22T06:59:22Z

I'm leaning towards 2 since that aligns best with PyIceberg, where we can deserialize the manifest list without having to know about the schema. I would like to make sure that we have consensus before moving in a certain direction, and I'm happy to follow up on that.

+1

kevinjqliu · 2025-05-22T16:45:42Z

I like #2 as well. The refactor should be less effort than scaffolding between Python classes and Rust structs.

Fokko · 2025-05-22T21:05:59Z

Thanks for chiming in here. I've created PR #1369, which implements #2. PTAL

## Which issue does this PR close?

I would like to invite everyone to roast my Rust-skills in order for me to improve myself :)

Unblocks #1328

This aligns closely with PyIceberg and Java and greatly simplifies the use of the Avro readers in PyIceberg. Otherwise, we would need to update public APIs.

## What changes are included in this PR?

## Are these changes tested?

Co-authored-by: Kevin Liu

Xuanwo · 2025-05-30T01:49:00Z

#1369 has been merged, so maybe we can remove the callback now.

Fokko · 2025-09-01T20:34:32Z

@Xuanwo @liurenjie1024 @kevinjqliu This is ready for another round of reviews :)

kevinjqliu · 2025-09-17T03:05:59Z

I pushed a fix for CI running on Windows.

The proper fix requires changing how paths are handled by PyIceberg's PyArrowFileIO class. This is an issue on the PyIceberg side too when running tests/utils/test_manifest.py::test_read_manifest_entry on Windows. I'll open an issue on the PyIceberg side with more details.

Fokko · 2025-09-17T07:14:05Z

@kevinjqliu Thanks for pushing the fix! I would expect that Windows will still work against an object store, but it cannot handle the c:/ prefix.
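One way to sidestep a bare `c:/` prefix is to hand the reader a `file://` URI instead of a raw local path. The function below is a rough sketch of that conversion, not the fix pushed in this PR: `to_file_uri` is a hypothetical helper, it assumes absolute paths, and it does no percent-encoding.

```rust
/// Turn an absolute local path into a `file://` URI.
/// Simplified sketch: assumes absolute paths and does no percent-encoding.
pub fn to_file_uri(path: &str) -> String {
    // Normalize Windows separators so `C:\data\x` becomes `C:/data/x`.
    let normalized = path.replace('\\', "/");
    if normalized.starts_with("file://") {
        // Already a file URI; return it unchanged.
        normalized
    } else if normalized.starts_with('/') {
        // POSIX absolute path: `/tmp/x` -> `file:///tmp/x`.
        format!("file://{}", normalized)
    } else {
        // Drive-letter path: `C:/data/x` -> `file:///C:/data/x`.
        format!("file:///{}", normalized)
    }
}
```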

Fokko · 2025-09-17T07:14:27Z

@Xuanwo @liurenjie1024 CI is now green 💚

Xuanwo

Thank you for working on this, let's move!

Fokko added 5 commits May 13, 2025 02:25

WIP

7249542

Merge branch 'main' of github.com:apache/iceberg-rust

0260aa4

Expose Avro parsers in Python

cff3d2b

Merge branch 'main' of github.com:apache/iceberg-rust into fd-avro-py…

ee6aeda

…iceberg

Cleanup

fb44a0a

Xuanwo reviewed May 14, 2025

sdd reviewed May 14, 2025

sdd reviewed May 14, 2025

sdd reviewed May 14, 2025

Thanks Scott!

9bc9baf

Fokko mentioned this pull request May 14, 2025

Use Iceberg-Rust for parsing the ManifestList and Manifests apache/iceberg-python#2004

Draft

Merge branch 'main' of github.com:apache/iceberg-rust into fd-avro-py…

24b02e3

…iceberg

liurenjie1024 reviewed May 19, 2025

Fokko mentioned this pull request May 22, 2025

Change FieldSummary {upper,lower}_bound to ByteBuf #1369

Merged

Merge branch 'main' into fd-avro-pyiceberg

d02aff8

Less is more

7c63887

Fokko force-pushed the fd-avro-pyiceberg branch 5 times, most recently from 0fae964 to 51b3f97 Compare June 1, 2025 21:37

emkornfield and others added 5 commits September 1, 2025 11:06

defaults belong in serde

6ca3f1e

remove whitespace

73bae07

Merge branch 'main' of github.com:apache/iceberg-rust into fd-avro-py…

df15c22

…iceberg

Cleanup

3c661f4

Cleanup

248cc4b

Fokko added 7 commits September 1, 2025 22:52

Move PyDataFile to a separate file

589f480

Bump Avro because of better error message

66019f3

Merge branch 'main' of github.com:apache/iceberg-rust into fd-avro-py…

19fe632

…iceberg

WIP

53b833f

Make teh linter happy

d52148d

Merge branch 'main' of github.com:apache/iceberg-rust into fd-avro-py…

350cbc1

…iceberg

Merge branch 'main' into fd-avro-pyiceberg

5549a1f

kevinjqliu self-requested a review September 16, 2025 01:57

Fokko added 3 commits September 16, 2025 14:37

Fix faulty testconf

d5f671c

Merge branch 'fd-avro-pyiceberg' of github.com:Fokko/iceberg-rust int…

1153a2a

…o fd-avro-pyiceberg

Make ruff happy

2536d0c

Fokko force-pushed the fd-avro-pyiceberg branch from 6509e21 to c6bbcb5 Compare September 16, 2025 13:31

Yikes!

628c782

Fokko force-pushed the fd-avro-pyiceberg branch from c6bbcb5 to 628c782 Compare September 16, 2025 14:02

Fokko and others added 2 commits September 16, 2025 17:31

Make tests happy

7f6e579

pass path with file:// scheme

514e8f1

Merge branch 'main' into fd-avro-pyiceberg

c2be1fc

Fokko added this to the 0.7.0 release milestone Sep 17, 2025

Fokko mentioned this pull request Sep 17, 2025

Tracking issues of Iceberg Rust 0.7 Release #1631

Closed

11 tasks

Xuanwo approved these changes Sep 17, 2025

Xuanwo merged commit bad8e4e into apache:main Sep 17, 2025
17 checks passed

7 participants
