Conversation

@mbutrovich (Contributor) commented Nov 3, 2025

Which issue does this PR close?

Partially addresses #1749.

What changes are included in this PR?

This PR adds partition spec handling to FileScanTask and RecordBatchTransformer to correctly implement the Iceberg spec's "Column Projection" rules for fields "not present" in data files.

Problem Statement

Prior to this PR, iceberg-rust's FileScanTask had no mechanism to pass partition information to RecordBatchTransformer, causing two issues:

  1. Incorrect handling of bucket partitioning: The reader couldn't distinguish identity transforms (which should use partition metadata constants) from non-identity transforms like bucket/truncate/year/month (which must be read from the data file). For example, bucket(4, id) stores id_bucket = 2 (the bucket number) in partition metadata, but the actual id values (100, 200, 300) exist only in the data file. iceberg-rust was incorrectly treating bucket-partitioned source columns as constants, breaking runtime filtering and returning incorrect query results.

  2. Field ID conflicts in add_files scenarios: When importing Hive tables via add_files, partition columns can have field IDs that conflict with Parquet data columns. Example: the Parquet file has field_id=1→"name", but Iceberg expects field_id=1→"id" (a partition column). Per the spec, the correct field is "not present" in the data file and requires the name mapping fallback.

Iceberg Specification Requirements

Per the Iceberg spec (https://iceberg.apache.org/spec/#column-projection), when a field ID is "not present" in a data file, it must be resolved using these rules:

  1. Return the value from partition metadata if an Identity Transform exists
  2. Use schema.name-mapping.default metadata to map field id to columns without field id
  3. Return the default value if it has a defined initial-default
  4. Return null in all other cases

Why this matters:

  • Identity transforms (e.g., identity(dept)) store actual column values in partition metadata, so they can be used as constants without reading the data file
  • Non-identity transforms (e.g., bucket(4, id), day(timestamp)) store transformed values in partition metadata (e.g., bucket number 2, not the actual id values 100, 200, 300), so their source columns must be read from the data file (see the sketch below)
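
A minimal sketch of these rules, assuming simplified stand-in types (ColumnSource, Literal, and the lookup tables here are illustrative, not the actual iceberg-rust API):

use std::collections::HashMap;

// Stand-in for a real literal value type.
type Literal = String;

// Where a projected column's values come from.
enum ColumnSource {
    // Read the column from the data file at this index.
    DataFile { column_index: usize },
    // Materialize a constant (partition value, initial-default, or null).
    Constant(Option<Literal>),
}

// Spec-literal resolution for one field id: try the file first, then
// apply the four "not present" rules in order.
fn resolve_field(
    field_id: i32,
    file_columns_by_id: &HashMap<i32, usize>,   // field id -> column index in file
    identity_constants: &HashMap<i32, Literal>, // identity-transform partition values
    name_mapped_index: Option<usize>,           // schema.name-mapping.default fallback
    initial_default: Option<Literal>,
) -> ColumnSource {
    if let Some(idx) = file_columns_by_id.get(&field_id) {
        return ColumnSource::DataFile { column_index: *idx }; // present in the file
    }
    if let Some(v) = identity_constants.get(&field_id) {
        return ColumnSource::Constant(Some(v.clone())); // rule 1: identity partition value
    }
    if let Some(idx) = name_mapped_index {
        return ColumnSource::DataFile { column_index: idx }; // rule 2: name mapping
    }
    if let Some(v) = initial_default {
        return ColumnSource::Constant(Some(v)); // rule 3: initial-default
    }
    ColumnSource::Constant(None) // rule 4: null
}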

Changes Made

  1. Added partition fields to FileScanTask (scan/task.rs):
  • partition: Option<Struct> - Partition data from the manifest entry
  • partition_spec: Option<Arc<PartitionSpec>> - For transform-aware constant detection
  • name_mapping: Option<Arc<NameMapping>> - Name mapping from table metadata
  2. Implemented constants_map() function (arrow/record_batch_transformer.rs):
  • Replicates Java's PartitionUtil.constantsMap() behavior (see the sketch after this list)
  • Only includes fields whose transform is Transform::Identity
  • Used to determine which fields take constants from partition metadata vs. being read from data files
  3. Enhanced RecordBatchTransformer (arrow/record_batch_transformer.rs):
  • Added a build_with_partition_data() method to accept partition spec, partition data, and name mapping
  • Implements all 4 spec rules for column resolution with identity-transform awareness
  • Detects field ID conflicts by verifying that both field ID AND name match
  • Falls back to name mapping when field IDs are missing or conflicting (spec rule #2)
  4. Updated ArrowReader (arrow/reader.rs):
  • Uses build_with_partition_data() when partition information is available
  • Falls back to build() when it is not
  5. Updated manifest entry processing (scan/context.rs):
  • Populates the partition fields in FileScanTask from manifest entry data
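
As referenced in item 2 above, a minimal sketch of the constants_map() idea, again using stand-in types rather than the real PartitionSpec and Struct API:

use std::collections::HashMap;

type Literal = String; // stand-in, as in the earlier sketch

#[derive(PartialEq)]
enum Transform { Identity, Bucket(u32), Year, Month, Day, Hour }

struct PartitionField {
    source_field_id: i32, // field id of the source column in the table schema
    transform: Transform,
}

// Mirrors Java's PartitionUtil.constantsMap(): only identity-transformed
// fields contribute constants, because only they store the actual column
// value (rather than a derived value like a bucket number) in partition metadata.
fn constants_map(
    spec_fields: &[PartitionField],
    partition_values: &[Option<Literal>], // one value per partition field
) -> HashMap<i32, Literal> {
    spec_fields
        .iter()
        .zip(partition_values)
        .filter(|(field, _)| field.transform == Transform::Identity)
        .filter_map(|(field, value)| value.clone().map(|v| (field.source_field_id, v)))
        .collect()
}

For bucket(4, id), the id field is filtered out entirely, so resolution falls through to reading id from the data file, which is exactly the bucket-partitioning fix described above.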

Tests Added

  1. bucket_partitioning_reads_source_column_from_file - Verifies that bucket-partitioned source columns are read from data files (not treated as constants from partition metadata)

  2. identity_partition_uses_constant_from_metadata - Verifies that identity-transformed fields correctly use partition metadata constants

  3. test_bucket_partitioning_with_renamed_source_column - Verifies field-ID-based mapping works despite column rename

  4. add_files_partition_columns_without_field_ids - Verifies name mapping resolution for Hive table imports without field IDs (spec rule #2)

  5. add_files_with_true_field_id_conflict - Verifies correct field ID conflict detection with name mapping fallback (spec rule #2)

  6. test_all_four_spec_rules - Integration test verifying all 4 spec rules work together

Are these changes tested?

Yes: there are 6 new unit tests covering all 4 Iceberg spec rules. This change also resolved approximately 50 failing Iceberg Java tests when run with DataFusion Comet's experimental apache/datafusion-comet#2528 PR.

@liurenjie1024 (Contributor) left a comment

Thanks @mbutrovich for this PR, just finished the first round of review.

@mbutrovich (Contributor, Author)

Thanks for the first round of feedback @liurenjie1024! I'll take a pass this week.

@mbutrovich (Contributor, Author)

Hopefully I addressed all of your comments @liurenjie1024. The serialize_with stuff was new to me, so let me know if that's not what you had in mind. Thanks for your patience!

@mbutrovich force-pushed the partition-spec-support branch from d72e629 to 37b1513 on November 5, 2025 at 12:01
@liurenjie1024 (Contributor)

> Hopefully I addressed all of your comments @liurenjie1024. The serialize_with stuff was new to me, so let me know if that's not what you had in mind. Thanks for your patience!

Thanks, I'll take a look today.

@liurenjie1024 (Contributor) left a comment

Thanks @mbutrovich for this PR! I think it mostly LGTM; we just need adjustment to follow the rule:

#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(serialize_with = "serialize_not_implemented")]
#[serde(deserialize_with = "deserialize_not_implemented")]
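
Applied to one of the new fields, that rule would look roughly like this (a self-contained sketch; the stub bodies and the surrounding struct are illustrative, not the actual iceberg-rust definitions):

use serde::{Deserialize, Deserializer, Serialize, Serializer};
use std::sync::Arc;

struct PartitionSpec; // stand-in for the real type

// Stubs that fail loudly if the field is ever (de)serialized.
fn serialize_not_implemented<S: Serializer, T>(_: &T, _: S) -> Result<S::Ok, S::Error> {
    Err(serde::ser::Error::custom("not implemented"))
}
fn deserialize_not_implemented<'de, D: Deserializer<'de>, T>(_: D) -> Result<T, D::Error> {
    Err(serde::de::Error::custom("not implemented"))
}

#[derive(Serialize, Deserialize)]
struct FileScanTaskSketch {
    data_file_path: String,
    // Skipped when None on write; defaulted when absent on read.
    #[serde(default)]
    #[serde(skip_serializing_if = "Option::is_none")]
    #[serde(serialize_with = "serialize_not_implemented")]
    #[serde(deserialize_with = "deserialize_not_implemented")]
    partition_spec: Option<Arc<PartitionSpec>>,
}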
@liurenjie1024 (Contributor) commented on a diff:

It's not related to this PR, but this is why I don't like pub fields in structs. Adding a field requires changing a lot of unrelated things, and it's also error prone, since this partition spec is supposed to be the one associated with the data file, not the table's default partition spec.

|(source_field, source_index)| {
    let name_matches = source_field.name() == &iceberg_field.name;

    if name_mapping.is_some() && !name_matches {
@liurenjie1024 (Contributor) commented on the diff above:

Do we need this check? I think if we already found a field by id, then we should just use this column.

@mbutrovich (Contributor, Author) replied:

Please see below.

// 3. "Return the default value if it has a defined initial-default"
// 4. "Return null in all other cases"

let column_source = if let Some(constant_value) = constants_map.get(field_id) {
@liurenjie1024 (Contributor) commented on the diff above:

This should not be the first step. According to the projection rule, this check only happens after lookup by id has failed.

@mbutrovich (Contributor, Author) replied:

Please see below.

@mbutrovich (Contributor, Author) commented Nov 6, 2025

> Thanks @mbutrovich for this PR! I think it mostly LGTM; we just need adjustment to follow the rule

I'll take another pass today, thanks for the further comments! It's a non-trivial change and I'm still learning my way around the codebase and spec, so I appreciate your patience.

@mbutrovich closed this Nov 6, 2025
@mbutrovich reopened this Nov 6, 2025
@mbutrovich marked this pull request as draft on November 6, 2025 at 15:23
@mbutrovich marked this pull request as ready for review on November 6, 2025 at 23:29
@mbutrovich (Contributor, Author) commented Nov 6, 2025

I apologize: I kept posting comments as I thought I understood what was happening, then I'd second-guess myself and delete the comment, so I just set the PR to draft while I investigated. As always, I appreciate your patience. Basically, all of these changes have to do with Iceberg Java's TestAddFilesProcedure suite.

Here's what I can summarize from my findings today:

> Do we need this check? I think if we already found a field by id, then we should just use this column.

You're right that we normally trust field IDs, but we do need the name check when name_mapping is present. In add_files scenarios (like Iceberg Java's TestAddFilesProcedure.addDataPartitioned), Parquet field IDs can conflict with Iceberg field IDs:

  • Parquet: field_id=1→"name", field_id=2→"dept"
  • Iceberg: field_id=1→"id", field_id=2→"name"

Without name checking, when looking for Iceberg field_id=2 ("name"), we'd find Parquet field_id=2 ("dept") and read the wrong column.

To fix this, we only check names when name_mapping is present (which indicates potential conflicts). Without name_mapping, a name mismatch is just a column rename, so we trust the field ID.
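
Roughly, that check reduces to the following (a simplified sketch; the function and parameter names are illustrative, not the code in the PR):

// Decide whether a field-id match in the Parquet schema can be trusted.
fn trust_field_id_match(
    parquet_field_name: &str,
    iceberg_field_name: &str,
    name_mapping_present: bool,
) -> bool {
    if !name_mapping_present {
        // No name mapping: a name mismatch is just a column rename, so trust the id.
        return true;
    }
    // Name mapping present (e.g. after add_files): a mismatch signals that the
    // Parquet field ids conflict with the Iceberg schema, so trust the id only
    // when the names also agree; otherwise fall back to name mapping.
    parquet_field_name == iceberg_field_name
}

In the example above, looking up Iceberg field_id=2 ("name") finds Parquet field_id=2 ("dept"); the names disagree and a name mapping is present, so the id match is rejected and resolution falls through to the name mapping.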

> This should not be the first step. According to the projection rule, this check only happens after lookup by id has failed.

You're right that the spec says to check these rules when a field is "not present." However, Java checks partition constants before Parquet field IDs (BaseParquetReaders.java:299), and this is intentional. In add_files scenarios, partition columns can exist in both the Parquet file and the partition metadata. The partition metadata is authoritative: it defines which partition the file belongs to. If we checked Parquet first, we'd read the wrong values.

The spec's intent is that identity-partitioned fields are "not present" in data files by definition, even if they physically exist in the file.

This design was the only way I could get all of the tests in Iceberg Java's test suite to pass. The subtlety seems to be that the spec is not totally clear on what to do when metadata conflicts between Iceberg and Parquet after a migration or schema change, so I chose to treat invalid metadata as "not present," as Java does.
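
Concretely, the implemented order differs from the spec-literal sketch earlier only in that the constants map is consulted before the field-id lookup (a simplified sketch, reusing the ColumnSource and Literal stand-ins from above):

use std::collections::HashMap;

// Java-style order: identity-partition constants win even when the column
// also physically exists in the data file (common after add_files).
fn resolve_field_java_order(
    field_id: i32,
    identity_constants: &HashMap<i32, Literal>,
    file_columns_by_id: &HashMap<i32, usize>,
) -> ColumnSource {
    if let Some(v) = identity_constants.get(&field_id) {
        return ColumnSource::Constant(Some(v.clone())); // partition metadata is authoritative
    }
    if let Some(idx) = file_columns_by_id.get(&field_id) {
        return ColumnSource::DataFile { column_index: *idx };
    }
    // ...then name mapping, initial-default, and null, as in the earlier sketch.
    ColumnSource::Constant(None)
}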

Please feel free to let me know if I misunderstood. I tried to write a test that describes the scenario and to add comments explaining why the design is the way it is.
