Conversation

@mbutrovich (Contributor) commented Nov 3, 2025

Which issue does this PR close?

Partially addresses #1749.

What changes are included in this PR?

This PR adds partition spec handling to FileScanTask and RecordBatchTransformer to correctly implement the Iceberg spec's "Column Projection" rules for fields "not present" in data files.

Problem Statement

Prior to this PR, iceberg-rust's FileScanTask had no mechanism to pass partition information to RecordBatchTransformer, causing two issues:

  1. Incorrect handling of bucket partitioning: The reader couldn't distinguish identity transforms (which should use partition metadata constants) from non-identity transforms like bucket/truncate/year/month (which must be read from the data file). For example, bucket(4, id) stores id_bucket = 2 (the bucket number) in partition metadata, but the actual id values (100, 200, 300) exist only in the data file. iceberg-rust was incorrectly treating bucket-partitioned source columns as constants, breaking runtime filtering and returning incorrect query results.

  2. Field ID conflicts in add_files scenarios: When importing Hive tables via add_files, partition columns can have field IDs that conflict with Parquet data columns. Example: the Parquet file has field_id=1→"name", but Iceberg expects field_id=1→"id" (a partition column). Per the spec, the correct field is "not present" in the data file and requires the name mapping fallback.

Iceberg Specification Requirements

Per the Iceberg spec (https://iceberg.apache.org/spec/#column-projection), when a field ID is "not present" in a data file, it must be resolved using these rules:

  1. Return the value from partition metadata if an Identity Transform exists
  2. Use schema.name-mapping.default metadata to map field id to columns without field id
  3. Return the default value if it has a defined initial-default
  4. Return null in all other cases

Why this matters:

  • Identity transforms (e.g., identity(dept)) store actual column values in partition metadata, so they can be used as constants without reading the data file
  • Non-identity transforms (e.g., bucket(4, id), day(timestamp)) store transformed values in partition metadata (e.g., bucket number 2, not the actual id values 100, 200, 300), so their source columns must be read from the data file (see the sketch below)
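
A minimal sketch of these rules, assuming simplified stand-in types (ColumnSource, Literal, and the lookup tables here are illustrative, not the actual iceberg-rust API):

use std::collections::HashMap;

// Stand-in for a real literal value type.
type Literal = String;

// Where a projected column's values come from.
enum ColumnSource {
    // Read the column from the data file at this index.
    DataFile { column_index: usize },
    // Materialize a constant (partition value, initial-default, or null).
    Constant(Option<Literal>),
}

// Spec-literal resolution for one field id: try the file first, then
// apply the four "not present" rules in order.
fn resolve_field(
    field_id: i32,
    file_columns_by_id: &HashMap<i32, usize>,   // field id -> column index in file
    identity_constants: &HashMap<i32, Literal>, // identity-transform partition values
    name_mapped_index: Option<usize>,           // schema.name-mapping.default fallback
    initial_default: Option<Literal>,
) -> ColumnSource {
    if let Some(idx) = file_columns_by_id.get(&field_id) {
        return ColumnSource::DataFile { column_index: *idx }; // present in the file
    }
    if let Some(v) = identity_constants.get(&field_id) {
        return ColumnSource::Constant(Some(v.clone())); // rule 1: identity partition value
    }
    if let Some(idx) = name_mapped_index {
        return ColumnSource::DataFile { column_index: idx }; // rule 2: name mapping
    }
    if let Some(v) = initial_default {
        return ColumnSource::Constant(Some(v)); // rule 3: initial-default
    }
    ColumnSource::Constant(None) // rule 4: null
}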

Changes Made

  1. Added partition fields to FileScanTask (scan/task.rs):
  • partition: Option<Struct> - Partition data from the manifest entry
  • partition_spec: Option<Arc<PartitionSpec>> - For transform-aware constant detection
  • name_mapping: Option<Arc<NameMapping>> - Name mapping from table metadata
  2. Implemented constants_map() function (arrow/record_batch_transformer.rs):
  • Replicates Java's PartitionUtil.constantsMap() behavior (see the sketch after this list)
  • Only includes fields whose transform is Transform::Identity
  • Used to determine which fields take constants from partition metadata vs. being read from data files
  3. Enhanced RecordBatchTransformer (arrow/record_batch_transformer.rs):
  • Added a build_with_partition_data() method to accept partition spec, partition data, and name mapping
  • Implements all 4 spec rules for column resolution with identity-transform awareness
  • Detects field ID conflicts by verifying that both field ID AND name match
  • Falls back to name mapping when field IDs are missing or conflicting (spec rule #2)
  4. Updated ArrowReader (arrow/reader.rs):
  • Uses build_with_partition_data() when partition information is available
  • Falls back to build() when it is not
  5. Updated manifest entry processing (scan/context.rs):
  • Populates the partition fields in FileScanTask from manifest entry data
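
As referenced in item 2 above, a minimal sketch of the constants_map() idea, again using stand-in types rather than the real PartitionSpec and Struct API:

use std::collections::HashMap;

type Literal = String; // stand-in, as in the earlier sketch

#[derive(PartialEq)]
enum Transform { Identity, Bucket(u32), Year, Month, Day, Hour }

struct PartitionField {
    source_field_id: i32, // field id of the source column in the table schema
    transform: Transform,
}

// Mirrors Java's PartitionUtil.constantsMap(): only identity-transformed
// fields contribute constants, because only they store the actual column
// value (rather than a derived value like a bucket number) in partition metadata.
fn constants_map(
    spec_fields: &[PartitionField],
    partition_values: &[Option<Literal>], // one value per partition field
) -> HashMap<i32, Literal> {
    spec_fields
        .iter()
        .zip(partition_values)
        .filter(|(field, _)| field.transform == Transform::Identity)
        .filter_map(|(field, value)| value.clone().map(|v| (field.source_field_id, v)))
        .collect()
}

For bucket(4, id), the id field is filtered out entirely, so resolution falls through to reading id from the data file, which is exactly the bucket-partitioning fix described above.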

Tests Added

  1. bucket_partitioning_reads_source_column_from_file - Verifies that bucket-partitioned source columns are read from data files (not treated as constants from partition metadata)

  2. identity_partition_uses_constant_from_metadata - Verifies that identity-transformed fields correctly use partition metadata constants

  3. test_bucket_partitioning_with_renamed_source_column - Verifies field-ID-based mapping works despite column rename

  4. add_files_partition_columns_without_field_ids - Verifies name mapping resolution for Hive table imports without field IDs (spec rule #2)

  5. add_files_with_true_field_id_conflict - Verifies correct field ID conflict detection with name mapping fallback (spec rule #2)

  6. test_all_four_spec_rules - Integration test verifying all 4 spec rules work together

Are these changes tested?

Yes: there are 6 new unit tests covering all 4 Iceberg spec rules. This change also resolved approximately 50 failing Iceberg Java tests when run with DataFusion Comet's experimental apache/datafusion-comet#2528 PR.

@liurenjie1024 (Contributor) left a comment

Thanks @mbutrovich for this PR, just finished the first round of review.

@mbutrovich (Contributor, Author)

Thanks for the first round of feedback @liurenjie1024! I'll take a pass this week.

@mbutrovich (Contributor, Author)

Hopefully I addressed all of your comments @liurenjie1024. The serialize_with stuff was new to me, so let me know if that's not what you had in mind. Thanks for your patience!

@mbutrovich force-pushed the partition-spec-support branch from d72e629 to 37b1513 on November 5, 2025 at 12:01
@liurenjie1024 (Contributor)

> Hopefully I addressed all of your comments @liurenjie1024. The serialize_with stuff was new to me, so let me know if that's not what you had in mind. Thanks for your patience!

Thanks, I'll take a look today.

@liurenjie1024 (Contributor) left a comment

Thanks @mbutrovich for this PR! I think it mostly LGTM; we just need adjustment to follow the rule:

#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(serialize_with = "serialize_not_implemented")]
#[serde(deserialize_with = "deserialize_not_implemented")]
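
Applied to one of the new fields, that rule would look roughly like this (a self-contained sketch; the stub bodies and the surrounding struct are illustrative, not the actual iceberg-rust definitions):

use serde::{Deserialize, Deserializer, Serialize, Serializer};
use std::sync::Arc;

struct PartitionSpec; // stand-in for the real type

// Stubs that fail loudly if the field is ever (de)serialized.
fn serialize_not_implemented<S: Serializer, T>(_: &T, _: S) -> Result<S::Ok, S::Error> {
    Err(serde::ser::Error::custom("not implemented"))
}
fn deserialize_not_implemented<'de, D: Deserializer<'de>, T>(_: D) -> Result<T, D::Error> {
    Err(serde::de::Error::custom("not implemented"))
}

#[derive(Serialize, Deserialize)]
struct FileScanTaskSketch {
    data_file_path: String,
    // Skipped when None on write; defaulted when absent on read.
    #[serde(default)]
    #[serde(skip_serializing_if = "Option::is_none")]
    #[serde(serialize_with = "serialize_not_implemented")]
    #[serde(deserialize_with = "deserialize_not_implemented")]
    partition_spec: Option<Arc<PartitionSpec>>,
}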
@liurenjie1024 (Contributor) commented on a diff:

It's not related to this PR, but this is why I don't like pub fields in structs. Adding a field requires changing a lot of unrelated things, and it's also error prone, since this partition spec is supposed to be the one associated with the data file, not the table's default partition spec.

|(source_field, source_index)| {
    let name_matches = source_field.name() == &iceberg_field.name;

    if name_mapping.is_some() && !name_matches {
@liurenjie1024 (Contributor) commented on the diff above:

Do we need this check? I think if we already found a field by id, then we should just use this column.

@mbutrovich (Contributor, Author) replied:

Please see below.

// 3. "Return the default value if it has a defined initial-default"
// 4. "Return null in all other cases"

let column_source = if let Some(constant_value) = constants_map.get(field_id) {
@liurenjie1024 (Contributor) commented on the diff above:

This should not be the first step. According to the projection rule, this check only happens after lookup by id has failed.

@mbutrovich (Contributor, Author) replied:

Please see below.

@mbutrovich (Contributor, Author) commented Nov 6, 2025

> Thanks @mbutrovich for this PR! I think it mostly LGTM; we just need adjustment to follow the rule

I'll take another pass today, thanks for the further comments! It's a non-trivial change and I'm still learning my way around the codebase and spec, so I appreciate your patience.

@mbutrovich closed this Nov 6, 2025
@mbutrovich reopened this Nov 6, 2025
@mbutrovich marked this pull request as draft on November 6, 2025 at 15:23
@mbutrovich marked this pull request as ready for review on November 6, 2025 at 23:29
@mbutrovich (Contributor, Author) commented Nov 6, 2025

I apologize: I kept posting comments as I thought I understood what was happening, then I'd second-guess myself and delete the comment, so I just set the PR to draft while I investigated. As always, I appreciate your patience. Basically, all of these changes have to do with Iceberg Java's TestAddFilesProcedure suite.

Here's what I can summarize from my findings today:

> Do we need this check? I think if we already found a field by id, then we should just use this column.

You're right that we normally trust field IDs, but we do need the name check when name_mapping is present. In add_files scenarios (like Iceberg Java's TestAddFilesProcedure.addDataPartitioned), Parquet field IDs can conflict with Iceberg field IDs:

  • Parquet: field_id=1→"name", field_id=2→"dept"
  • Iceberg: field_id=1→"id", field_id=2→"name"

Without name checking, when looking for Iceberg field_id=2 ("name"), we'd find Parquet field_id=2 ("dept") and read the wrong column.

To fix this, we only check names when name_mapping is present (which indicates potential conflicts). Without name_mapping, a name mismatch is just a column rename, so we trust the field ID.
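
Roughly, that check reduces to the following (a simplified sketch; the function and parameter names are illustrative, not the code in the PR):

// Decide whether a field-id match in the Parquet schema can be trusted.
fn trust_field_id_match(
    parquet_field_name: &str,
    iceberg_field_name: &str,
    name_mapping_present: bool,
) -> bool {
    if !name_mapping_present {
        // No name mapping: a name mismatch is just a column rename, so trust the id.
        return true;
    }
    // Name mapping present (e.g. after add_files): a mismatch signals that the
    // Parquet field ids conflict with the Iceberg schema, so trust the id only
    // when the names also agree; otherwise fall back to name mapping.
    parquet_field_name == iceberg_field_name
}

In the example above, looking up Iceberg field_id=2 ("name") finds Parquet field_id=2 ("dept"); the names disagree and a name mapping is present, so the id match is rejected and resolution falls through to the name mapping.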

> This should not be the first step. According to the projection rule, this check only happens after lookup by id has failed.

You're right that the spec says to check these rules when a field is "not present." However, Java checks partition constants before Parquet field IDs (BaseParquetReaders.java:299), and this is intentional. In add_files scenarios, partition columns can exist in both the Parquet file and the partition metadata. The partition metadata is authoritative: it defines which partition the file belongs to. If we checked Parquet first, we'd read the wrong values.

The spec's intent is that identity-partitioned fields are "not present" in data files by definition, even if they physically exist in the file.

This design was the only way I could get all of the tests in Iceberg Java's test suite to pass. The subtlety seems to be that the spec is not totally clear on what to do when metadata conflicts between Iceberg and Parquet after a migration or schema change, so I chose to treat invalid metadata as "not present," as Java does.
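
Concretely, the implemented order differs from the spec-literal sketch earlier only in that the constants map is consulted before the field-id lookup (a simplified sketch, reusing the ColumnSource and Literal stand-ins from above):

use std::collections::HashMap;

// Java-style order: identity-partition constants win even when the column
// also physically exists in the data file (common after add_files).
fn resolve_field_java_order(
    field_id: i32,
    identity_constants: &HashMap<i32, Literal>,
    file_columns_by_id: &HashMap<i32, usize>,
) -> ColumnSource {
    if let Some(v) = identity_constants.get(&field_id) {
        return ColumnSource::Constant(Some(v.clone())); // partition metadata is authoritative
    }
    if let Some(idx) = file_columns_by_id.get(&field_id) {
        return ColumnSource::DataFile { column_index: *idx };
    }
    // ...then name mapping, initial-default, and null, as in the earlier sketch.
    ColumnSource::Constant(None)
}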

Please feel free to let me know if I misunderstood. I tried to write a test that describes the scenario and to add comments explaining why the design is the way it is.
