test: append partition data file #742

feniljain · 2024-11-30T10:42:21Z

Issue Resolved

Closes #720

Description

Added a test case for appending a basic partitioned data file
Added a test case where the type of a partition field set in DataFileWriterBuilder differs from the one in the schema
Added a test case where the number of partition fields set in DataFileWriterBuilder differs from the schema
Fixed spelling mistakes noticed in transaction.rs
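
To make the two negative cases concrete, here is a minimal, dependency-free sketch of the kind of check they exercise. The names `PartitionValue`, `FieldType` and `validate_partition` are hypothetical stand-ins for illustration and are not the iceberg-rust API; the real validation in the crate may be structured differently.

```rust
/// Hypothetical stand-in for a partition value (not an iceberg-rust type).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum PartitionValue {
    Int(i32),
    Str(String),
    Bool(bool),
}

/// Hypothetical stand-in for the type a partition field is expected to have.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum FieldType {
    Int,
    Str,
    Bool,
}

impl PartitionValue {
    /// Returns the type tag of this value.
    fn field_type(&self) -> FieldType {
        match self {
            PartitionValue::Int(_) => FieldType::Int,
            PartitionValue::Str(_) => FieldType::Str,
            PartitionValue::Bool(_) => FieldType::Bool,
        }
    }
}

/// Checks a partition tuple against the types the partition spec expects.
/// Rejects both a wrong field count and a wrong field type.
fn validate_partition(values: &[PartitionValue], expected: &[FieldType]) -> Result<(), String> {
    if values.len() != expected.len() {
        return Err(format!(
            "expected {} partition fields, got {}",
            expected.len(),
            values.len()
        ));
    }
    for (index, (value, wanted)) in values.iter().zip(expected).enumerate() {
        if value.field_type() != *wanted {
            return Err(format!(
                "partition field {} has type {:?}, expected {:?}",
                index,
                value.field_type(),
                wanted
            ));
        }
    }
    Ok(())
}
```

With an identity partition on an int column, `validate_partition(&[PartitionValue::Int(100)], &[FieldType::Int])` passes, while a string value or an extra field is rejected.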

Output screenshots

Fokko · 2024-12-03T21:14:30Z

Thanks @feniljain for working on this 🙌 Did some checks:

First metadata.json:

{
  "format-version" : 2,
  "table-uuid" : "eb83b77f-c2c3-473c-a138-444a3de61213",
  "location" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test",
  "last-sequence-number" : 0,
  "last-updated-ms" : 1733259665987,
  "last-column-id" : 3,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "identifier-field-ids" : [ 2 ],
    "fields" : [ {
      "id" : 1,
      "name" : "foo",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "bar",
      "required" : true,
      "type" : "int"
    }, {
      "id" : 3,
      "name" : "baz",
      "required" : false,
      "type" : "boolean"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "id",
      "transform" : "identity",
      "source-id" : 2,
      "field-id" : 1000
    } ]
  } ],
  "last-partition-id" : 1000,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "write.parquet.compression-codec" : "zstd"
  },
  "current-snapshot-id" : -1,
  "refs" : { },
  "snapshots" : [ ],
  "statistics" : [ ],
  "partition-statistics" : [ ],
  "snapshot-log" : [ ],
  "metadata-log" : [ ]
}

Without a commit, the current-snapshot-id should be null instead of -1.
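
A rough sketch of the intended behaviour, assuming the snapshot id is modelled as an `Option<i64>` on the writing side. Both function names are hypothetical, and this is not how iceberg-rust actually serializes table metadata:

```rust
/// Maps the `-1` sentinel seen in the metadata above to "no snapshot".
/// Hypothetical helper for illustration only.
fn normalize_snapshot_id(raw: i64) -> Option<i64> {
    if raw == -1 {
        None
    } else {
        Some(raw)
    }
}

/// Renders an optional snapshot id as a JSON value: `null` when absent,
/// the plain number otherwise.
fn current_snapshot_id_json(id: Option<i64>) -> String {
    match id {
        Some(value) => value.to_string(),
        None => "null".to_string(),
    }
}
```

Under these assumptions, a table without a commit would be written with `"current-snapshot-id" : null`, and a committed table keeps its numeric id.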

After the commit:

{
  "format-version" : 2,
  "table-uuid" : "eb83b77f-c2c3-473c-a138-444a3de61213",
  "location" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test",
  "last-sequence-number" : 1,
  "last-updated-ms" : 1733259666572,
  "last-column-id" : 3,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "identifier-field-ids" : [ 2 ],
    "fields" : [ {
      "id" : 1,
      "name" : "foo",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "bar",
      "required" : true,
      "type" : "int"
    }, {
      "id" : 3,
      "name" : "baz",
      "required" : false,
      "type" : "boolean"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "id",
      "transform" : "identity",
      "source-id" : 2,
      "field-id" : 1000
    } ]
  } ],
  "last-partition-id" : 1000,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "write.parquet.compression-codec" : "zstd"
  },
  "current-snapshot-id" : 8826880672679595429,
  "refs" : {
    "main" : {
      "snapshot-id" : 8826880672679595429,
      "type" : "branch"
    }
  },
  "snapshots" : [ {
    "sequence-number" : 1,
    "snapshot-id" : 8826880672679595429,
    "timestamp-ms" : 1733259666572,
    "summary" : {
      "operation" : "append"
    },
    "manifest-list" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/snap-8826880672679595429-0-01938e53-a487-7ee2-a75e-c061dea0853c.avro",
    "schema-id" : 0
  } ],
  "statistics" : [ ],
  "partition-statistics" : [ ],
  "snapshot-log" : [ {
    "timestamp-ms" : 1733259666572,
    "snapshot-id" : 8826880672679595429
  } ],
  "metadata-log" : [ {
    "timestamp-ms" : 1733259665987,
    "metadata-file" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/00000-647ca34f-8a7b-4a44-8d28-775bc62ef650.metadata.json"
  } ]
}

Which looks good. The manifest list entry:

{
    "manifest_path": "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/01938e53-a487-7ee2-a75e-c061dea0853c-m0.avro",
    "manifest_length": 3391,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 1,
    "min_sequence_number": 1,
    "added_snapshot_id": 8826880672679595000,
    "added_files_count": 1,
    "existing_files_count": 0,
    "deleted_files_count": 0,
    "added_rows_count": 2,
    "existing_rows_count": 0,
    "deleted_rows_count": 0,
    "partitions": {
        "array": [
            {
                "contains_null": false,
                "contains_nan": null,
                "lower_bound": {
                    "bytes": "d\u0000\u0000\u0000"
                },
                "upper_bound": {
                    "bytes": "d\u0000\u0000\u0000"
                }
            }
        ]
    },
    "key_metadata": null
}

Which also looks good. The manifest entry:

{
    "status": 1,
    "snapshot_id": null,
    "sequence_number": null,
    "file_sequence_number": null,
    "data_file": {
        "content": 0,
        "file_path": "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/data/test-00000.parquet",
        "file_format": "PARQUET",
        "partition": {
            "id": {
                "int": 100
            }
        },
        "record_count": 2,
        "file_size_in_bytes": 1160,
        "column_sizes": {
            "array": [
                {
                    "key": 3,
                    "value": 36
                },
                {
                    "key": 1,
                    "value": 74
                },
                {
                    "key": 2,
                    "value": 55
                }
            ]
        },
        "value_counts": {
            "array": [
                {
                    "key": 1,
                    "value": 2
                },
                {
                    "key": 3,
                    "value": 2
                },
                {
                    "key": 2,
                    "value": 2
                }
            ]
        },
        "null_value_counts": {
            "array": [
                {
                    "key": 2,
                    "value": 0
                },
                {
                    "key": 3,
                    "value": 0
                },
                {
                    "key": 1,
                    "value": 0
                }
            ]
        },
        "nan_value_counts": {
            "array": []
        },
        "lower_bounds": {
            "array": [
                {
                    "key": 2,
                    "value": "d\u0000\u0000\u0000"
                },
                {
                    "key": 3,
                    "value": "\u0000"
                },
                {
                    "key": 1,
                    "value": "foo1"
                }
            ]
        },
        "upper_bounds": {
            "array": [
                {
                    "key": 1,
                    "value": "foo2"
                },
                {
                    "key": 2,
                    "value": "d\u0000\u0000\u0000"
                },
                {
                    "key": 3,
                    "value": "\u0001"
                }
            ]
        },
        "key_metadata": {
            "bytes": ""
        },
        "split_offsets": {
            "array": [
                4
            ]
        },
        "equality_ids": {
            "array": []
        },
        "sort_order_id": null
    }
}
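
For readers checking the `lower_bounds` and `upper_bounds` values above by hand: Iceberg stores bounds in its single-value binary serialization, where an int is 4 bytes little-endian, a boolean is one byte (zero for false) and a string is UTF-8. The sketch below decodes them with the standard library only; the helper names are made up for this example.

```rust
use std::convert::TryFrom;

/// Decodes an int bound: exactly 4 bytes, little-endian.
fn decode_int_bound(bytes: &[u8]) -> Option<i32> {
    let raw = <[u8; 4]>::try_from(bytes).ok()?;
    Some(i32::from_le_bytes(raw))
}

/// Decodes a boolean bound: exactly 1 byte, zero means false.
fn decode_bool_bound(bytes: &[u8]) -> Option<bool> {
    match bytes {
        [byte] => Some(*byte != 0),
        _ => None,
    }
}

/// Decodes a string bound: the bytes are UTF-8 text.
fn decode_string_bound(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}
```

The byte string `d\u0000\u0000\u0000` is `0x64 0x00 0x00 0x00`, so `decode_int_bound(b"d\x00\x00\x00")` yields 100, which matches the partition value `"int": 100` in the entry.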

Probably we just want to set the key_metadata to null instead of empty bytes.
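
One possible shape for that, assuming the field is held as an `Option<Vec<u8>>` before it is written. The function name is hypothetical and the actual fix may look different:

```rust
/// Treats an empty byte vector the same as "no key metadata", so the
/// writer would emit null rather than empty bytes.
fn normalize_key_metadata(key_metadata: Option<Vec<u8>>) -> Option<Vec<u8>> {
    key_metadata.filter(|bytes| !bytes.is_empty())
}
```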

feniljain · 2024-12-04T06:05:05Z

Hey @Fokko 👋🏻

Thanks a lot for checking this in such detail! Can I take up both of the issues, since both are related to this test? 😅

Also, slightly tangential, but I have a small idea 💡: do you think we could use snapshot-based testing over these files for end-to-end tests? Snapshot-based testing would save us from checking a lot of fields in every test; we could just compare and evolve snapshots as new features are added. I have seen the idea used with great success in projects like rust-analyzer, and crates like https://docs.rs/insta/latest/insta/ can help us set it up.

If you think this makes sense and I should create a new issue to discuss this, do let me know, will do that :)

Fokko · 2024-12-06T14:35:26Z

Regarding the testing, I would like to invite @liurenjie1024 @Xuanwo and @ZENOTME to give an opinion on that :D I'm just dipping my toe into lake Rust 🦀

crates/integration_tests/tests/append_partition_data_file_test.rs

ZENOTME · 2024-12-06T15:31:55Z

Thanks @feniljain! I tried insta locally and I think this is a cool idea. The tool lets us compare the fields of snapshots conveniently. One thing we may need to address is filtering out some random fields. A creative idea is to support Avro format files, allowing us to create snapshots of the entire Iceberg metadata, which can then be used for quick comparisons in end-to-end tests. I also agree with opening an issue to discuss this.
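
To illustrate the "filter out random fields" point without pulling in any crate, here is a small sketch that masks volatile metadata values before two renderings are compared. The flat key/value map, the `VOLATILE_KEYS` list and `render_snapshot` are assumptions made for this example; insta documents its own redaction support, which is deliberately not used here.

```rust
use std::collections::BTreeMap;

/// Keys whose values change on every run (hypothetical selection).
const VOLATILE_KEYS: [&str; 3] = ["table-uuid", "last-updated-ms", "current-snapshot-id"];

/// Renders metadata fields as stable text, masking volatile values so that
/// two runs of the same test produce identical output.
fn render_snapshot(fields: &BTreeMap<String, String>) -> String {
    let mut out = String::new();
    // BTreeMap iterates in key order, which keeps the rendering deterministic.
    for (key, value) in fields {
        let shown = if VOLATILE_KEYS.contains(&key.as_str()) {
            "[redacted]"
        } else {
            value.as_str()
        };
        out.push_str(key);
        out.push_str(": ");
        out.push_str(shown);
        out.push('\n');
    }
    out
}
```

Two metadata dumps that differ only in `table-uuid` then render to the same text, so a stored snapshot stays valid across runs.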

Xuanwo

Thank you @feniljain for working on this, and thank you @Fokko and @ZENOTME for the review. Let's move!

feniljain force-pushed the iceberg_partition_test branch 2 times, most recently from f5a9d4a to 8e0a702 Compare November 30, 2024 10:52

Fokko previously approved these changes Dec 3, 2024

This was referenced Dec 3, 2024

Write null for current-snapshot-id #752

Closed

Write null for key_metadata instead of empty bytes #753

Closed

Fokko marked this pull request as ready for review December 6, 2024 14:35

ZENOTME reviewed Dec 6, 2024

crates/integration_tests/tests/append_partition_data_file_test.rs (outdated, resolved)

feniljain mentioned this pull request Dec 14, 2024

fix: set key_metadata to Null by default #800

Merged

feniljain force-pushed the iceberg_partition_test branch from 038b790 to 138c9d9 Compare December 14, 2024 10:18

feniljain added 2 commits December 14, 2024 16:39

test: append partition data file

2213be5

chore: fix compatible spell mistake

f68e80d

feniljain dismissed Fokko’s stale review via f68e80d December 14, 2024 11:12

feniljain force-pushed the iceberg_partition_test branch from 138c9d9 to f68e80d Compare December 14, 2024 11:12

feniljain requested review from Fokko and ZENOTME December 14, 2024 11:20

Xuanwo approved these changes Dec 14, 2024

Xuanwo merged commit 7981def into apache:main Dec 14, 2024
16 checks passed

feniljain mentioned this pull request Dec 14, 2024

Snapshot Testing for Integration Tests #803

Closed

feniljain deleted the iceberg_partition_test branch December 14, 2024 14:18
