Skip to content

Conversation

@feniljain
Copy link
Contributor

@feniljain feniljain commented Nov 30, 2024

Issue Resolved

Closes #720

Description

  • Have added test case for base partition data file addition
  • Added a test case where different "type" of partition field is set in DataFileWriterBuilder when compared to schema
  • Added a test case where number of partition fields set in DataFileWriterBuilder is different from schema
  • Fixed noticed spelling mistakes in transaction.rs

Output screenshots

Screenshot 2024-11-30 at 16 07 39

@feniljain feniljain force-pushed the iceberg_partition_test branch 2 times, most recently from f5a9d4a to 8e0a702 Compare November 30, 2024 10:52
@Fokko
Copy link
Contributor

Fokko commented Dec 3, 2024

Thanks @feniljain for working on this 🙌 Did some checks:

First metadata.json:

{
  "format-version" : 2,
  "table-uuid" : "eb83b77f-c2c3-473c-a138-444a3de61213",
  "location" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test",
  "last-sequence-number" : 0,
  "last-updated-ms" : 1733259665987,
  "last-column-id" : 3,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "identifier-field-ids" : [ 2 ],
    "fields" : [ {
      "id" : 1,
      "name" : "foo",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "bar",
      "required" : true,
      "type" : "int"
    }, {
      "id" : 3,
      "name" : "baz",
      "required" : false,
      "type" : "boolean"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "id",
      "transform" : "identity",
      "source-id" : 2,
      "field-id" : 1000
    } ]
  } ],
  "last-partition-id" : 1000,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "write.parquet.compression-codec" : "zstd"
  },
  "current-snapshot-id" : -1,
  "refs" : { },
  "snapshots" : [ ],
  "statistics" : [ ],
  "partition-statistics" : [ ],
  "snapshot-log" : [ ],
  "metadata-log" : [ ]
}

Without a commit, the current-snapshot-id should be null instead of -1.

After the commit:

{
  "format-version" : 2,
  "table-uuid" : "eb83b77f-c2c3-473c-a138-444a3de61213",
  "location" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test",
  "last-sequence-number" : 1,
  "last-updated-ms" : 1733259666572,
  "last-column-id" : 3,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "identifier-field-ids" : [ 2 ],
    "fields" : [ {
      "id" : 1,
      "name" : "foo",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "bar",
      "required" : true,
      "type" : "int"
    }, {
      "id" : 3,
      "name" : "baz",
      "required" : false,
      "type" : "boolean"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "id",
      "transform" : "identity",
      "source-id" : 2,
      "field-id" : 1000
    } ]
  } ],
  "last-partition-id" : 1000,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "write.parquet.compression-codec" : "zstd"
  },
  "current-snapshot-id" : 8826880672679595429,
  "refs" : {
    "main" : {
      "snapshot-id" : 8826880672679595429,
      "type" : "branch"
    }
  },
  "snapshots" : [ {
    "sequence-number" : 1,
    "snapshot-id" : 8826880672679595429,
    "timestamp-ms" : 1733259666572,
    "summary" : {
      "operation" : "append"
    },
    "manifest-list" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/snap-8826880672679595429-0-01938e53-a487-7ee2-a75e-c061dea0853c.avro",
    "schema-id" : 0
  } ],
  "statistics" : [ ],
  "partition-statistics" : [ ],
  "snapshot-log" : [ {
    "timestamp-ms" : 1733259666572,
    "snapshot-id" : 8826880672679595429
  } ],
  "metadata-log" : [ {
    "timestamp-ms" : 1733259665987,
    "metadata-file" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/00000-647ca34f-8a7b-4a44-8d28-775bc62ef650.metadata.json"
  } ]
}

Which looks good.

{
    "manifest_path": "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/01938e53-a487-7ee2-a75e-c061dea0853c-m0.avro",
    "manifest_length": 3391,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 1,
    "min_sequence_number": 1,
    "added_snapshot_id": 8826880672679595000,
    "added_files_count": 1,
    "existing_files_count": 0,
    "deleted_files_count": 0,
    "added_rows_count": 2,
    "existing_rows_count": 0,
    "deleted_rows_count": 0,
    "partitions": {
        "array": [
            {
                "contains_null": false,
                "contains_nan": null,
                "lower_bound": {
                    "bytes": "d\u0000\u0000\u0000"
                },
                "upper_bound": {
                    "bytes": "d\u0000\u0000\u0000"
                }
            }
        ]
    },
    "key_metadata": null
}

Which also looks good. The snapshot:

{
    "status": 1,
    "snapshot_id": null,
    "sequence_number": null,
    "file_sequence_number": null,
    "data_file": {
        "content": 0,
        "file_path": "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/data/test-00000.parquet",
        "file_format": "PARQUET",
        "partition": {
            "id": {
                "int": 100
            }
        },
        "record_count": 2,
        "file_size_in_bytes": 1160,
        "column_sizes": {
            "array": [
                {
                    "key": 3,
                    "value": 36
                },
                {
                    "key": 1,
                    "value": 74
                },
                {
                    "key": 2,
                    "value": 55
                }
            ]
        },
        "value_counts": {
            "array": [
                {
                    "key": 1,
                    "value": 2
                },
                {
                    "key": 3,
                    "value": 2
                },
                {
                    "key": 2,
                    "value": 2
                }
            ]
        },
        "null_value_counts": {
            "array": [
                {
                    "key": 2,
                    "value": 0
                },
                {
                    "key": 3,
                    "value": 0
                },
                {
                    "key": 1,
                    "value": 0
                }
            ]
        },
        "nan_value_counts": {
            "array": []
        },
        "lower_bounds": {
            "array": [
                {
                    "key": 2,
                    "value": "d\u0000\u0000\u0000"
                },
                {
                    "key": 3,
                    "value": "\u0000"
                },
                {
                    "key": 1,
                    "value": "foo1"
                }
            ]
        },
        "upper_bounds": {
            "array": [
                {
                    "key": 1,
                    "value": "foo2"
                },
                {
                    "key": 2,
                    "value": "d\u0000\u0000\u0000"
                },
                {
                    "key": 3,
                    "value": "\u0001"
                }
            ]
        },
        "key_metadata": {
            "bytes": ""
        },
        "split_offsets": {
            "array": [
                4
            ]
        },
        "equality_ids": {
            "array": []
        },
        "sort_order_id": null
    }
}

Probably we just want to set the key_metadata to null instead of empty bytes.

Fokko
Fokko previously approved these changes Dec 3, 2024
@feniljain
Copy link
Contributor Author

feniljain commented Dec 4, 2024

Hey @Fokko 👋🏻

Thanks a lot for checking up in detail! Can I take up both of the issues as both are related to this test itself? 😅

Also, slightly tangential, but I have a small idea 💡, do you think we can use snapshot based testing over these files for end-to-end tests? Snapshot based testing would allow us to not check a lot of fields in every test, and we can just compare + evolve snapshots as new features are added. I have seen the idea being used with great success in projects like rust-analyzer before, and crates like https://docs.rs/insta/latest/insta/ can help us set it up.

If you think this makes sense and I should create a new issue to discuss this, do let me know, will do that :)

@Fokko
Copy link
Contributor

Fokko commented Dec 6, 2024

Regarding the testing, I would like to invite @liurenjie1024 @Xuanwo and @ZENOTME to give an opinion on that :D I'm just tipping my toe into lake rust 🦀

@Fokko Fokko marked this pull request as ready for review December 6, 2024 14:35
@ZENOTME
Copy link
Contributor

ZENOTME commented Dec 6, 2024

Hey @Fokko 👋🏻

Thanks a lot for checking up in detail! Can I take up both of the issues as both are related to this test itself? 😅

Also, slightly tangential, but I have a small idea 💡, do you think we can use snapshot based testing over these files for end-to-end tests? Snapshot based testing would allow us to not check a lot of fields in every test, and we can just compare + evolve snapshots as new features are added. I have seen the idea being used with great success in projects like rust-analyzer before, and crates like https://docs.rs/insta/latest/insta/ can help us set it up.

If you think this makes sense and I should create a new issue to discuss this, do let me know, will do that :)

Thanks @feniljain! I try insta locally and I think this is a cool idea. This tool enables us to compare the field of snapshots conveniently. One thing we need to address maybe filter out some random field. A creative idea is to support Avro format files, allowing us to create snapshots of the entire Iceberg metadata, which can then be used for quick comparisons in end-to-end tests. I also agree to open an issue to discuss this.

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @feniljain for working on this and thank you @Fokko & @ZENOTME's review. Let's move!

@Xuanwo Xuanwo merged commit 7981def into apache:main Dec 14, 2024
16 checks passed
@feniljain feniljain deleted the iceberg_partition_test branch December 14, 2024 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extend the e2e tests with append a partitioned file

4 participants