test: append partition data file #742
Thanks @feniljain for working on this 🙌 Did some checks: First
{
"format-version" : 2,
"table-uuid" : "eb83b77f-c2c3-473c-a138-444a3de61213",
"location" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test",
"last-sequence-number" : 0,
"last-updated-ms" : 1733259665987,
"last-column-id" : 3,
"current-schema-id" : 0,
"schemas" : [ {
"type" : "struct",
"schema-id" : 0,
"identifier-field-ids" : [ 2 ],
"fields" : [ {
"id" : 1,
"name" : "foo",
"required" : false,
"type" : "string"
}, {
"id" : 2,
"name" : "bar",
"required" : true,
"type" : "int"
}, {
"id" : 3,
"name" : "baz",
"required" : false,
"type" : "boolean"
} ]
} ],
"default-spec-id" : 0,
"partition-specs" : [ {
"spec-id" : 0,
"fields" : [ {
"name" : "id",
"transform" : "identity",
"source-id" : 2,
"field-id" : 1000
} ]
} ],
"last-partition-id" : 1000,
"default-sort-order-id" : 0,
"sort-orders" : [ {
"order-id" : 0,
"fields" : [ ]
} ],
"properties" : {
"write.parquet.compression-codec" : "zstd"
},
"current-snapshot-id" : -1,
"refs" : { },
"snapshots" : [ ],
"statistics" : [ ],
"partition-statistics" : [ ],
"snapshot-log" : [ ],
"metadata-log" : [ ]
}
Without a commit, the
After the commit:
{
"format-version" : 2,
"table-uuid" : "eb83b77f-c2c3-473c-a138-444a3de61213",
"location" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test",
"last-sequence-number" : 1,
"last-updated-ms" : 1733259666572,
"last-column-id" : 3,
"current-schema-id" : 0,
"schemas" : [ {
"type" : "struct",
"schema-id" : 0,
"identifier-field-ids" : [ 2 ],
"fields" : [ {
"id" : 1,
"name" : "foo",
"required" : false,
"type" : "string"
}, {
"id" : 2,
"name" : "bar",
"required" : true,
"type" : "int"
}, {
"id" : 3,
"name" : "baz",
"required" : false,
"type" : "boolean"
} ]
} ],
"default-spec-id" : 0,
"partition-specs" : [ {
"spec-id" : 0,
"fields" : [ {
"name" : "id",
"transform" : "identity",
"source-id" : 2,
"field-id" : 1000
} ]
} ],
"last-partition-id" : 1000,
"default-sort-order-id" : 0,
"sort-orders" : [ {
"order-id" : 0,
"fields" : [ ]
} ],
"properties" : {
"write.parquet.compression-codec" : "zstd"
},
"current-snapshot-id" : 8826880672679595429,
"refs" : {
"main" : {
"snapshot-id" : 8826880672679595429,
"type" : "branch"
}
},
"snapshots" : [ {
"sequence-number" : 1,
"snapshot-id" : 8826880672679595429,
"timestamp-ms" : 1733259666572,
"summary" : {
"operation" : "append"
},
"manifest-list" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/snap-8826880672679595429-0-01938e53-a487-7ee2-a75e-c061dea0853c.avro",
"schema-id" : 0
} ],
"statistics" : [ ],
"partition-statistics" : [ ],
"snapshot-log" : [ {
"timestamp-ms" : 1733259666572,
"snapshot-id" : 8826880672679595429
} ],
"metadata-log" : [ {
"timestamp-ms" : 1733259665987,
"metadata-file" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/00000-647ca34f-8a7b-4a44-8d28-775bc62ef650.metadata.json"
} ]
}
Which looks good.
{
"manifest_path": "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/01938e53-a487-7ee2-a75e-c061dea0853c-m0.avro",
"manifest_length": 3391,
"partition_spec_id": 0,
"content": 0,
"sequence_number": 1,
"min_sequence_number": 1,
"added_snapshot_id": 8826880672679595000,
"added_files_count": 1,
"existing_files_count": 0,
"deleted_files_count": 0,
"added_rows_count": 2,
"existing_rows_count": 0,
"deleted_rows_count": 0,
"partitions": {
"array": [
{
"contains_null": false,
"contains_nan": null,
"lower_bound": {
"bytes": "d\u0000\u0000\u0000"
},
"upper_bound": {
"bytes": "d\u0000\u0000\u0000"
}
}
]
},
"key_metadata": null
}
Which also looks good. The snapshot:
{
"status": 1,
"snapshot_id": null,
"sequence_number": null,
"file_sequence_number": null,
"data_file": {
"content": 0,
"file_path": "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/data/test-00000.parquet",
"file_format": "PARQUET",
"partition": {
"id": {
"int": 100
}
},
"record_count": 2,
"file_size_in_bytes": 1160,
"column_sizes": {
"array": [
{
"key": 3,
"value": 36
},
{
"key": 1,
"value": 74
},
{
"key": 2,
"value": 55
}
]
},
"value_counts": {
"array": [
{
"key": 1,
"value": 2
},
{
"key": 3,
"value": 2
},
{
"key": 2,
"value": 2
}
]
},
"null_value_counts": {
"array": [
{
"key": 2,
"value": 0
},
{
"key": 3,
"value": 0
},
{
"key": 1,
"value": 0
}
]
},
"nan_value_counts": {
"array": []
},
"lower_bounds": {
"array": [
{
"key": 2,
"value": "d\u0000\u0000\u0000"
},
{
"key": 3,
"value": "\u0000"
},
{
"key": 1,
"value": "foo1"
}
]
},
"upper_bounds": {
"array": [
{
"key": 1,
"value": "foo2"
},
{
"key": 2,
"value": "d\u0000\u0000\u0000"
},
{
"key": 3,
"value": "\u0001"
}
]
},
"key_metadata": {
"bytes": ""
},
"split_offsets": {
"array": [
4
]
},
"equality_ids": {
"array": []
},
"sort_order_id": null
}
}
Probably we just want to set the
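The partition bounds in the dumps above show up as `d\u0000\u0000\u0000`, which is hard to compare by eye with the partition value `100`. The Iceberg spec serializes an `int` bound as 4 little-endian bytes and a `boolean` bound as a single byte, so a small decoder makes the comparison explicit. This is only a sketch: the function names are made up for illustration and are not part of iceberg-rust.

```rust
/// Decode an Iceberg `int` bound from its single-value binary form
/// (4 bytes, little-endian). Returns `None` if the length is not 4.
/// Hypothetical helper, not an iceberg-rust API.
fn decode_int_bound(bytes: &[u8]) -> Option<i32> {
    match bytes {
        [a, b, c, d] => Some(i32::from_le_bytes([*a, *b, *c, *d])),
        _ => None,
    }
}

/// Decode an Iceberg `boolean` bound: 0x00 is false, any non-zero byte is true.
/// Returns `None` if the input is not exactly one byte.
fn decode_bool_bound(bytes: &[u8]) -> Option<bool> {
    match bytes {
        [b] => Some(*b != 0),
        _ => None,
    }
}
```

Since `d` is byte `0x64`, `decode_int_bound(b"d\x00\x00\x00")` yields `Some(100)`, which agrees with the `"int": 100` partition value in the manifest entry.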
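The differences between the two table-metadata dumps can also be written down as explicit checks. The sketch below covers only the changes visible in those dumps (sequence number, current snapshot, `main` ref, snapshot list, metadata log). `MetadataSummary` and `check_append_commit` are hypothetical names for illustration, not iceberg-rust types, and the checks are not a complete statement of what the spec requires.

```rust
/// A few table-metadata fields worth comparing across a commit.
/// Hypothetical summary struct for illustration, not an iceberg-rust type.
#[derive(Debug, Clone, PartialEq)]
struct MetadataSummary {
    last_sequence_number: i64,
    /// -1 means "no current snapshot", as in the first dump.
    current_snapshot_id: i64,
    /// Snapshot id the `main` branch points at, if the ref exists.
    main_ref_snapshot_id: Option<i64>,
    snapshot_count: usize,
    metadata_log_len: usize,
}

/// Check the changes a single append commit made in the dumps above.
fn check_append_commit(before: &MetadataSummary, after: &MetadataSummary) -> Result<(), String> {
    if after.last_sequence_number != before.last_sequence_number + 1 {
        return Err(format!(
            "last-sequence-number went from {} to {}",
            before.last_sequence_number, after.last_sequence_number
        ));
    }
    if after.snapshot_count != before.snapshot_count + 1 {
        return Err("expected exactly one new snapshot".to_string());
    }
    if after.current_snapshot_id == -1 || after.current_snapshot_id == before.current_snapshot_id {
        return Err("current-snapshot-id did not move to a new snapshot".to_string());
    }
    if after.main_ref_snapshot_id != Some(after.current_snapshot_id) {
        return Err("the main ref does not point at the current snapshot".to_string());
    }
    if after.metadata_log_len != before.metadata_log_len + 1 {
        return Err("expected one new metadata-log entry".to_string());
    }
    Ok(())
}
```

With the values from the dumps (sequence number 0 to 1, snapshot id -1 to 8826880672679595429, `main` pointing at that id, and one new entry in each log), the check passes.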
Hey @Fokko 👋🏻 Thanks a lot for checking this in such detail! Can I take up both of the issues, since both are related to this test itself? 😅 Also, slightly tangential, but I have a small idea 💡: do you think we could use snapshot-based testing over these files for end-to-end tests? Snapshot-based testing would save us from checking a lot of fields in every test; we could just compare and evolve the snapshots as new features are added. I have seen the idea used with great success in projects like rust-analyzer, and crates like https://docs.rs/insta/latest/insta/ can help us set it up. If you think this makes sense and I should create a new issue to discuss it, do let me know and I will do that :)
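To make the snapshot-testing idea concrete, here is a minimal hand-rolled sketch of the record-then-compare loop, using only the standard library. It is not how insta is implemented: insta keeps snapshots as `.snap` files on disk and has a review workflow, whereas this sketch uses an in-memory map, and `Outcome` and `check_snapshot` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Result of comparing a rendered value against its stored snapshot.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// No snapshot existed yet, so the current output was stored.
    Recorded,
    /// The output is identical to the stored snapshot.
    Matched,
    /// The output differs; the stored text is returned for diffing.
    Mismatch { expected: String },
}

/// Compare `actual` against the snapshot stored under `name`,
/// recording it on first use. Hypothetical helper for illustration.
fn check_snapshot(store: &mut BTreeMap<String, String>, name: &str, actual: &str) -> Outcome {
    let stored = store.get(name).cloned();
    match stored {
        None => {
            store.insert(name.to_string(), actual.to_string());
            Outcome::Recorded
        }
        Some(expected) if expected.as_str() == actual => Outcome::Matched,
        Some(expected) => Outcome::Mismatch { expected },
    }
}
```

A test would render the table metadata or a manifest entry to text, call `check_snapshot`, and fail on `Mismatch`. Evolving a snapshot then means reviewing the difference and replacing the stored text, rather than editing many field-by-field assertions.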
Regarding the testing, I would like to invite @liurenjie1024, @Xuanwo, and @ZENOTME to give an opinion on that :D I'm just dipping my toe into the Rust lake 🦀
Thanks @feniljain! I tried insta locally and I think this is a cool idea. The tool lets us compare the fields of snapshots conveniently. One thing we may need to address is filtering out fields whose values are random. A creative idea is to support Avro format files, which would let us create snapshots of the entire Iceberg metadata and use them for quick comparisons in end-to-end tests. I also agree with opening an issue to discuss this.
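Filtering out random fields could look roughly like the sketch below, which blanks the values of selected keys before a snapshot is compared. It assumes pretty-printed JSON with one scalar `"key" : value` pair per line, as in the metadata dumps earlier in this thread. `redact_fields` and `redact_line` are hypothetical helpers. insta has its own redaction support, which would likely be the better fit in practice.

```rust
/// Replace the value of every listed key with a fixed placeholder.
/// Assumes pretty-printed JSON with one scalar `"key" : value` pair per line;
/// keys whose value is an object or array must not be listed.
fn redact_fields(json: &str, volatile_keys: &[&str]) -> String {
    json.lines()
        .map(|line| redact_line(line, volatile_keys))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Redact a single line if it starts with one of the volatile keys,
/// keeping the original indentation and any trailing comma.
fn redact_line(line: &str, volatile_keys: &[&str]) -> String {
    let trimmed = line.trim_start();
    let indent = &line[..line.len() - trimmed.len()];
    for key in volatile_keys {
        let quoted = format!("\"{}\"", key);
        if let Some(rest) = trimmed.strip_prefix(quoted.as_str()) {
            if rest.trim_start().starts_with(':') {
                let comma = if line.trim_end().ends_with(',') { "," } else { "" };
                return format!("{}{} : \"[redacted]\"{}", indent, quoted, comma);
            }
        }
    }
    line.to_string()
}
```

For the table metadata above, plausible candidates are `table-uuid`, `last-updated-ms`, `timestamp-ms`, `snapshot-id`, and `current-snapshot-id`, since they change on every run while the schema and partition spec stay stable.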
Xuanwo
left a comment
Thank you @feniljain for working on this, and thank you @Fokko and @ZENOTME for the review. Let's move!
Issue Resolved
Closes #720
Description
DataFileWriterBuilder when compared to schema
DataFileWriterBuilder is different from schema
transaction.rs
Output screenshots