🐛 Write fields instead of spec object #846

Fokko · 2024-06-21T16:36:18Z

It should write the fields instead of the full spec: #208 (comment)

Also, did a small OOP refactor.

Fokko · 2024-06-21T16:52:38Z

cc @kevinjqliu @syun64

kevinjqliu

added a few comments

kevinjqliu · 2024-06-21T18:16:28Z

pyiceberg/manifest.py

+    def _meta(self) -> Dict[str, str]:
+        return {
+            "schema": self._schema.model_dump_json(),
+            "partition-spec": to_json(self._spec.fields).decode("utf-8"),


Is there a reason why we don't want to use the same logic here? Like

"partition-spec": self._spec.model_dump_json(),

Unfortunately, `spec.fields` returns a list, which is a native Python construct rather than a Pydantic object, so `model_dump_json()` isn't available on it.
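To make the difference between the two serializations concrete, here is a minimal sketch that uses only the standard library. Plain dicts stand in for the Pydantic models, the `spec-id` value and the `dump_compact` helper are assumptions for illustration, and the field values are taken from the expected string in the test diff in this thread. It is not the pyiceberg implementation.

```python
import json

# Hypothetical stand-in for a partition spec: plain dicts instead of Pydantic models.
# The "spec-id" value is an assumption made for this example.
spec = {
    "spec-id": 0,
    "fields": [
        {"source-id": 1, "field-id": 1, "transform": "identity", "name": "VendorID"},
        {"source-id": 2, "field-id": 2, "transform": "identity", "name": "tpep_pickup_datetime"},
    ],
}


def dump_compact(value) -> str:
    """Serialize to JSON without whitespace between tokens (illustrative helper)."""
    return json.dumps(value, separators=(",", ":"))


# Before this PR: the whole spec object ended up in the "partition-spec" attribute.
whole_spec_json = dump_compact(spec)

# After this PR: only the list of fields is written.
fields_only_json = dump_compact(spec["fields"])

# A reader that expects a JSON array can tell the two shapes apart after parsing.
print(type(json.loads(whole_spec_json)).__name__)
print(type(json.loads(fields_only_json)).__name__)
```

The top-level JSON type is what matters to readers: a parser that expects an array of fields fails when it receives an object that wraps them.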

kevinjqliu · 2024-06-21T18:17:04Z

pyiceberg/manifest.py

-                "schema": schema.model_dump_json(),
-                "partition-spec": spec.model_dump_json(),
-                "partition-spec-id": str(spec.spec_id),
-                "format-version": "1",


this is 👍 since the version function is defined

kevinjqliu · 2024-06-21T18:17:10Z

pyiceberg/manifest.py

-                "schema": schema.model_dump_json(),
-                "partition-spec": spec.model_dump_json(),
-                "partition-spec-id": str(spec.spec_id),
-                "format-version": "2",


this is 👍 since the version function is defined

kevinjqliu · 2024-06-21T18:17:59Z

tests/utils/test_manifest.py

            "schema": test_schema.model_dump_json(),
-            "partition-spec": test_spec.model_dump_json(),
-            "partition-spec-id": str(test_spec.spec_id),
+            "partition-spec": """[{"source-id":1,"field-id":1,"transform":"identity","name":"VendorID"},{"source-id":2,"field-id":2,"transform":"identity","name":"tpep_pickup_datetime"}]""",


is it possible to not hardcode this value?

I actually like that we are hardcoding this value: the issue wasn't caught before because we inferred the expected value from test_spec :)

I see, makes sense

I prefer to hardcode the expected value so it is clear what is being returned when you go over the tests.

sungwy

LGTM @Fokko - thank you for the quick fix!

HonahX

Sorry for being late here. @Fokko Great catch! Thanks for fixing this and the refactoring :). @syun64 @kevinjqliu Thanks for reviewing!

Related to projectnessie#9042, Iceberg's `o.a.iceberg.ManifestReader.ManifestReader()` extracts the partition spec either from a provided `Map<Integer, PartitionSpec>` or by reconstructing it from the Avro metadata attributes. However, pyiceberg up to and including version 0.6.1 writes _invalid_ manifest files (see apache/iceberg-python#846): the `partition-spec` Avro metadata attribute contains the JSON of the whole partition spec instead of just the partition-spec fields. This change propagates the mentioned map down to the manifest reader to work around this pyiceberg issue.
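The lookup order described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Java `ManifestReader` or the Nessie code: `resolve_partition_fields` and `specs_by_id` are hypothetical names, metadata values are modelled as strings rather than Avro bytes, and a missing `partition-spec-id` is assumed to default to 0.

```python
import json
from typing import Any, Dict, List, Optional

Fields = List[Dict[str, Any]]


def resolve_partition_fields(
    metadata: Dict[str, str],
    specs_by_id: Optional[Dict[int, Fields]] = None,
) -> Fields:
    """Return partition fields, preferring a caller-supplied map over file metadata."""
    # Assumption: a missing "partition-spec-id" attribute means spec id 0.
    spec_id = int(metadata.get("partition-spec-id", "0"))

    # With a map available, the "partition-spec" attribute is never parsed,
    # so a malformed value written by an older writer does no harm.
    if specs_by_id is not None:
        return specs_by_id[spec_id]

    # Fallback: reconstruct the fields from the metadata attribute.
    parsed = json.loads(metadata["partition-spec"])
    if not isinstance(parsed, list):
        raise ValueError("partition-spec must be a JSON array of fields")
    return parsed


fields = [{"source-id": 1, "field-id": 1, "transform": "identity", "name": "VendorID"}]

# Metadata as an affected writer would produce it: the whole spec, not just the fields.
bad_metadata = {
    "partition-spec": json.dumps({"spec-id": 0, "fields": fields}),
    "partition-spec-id": "0",
}

# Supplying the map sidesteps the malformed attribute entirely.
resolved = resolve_partition_fields(bad_metadata, specs_by_id={0: fields})
```

Without the map, the same call raises `ValueError` in this sketch, which mirrors why passing the map down works as a workaround.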

🐛 Write fields instead of spec object

4abde82

kevinjqliu reviewed Jun 21, 2024

sungwy approved these changes Jun 21, 2024

HonahX approved these changes Jun 24, 2024

HonahX merged commit 8cdf4ab into apache:main Jun 24, 2024

Fokko deleted the fd-buggg branch June 24, 2024 07:03

Fokko mentioned this pull request Jun 24, 2024

Parsing of partition-spec JSON from Avro manifest files is not to spec, causing deserialization to fail on files written by pyiceberg apache/iceberg-rust#419

Closed

snazy mentioned this pull request Jul 18, 2024

[Bug]: Nessie GC is not deleting files from S3 bucket after GC/delete command (expiry and orphan file clean up) projectnessie/nessie#9042

Closed

snazy mentioned this pull request Jul 18, 2024

GC: Manifest file reading with specById projectnessie/nessie#9131

Merged
