-
Notifications
You must be signed in to change notification settings - Fork 398
🐛 Write fields instead of spec object #846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @kevinjqliu @syun64 |
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a few comments
| def _meta(self) -> Dict[str, str]: | ||
| return { | ||
| "schema": self._schema.model_dump_json(), | ||
| "partition-spec": to_json(self._spec.fields).decode("utf-8"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason why we dont want to use the same logic? Like
"partition-spec": self._spec.model_dump_json(),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the spec.fields returns a list, which is not a Pydantic object, but a native Python construct. So the method isn't available.
| "schema": schema.model_dump_json(), | ||
| "partition-spec": spec.model_dump_json(), | ||
| "partition-spec-id": str(spec.spec_id), | ||
| "format-version": "1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is 👍 since the version function is defined
| "schema": schema.model_dump_json(), | ||
| "partition-spec": spec.model_dump_json(), | ||
| "partition-spec-id": str(spec.spec_id), | ||
| "format-version": "2", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is 👍 since the version function is defined
| "schema": test_schema.model_dump_json(), | ||
| "partition-spec": test_spec.model_dump_json(), | ||
| "partition-spec-id": str(test_spec.spec_id), | ||
| "partition-spec": """[{"source-id":1,"field-id":1,"transform":"identity","name":"VendorID"},{"source-id":2,"field-id":2,"transform":"identity","name":"tpep_pickup_datetime"}]""", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to not hardcode this value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually like that we are hardcoding this value because the issue wasn't caught because we inferred it from test_spec before :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to hardcore the expected value so it is clear what is being returned when you go over the tests.
sungwy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @Fokko - thank you for the quick fix!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being late here. @Fokko Great catch! Thanks for fixing this and the refactoring :). @syun64 @kevinjqliu Thanks for reviewing!
Related to projectnessie#9042, Iceberg's `o.a.iceberg.ManifestReader.ManifestReader()` extracts the partition spec either via a provided `Map<Integer, PartitionSpec>` or re-constructs it from Avro metadata attributes. pyiceberg until including version 0.6.1 however writes _invalid_ manifest files (see apache/iceberg-python#846) with the `partition-spec` Avro metadata attribute containing the JSON of the whole partition-spec instead of just the partition-spec fields. This change propagates the mentioned map down to the manifest-reader to work around this pyiceberg issue.
Related to #9042, Iceberg's `o.a.iceberg.ManifestReader.ManifestReader()` extracts the partition spec either via a provided `Map<Integer, PartitionSpec>` or re-constructs it from Avro metadata attributes. pyiceberg until including version 0.6.1 however writes _invalid_ manifest files (see apache/iceberg-python#846) with the `partition-spec` Avro metadata attribute containing the JSON of the whole partition-spec instead of just the partition-spec fields. This change propagates the mentioned map down to the manifest-reader to work around this pyiceberg issue.
It should write the fields instead of the full spec: #208 (comment)
Also, did a small OOP refactor.