-
Notifications
You must be signed in to change notification settings - Fork 389
Add name-mapping #212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add name-mapping #212
Conversation
15bf5f5 to
a8aed26
Compare
All the things to (de)serialize the name-mapping, and all the neccessary visitors and such
a8aed26 to
0a0e829
Compare
pyiceberg/table/name_mapping.py
Outdated
| try: | ||
| return self._field_by_id[field_id] | ||
| except KeyError as e: | ||
| raise ValueError(f"Could not find field-id: {field_id}") from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why raise an exception? In the Java ApplyNameMapping, fields are not updated with an ID if there is no matching mapping. It seems like returning Optional[MappedField] is the equivalent here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is considered more idiomatic (although it is still up for debate): https://devblogs.microsoft.com/python/idiomatic-python-eafp-versus-lbyl/
We have the same pattern in find_field:
iceberg-python/pyiceberg/schema.py
Lines 177 to 203 in ba8f9eb
| def find_field(self, name_or_id: Union[str, int], case_sensitive: bool = True) -> NestedField: | |
| """Find a field using a field name or field ID. | |
| Args: | |
| name_or_id (Union[str, int]): Either a field name or a field ID. | |
| case_sensitive (bool, optional): Whether to perform a case-sensitive lookup using a field name. Defaults to True. | |
| Raises: | |
| ValueError: When the value cannot be found. | |
| Returns: | |
| NestedField: The matched NestedField. | |
| """ | |
| if isinstance(name_or_id, int): | |
| if name_or_id not in self._lazy_id_to_field: | |
| raise ValueError(f"Could not find field with id: {name_or_id}") | |
| return self._lazy_id_to_field[name_or_id] | |
| if case_sensitive: | |
| field_id = self._name_to_id.get(name_or_id) | |
| else: | |
| field_id = self._lazy_name_to_id_lower.get(name_or_id.lower()) | |
| if field_id is None: | |
| raise ValueError(f"Could not find field with name {name_or_id}, case_sensitive={case_sensitive}") | |
| return self._lazy_id_to_field[field_id] |
This avoids having to do endless None-check in the code downstream. Often when you are sure that it can't be None, but you have to please the type checker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you just need to catch ValueError instead? It's normal for fields to not be mapped.
pyiceberg/table/name_mapping.py
Outdated
| def _field_by_name(self) -> Dict[str, MappedField]: | ||
| return visit_name_mapping(self, _IndexByName()) | ||
|
|
||
| def id(self, name: str) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this pulls from both NameMapping and MappedFields. The behavior is like NameMapping because it looks up fields but the method names are from MappedFields. I think this should use the find method names from NameMapping because the project typically uses find to signal that it isn't a direct lookup.
Also, there isn't a use of the lookup by ID method other than in tests, so I'd skip that. I think the only one that is used is the lookup(name: Tuple[str, ...]) version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, missed that one since they are combined in Python!
pyiceberg/table/name_mapping.py
Outdated
|
|
||
| def id(self, name: str) -> int: | ||
| try: | ||
| return self._field_by_name[name].field_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should behave like the methods elsewhere that accept Union[str, Tuple[str, ...]]. If the caller has a path of separate names to find, then this should join them with . before looking up the mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of:
def find(self, *names: str) -> MappedField:
name = '.'.join(names)
try:
return self._field_by_name[name]
except KeyError as e:
raise ValueError(f"Could not find field with name: {name}") from eI think this is more pythonic, it will work as find('a.b.c'), find('a', 'b', 'c') and find(*['a', 'b', 'c']).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
pyiceberg/table/name_mapping.py
Outdated
| """Visit a MappedField.""" | ||
|
|
||
|
|
||
| class _IndexById(NameMappingVisitor[Dict[int, MappedField]]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is actually needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I just copied what's on the Java side, but if you don't think it isn't needed, let's remove it for now.
pyiceberg/table/name_mapping.py
Outdated
| return visitor.fields(fields, results) | ||
|
|
||
|
|
||
| def load_mapping_from_json(mapping: str) -> NameMapping: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about parse?
|
|
||
| @visit_name_mapping.register(list) | ||
| def _(fields: List[MappedField], visitor: NameMappingVisitor[T]) -> T: | ||
| results = [visitor.field(field, visit_name_mapping(field.fields, visitor)) for field in fields] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if fields is None? Will that return None or raise an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is hidden by returning [] instead of None in the _CreateMapping visitor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is an interesting semantic discussion. When Pydantic deserializes, and it encounters that the fields field is missing, it will automatically inject an empty list:
| fields: List[MappedField] = Field(default_factory=list) |
When serializing to JSON, it will omit the fields key in the JSÓN object.
This way we can just always assume that there is a list as it is annotated, and this is also why the type-checker isn't complaining. So it boils down to the definition: Is there a difference in meaning when fields is set to None or []?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also added tests to illustrate the (de)serialization and also capture this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as it works, I'm good with this.
| ] | ||
|
|
||
| def primitive(self, primitive: PrimitiveType) -> List[MappedField]: | ||
| return [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then all the signatures should change into Optional[List[MappedField]] which isn't nice. I've commented above: #212 (comment)
| fields: List[MappedField] = Field(default_factory=list) | ||
|
|
||
| @model_serializer | ||
| def ser_model(self) -> Dict[str, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed if fields is Optional[List[MappedField]] instead? It seems more accurate to me to use None for nested mappings for primitive types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is the ser, it will leave out the fields entirely if there are no nested fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant was: is this method still needed if fields is None instead of an empty list?
rdblue
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great overall. I have a few minor comments about API and whether to use [] or None when there are no nested fields.
| def _field_by_name(self) -> Dict[str, MappedField]: | ||
| return visit_name_mapping(self, _IndexByName()) | ||
|
|
||
| def find(self, *names: str) -> MappedField: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Do we need to check that names is not empty?
rdblue
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks correct to me. My only issue is that find will throw a ValueError when a name isn't found, which I think is a bit awkward compared to handling None. That's minor though and will get worked out when we go to write ApplyNameMapping code.
And it's arguably more pythonic to raise exceptions for control flow. 😄
All the things to (de)serialize the name-mapping, and all the neccessary visitors and such