Improve the InMemory Catalog Implementation #289
Conversation
pyiceberg/catalog/in_memory.py
Outdated
```python
from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
from pyiceberg.typedef import EMPTY_DICT


DEFAULT_WAREHOUSE_LOCATION = "file:///tmp/warehouse"
```
By default, this writes to disk at `/tmp/warehouse`.
pyiceberg/catalog/in_memory.py
Outdated
```python
super().__init__(name, **properties)
self.__tables = {}
self.__namespaces = {}
self._warehouse_location = properties.get(WAREHOUSE, None) or DEFAULT_WAREHOUSE_LOCATION
```
A warehouse location can be passed in via properties; it can also point to another filesystem, such as S3.
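A minimal sketch of the constructor behavior shown in the diff above (the class body is trimmed to the warehouse-location handling; `WAREHOUSE` and the default are the names from the diff):

```python
# Sketch only: trimmed stand-in for the real InMemoryCatalog constructor.
WAREHOUSE = "warehouse"
DEFAULT_WAREHOUSE_LOCATION = "file:///tmp/warehouse"

class InMemoryCatalog:
    def __init__(self, name: str, **properties: str) -> None:
        self.name = name
        self.properties = properties
        # Fall back to the local default when no warehouse is configured.
        self._warehouse_location = properties.get(WAREHOUSE, None) or DEFAULT_WAREHOUSE_LOCATION

# The warehouse can point at another filesystem, e.g. S3:
catalog = InMemoryCatalog("test", **{WAREHOUSE: "s3://my-bucket/warehouse"})
print(catalog._warehouse_location)  # s3://my-bucket/warehouse
```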
```diff
 @pytest.fixture
-def catalog() -> InMemoryCatalog:
-    return InMemoryCatalog("test.in.memory.catalog", **{"test.key": "test.value"})
+def catalog(tmp_path: PosixPath) -> InMemoryCatalog:
```
Added the ability to write to a temporary location for testing, which is then cleaned up automatically.
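A sketch of such a fixture (the catalog here is a stand-in dictionary; `tmp_path` is pytest's built-in per-test temporary directory, removed automatically after each test):

```python
import pytest
from pathlib import Path

def make_catalog(warehouse_dir: Path) -> dict:
    # Stand-in for InMemoryCatalog: only the warehouse property matters here.
    return {"name": "test.in.memory.catalog", "warehouse": warehouse_dir.as_uri()}

@pytest.fixture
def catalog(tmp_path: Path) -> dict:
    # tmp_path is unique per test and cleaned up by pytest afterwards.
    return make_catalog(tmp_path)
```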
tests/catalog/test_base.py
Outdated
```python
assert response.metadata.table_uuid == given_table.metadata.table_uuid
assert len(response.metadata.schemas) == 1
assert response.metadata.schemas[0] == new_schema
assert given_table.metadata.current_schema_id == 1
```
HonahX left a comment:
Thanks for your contribution! @kevinjqliu I left some initial comments below.
Thanks for working on this @kevinjqliu. The issue was created a long time ago, before we had the SqlCatalog with SQLite support. SQLite can also work in memory, rendering the …
Fokko left a comment:
Great work @kevinjqliu
Should we also add this catalog to the tests in tests/integration/test_reads.py?
pyiceberg/catalog/__init__.py
Outdated
```python
CatalogType.GLUE: load_glue,
CatalogType.DYNAMODB: load_dynamodb,
CatalogType.SQL: load_sql,
CatalogType.MEMORY: load_memory,
```
Can you also add this one to the docs (https://py.iceberg.apache.org/configuration/), with a warning that it is for testing purposes only?
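For context, the mapping in the diff above follows a simple registry pattern: a catalog type maps to a loader function. A self-contained sketch with stand-in names (not the real pyiceberg signatures):

```python
from enum import Enum

class CatalogType(Enum):
    GLUE = "glue"
    DYNAMODB = "dynamodb"
    SQL = "sql"
    MEMORY = "in-memory"

def load_memory(name: str, **properties: str) -> str:
    # Stand-in loader; the real one would construct an InMemoryCatalog.
    return f"InMemoryCatalog(name={name})"

# Registry mapping catalog types to their loader functions.
AVAILABLE_CATALOGS = {
    CatalogType.MEMORY: load_memory,
}

catalog = AVAILABLE_CATALOGS[CatalogType.MEMORY]("test")
```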
pyiceberg/catalog/in_memory.py
Outdated
```python
if identifier in self.__tables:
    raise TableAlreadyExistsError(f"Table already exists: {identifier}")
else:
    if namespace not in self.__namespaces:
```
Other implementations don't auto-create namespaces; however, I think it is fine for the InMemory one.
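A sketch of the auto-create behavior under discussion (a simplified stand-in for the real class; identifiers are tuples like `("ns", "table")`):

```python
class TableAlreadyExistsError(Exception):
    pass

class InMemoryCatalog:
    def __init__(self) -> None:
        self._tables = {}
        self._namespaces = {}

    def create_table(self, identifier: tuple) -> tuple:
        namespace = identifier[:-1]
        if identifier in self._tables:
            raise TableAlreadyExistsError(f"Table already exists: {identifier}")
        if namespace not in self._namespaces:
            # Unlike other implementations, create the namespace implicitly.
            self._namespaces[namespace] = {}
        self._tables[identifier] = object()
        return identifier

catalog = InMemoryCatalog()
catalog.create_table(("ns", "tbl"))
assert ("ns",) in catalog._namespaces  # namespace was auto-created
```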
pyiceberg/catalog/in_memory.py
Outdated
```python
if not location:
    location = f'{self._warehouse_location}/{"/".join(identifier)}'

metadata_location = f'{self._warehouse_location}/{"/".join(identifier)}/metadata/metadata.json'
```
It looks like we don't write the metadata here, but in the `_commit_table` method below.
Yep, the actual writing is done by `_commit_table` below, but the path of the metadata location is determined here.
Sorry, but I'm a bit confused here. If I just want to create a table without inserting any data:
`catalog.create_table(schema, ...)`
I still expect a new metadata.json file to be found at the table location, without any call to `_commit_table`. But that does not seem to be created by the InMemory catalog now. Is there a reason we chose this behavior?
In the previous implementation no file was written. But since we have updated `_commit_table` to write the metadata file, I think it is more reasonable to align `create_table` with the other production implementations. WDYT?
gotcha, that makes sense!
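To make the thread concrete, a small sketch of how the metadata location in the diff above is derived from the warehouse location and the table identifier (the values are examples):

```python
# Reproduces the path construction from the diff with example values.
warehouse_location = "file:///tmp/warehouse"
identifier = ("default", "my_table")

location = f'{warehouse_location}/{"/".join(identifier)}'
metadata_location = f'{location}/metadata/metadata.json'

print(metadata_location)
# file:///tmp/warehouse/default/my_table/metadata/metadata.json
```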
HonahX left a comment:
Overall LGTM! Thanks for updating this to a formal implementation and adding the doc.
I just have one more comment about create_table
```diff
 def text(self, response: str) -> None:
-    Console().print(response)
+    Console(soft_wrap=True).print(response)
```
Some test_console.py outputs are too long and end up with an extra `\n` in the middle of the string, causing tests to fail.
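A sketch of the difference (assuming the `rich` library, which provides `Console`): without `soft_wrap=True`, output longer than the console width is folded, which inserts newlines mid-string.

```python
from io import StringIO
from rich.console import Console

long_line = "x" * 200  # longer than the 80-column console below

hard = StringIO()
Console(file=hard, width=80).print(long_line)  # folded at width 80

soft = StringIO()
Console(file=soft, width=80, soft_wrap=True).print(long_line)  # kept on one line

assert "\n" in hard.getvalue().rstrip("\n")       # extra newlines were inserted
assert "\n" not in soft.getvalue().rstrip("\n")   # string left intact
```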
```python
identifier=TEST_TABLE_IDENTIFIER,
schema=pyarrow_schema_simple_without_ids,
location=TEST_TABLE_LOCATION,
partition_spec=TEST_TABLE_PARTITION_SPEC,
```
@syun64 FYI, I realized that the TEST_TABLE_PARTITION_SPEC here breaks this test.

```python
TEST_TABLE_PARTITION_SPEC = PartitionSpec(
    PartitionField(name="x", transform=IdentityTransform(), source_id=1, field_id=1000)
)
```

The partition field's source_id here is 1, but in `create_table` the schema's field IDs are all -1 due to `_convert_schema_if_needed`, so `assign_fresh_partition_spec_ids` fails.
pyiceberg/partitioning.py, lines 203 to 204 at 102e043:

```python
original_column_name = old_schema.find_column_name(field.source_id)
if original_column_name is None:
```
Hey @kevinjqliu, thank you for flagging this 😄 I think the -1 ID discrepancy is the symptom that makes the issue easy to understand, just as we decided in #305 (comment).
The root cause, I think, is that we are introducing a way for a schema without field IDs (a PyArrow schema) to be used as an input to `create_table`, while not supporting the same for partition_spec and sort_order (PartitionField and SortField both require field IDs as inputs).
So I think we should update both `assign_fresh_partition_spec_ids` and `assign_fresh_sort_order_ids` to support field lookup by name.
@Fokko - does that sound like a good way to resolve this issue?
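A minimal self-contained sketch of the mismatch described above (all names are stand-ins for the real pyiceberg ones):

```python
# After the assumed _convert_schema_if_needed step, every field gets the
# placeholder id -1, so a spec that references source_id=1 no longer resolves.
schema_ids_to_names = {-1: "x"}  # field id -> column name

def find_column_name(source_id: int):
    # Mirrors the lookup-by-id pattern in the partitioning snippet above.
    return schema_ids_to_names.get(source_id)

# The partition field's source_id is 1, but that id no longer exists:
assert find_column_name(1) is None
```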
Created #338 to track this issue
I agree with you @syun64 that having to look up the IDs when creating tables is not ideal. Probably that API has to be extended at some point.
But for the metadata (and also for how Iceberg internally tracks columns, since names can change while IDs cannot), we need to track by ID. I doubt that assigning -1 was the best idea, because it gives you a table that you cannot work with. Thanks for creating the issue, and let's continue there.
Sounds good @Fokko 👍 and thanks again for flagging this @kevinjqliu !
Fokko left a comment:
Looks good 👍
Co-authored-by: Fokko Driesprong <[email protected]>
Thanks for the suggestions, @Fokko
@kevinjqliu, this was likely answered offline and I suppose there was a reason to continue working here. I am also curious whether this catalog still makes sense with in-memory SQLite?
We agreed not to move this implementation to production. See #289 (comment).
Fokko left a comment:
Thanks for working on this @kevinjqliu, this is a great improvement 👍
Issue #293
Improve the InMemory Catalog implementation.
In this PR:
- `create_table` and `_commit_table` functions (default warehouse location: `/tmp/warehouse`)
- Updated `test_base.py` and `test_console.py` to write to a temporary file location on the local file system using `tmp_path` from pytest
- Fixed `test_commit_table` from `tests/catalog/test_base.py`, issue described in "`schema_id` not incremented during schema evolution" #290