Improve the InMemory Catalog Implementation #289
Conversation
pyiceberg/catalog/in_memory.py
Outdated
```python
from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
from pyiceberg.typedef import EMPTY_DICT


DEFAULT_WAREHOUSE_LOCATION = "file:///tmp/warehouse"
```
By default, this writes to disk at `/tmp/warehouse`.
pyiceberg/catalog/in_memory.py
Outdated
```python
super().__init__(name, **properties)
self.__tables = {}
self.__namespaces = {}
self._warehouse_location = properties.get(WAREHOUSE, None) or DEFAULT_WAREHOUSE_LOCATION
```
A warehouse location can be passed in via properties; it can also point to another filesystem, such as S3.
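A minimal sketch of the constructor behavior shown in the diff above (the class body is trimmed to the warehouse-location handling; `WAREHOUSE` and the default are the names from the diff):

```python
# Sketch only: trimmed stand-in for the real InMemoryCatalog constructor.
WAREHOUSE = "warehouse"
DEFAULT_WAREHOUSE_LOCATION = "file:///tmp/warehouse"

class InMemoryCatalog:
    def __init__(self, name: str, **properties: str) -> None:
        self.name = name
        self.properties = properties
        # Fall back to the local default when no warehouse is configured.
        self._warehouse_location = properties.get(WAREHOUSE, None) or DEFAULT_WAREHOUSE_LOCATION

# The warehouse can point at another filesystem, e.g. S3:
catalog = InMemoryCatalog("test", **{WAREHOUSE: "s3://my-bucket/warehouse"})
print(catalog._warehouse_location)  # s3://my-bucket/warehouse
```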
```diff
 @pytest.fixture
-def catalog() -> InMemoryCatalog:
-    return InMemoryCatalog("test.in.memory.catalog", **{"test.key": "test.value"})
+def catalog(tmp_path: PosixPath) -> InMemoryCatalog:
```
Added the ability to write to a temporary location for testing, which is then cleaned up automatically.
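A sketch of such a fixture (the catalog here is a stand-in dictionary; `tmp_path` is pytest's built-in per-test temporary directory, removed automatically after each test):

```python
import pytest
from pathlib import Path

def make_catalog(warehouse_dir: Path) -> dict:
    # Stand-in for InMemoryCatalog: only the warehouse property matters here.
    return {"name": "test.in.memory.catalog", "warehouse": warehouse_dir.as_uri()}

@pytest.fixture
def catalog(tmp_path: Path) -> dict:
    # tmp_path is unique per test and cleaned up by pytest afterwards.
    return make_catalog(tmp_path)
```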
tests/catalog/test_base.py
Outdated
```python
assert response.metadata.table_uuid == given_table.metadata.table_uuid
assert len(response.metadata.schemas) == 1
assert response.metadata.schemas[0] == new_schema
assert given_table.metadata.current_schema_id == 1
```
HonahX left a comment:
Thanks for your contribution! @kevinjqliu I left some initial comments below.
Thanks for working on this @kevinjqliu. The issue was created a long time ago, before we had the SqlCatalog with SQLite support. SQLite can also work in memory, rendering the …
Fokko left a comment:
Great work @kevinjqliu
Should we also add this catalog to the tests in tests/integration/test_reads.py?
pyiceberg/catalog/__init__.py
Outdated
```python
CatalogType.GLUE: load_glue,
CatalogType.DYNAMODB: load_dynamodb,
CatalogType.SQL: load_sql,
CatalogType.MEMORY: load_memory,
```
Can you also add this one to the docs (https://py.iceberg.apache.org/configuration/), with a warning that it is for testing purposes only?
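For context, the mapping in the diff above follows a simple registry pattern: a catalog type maps to a loader function. A self-contained sketch with stand-in names (not the real pyiceberg signatures):

```python
from enum import Enum

class CatalogType(Enum):
    GLUE = "glue"
    DYNAMODB = "dynamodb"
    SQL = "sql"
    MEMORY = "in-memory"

def load_memory(name: str, **properties: str) -> str:
    # Stand-in loader; the real one would construct an InMemoryCatalog.
    return f"InMemoryCatalog(name={name})"

# Registry mapping catalog types to their loader functions.
AVAILABLE_CATALOGS = {
    CatalogType.MEMORY: load_memory,
}

catalog = AVAILABLE_CATALOGS[CatalogType.MEMORY]("test")
```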
pyiceberg/catalog/in_memory.py
Outdated
```python
if identifier in self.__tables:
    raise TableAlreadyExistsError(f"Table already exists: {identifier}")
else:
    if namespace not in self.__namespaces:
```
Other implementations don't auto-create namespaces; however, I think it is fine for the InMemory one.
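A sketch of the auto-create behavior under discussion (a simplified stand-in for the real class; identifiers are tuples like `("ns", "table")`):

```python
class TableAlreadyExistsError(Exception):
    pass

class InMemoryCatalog:
    def __init__(self) -> None:
        self._tables = {}
        self._namespaces = {}

    def create_table(self, identifier: tuple) -> tuple:
        namespace = identifier[:-1]
        if identifier in self._tables:
            raise TableAlreadyExistsError(f"Table already exists: {identifier}")
        if namespace not in self._namespaces:
            # Unlike other implementations, create the namespace implicitly.
            self._namespaces[namespace] = {}
        self._tables[identifier] = object()
        return identifier

catalog = InMemoryCatalog()
catalog.create_table(("ns", "tbl"))
assert ("ns",) in catalog._namespaces  # namespace was auto-created
```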
pyiceberg/catalog/in_memory.py
Outdated
```python
if not location:
    location = f'{self._warehouse_location}/{"/".join(identifier)}'

metadata_location = f'{self._warehouse_location}/{"/".join(identifier)}/metadata/metadata.json'
```
It looks like we don't write the metadata here, but in the `_commit_table` method below.
Yep, the actual writing is done by `_commit_table` below, but the path of the metadata location is determined here.
Sorry, but I'm a bit confused here. If I just want to create a table without inserting any data:
`catalog.create_table(schema, ...)`
I still expect a new metadata.json file to be found at the table location, without any call to `_commit_table`. But that does not seem to be created by the InMemory catalog now. Is there a reason we chose this behavior?
In the previous implementation no file was written. But since we have updated `_commit_table` to write the metadata file, I think it is more reasonable to align `create_table` with the other production implementations. WDYT?
gotcha, that makes sense!
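To make the thread concrete, a small sketch of how the metadata location in the diff above is derived from the warehouse location and the table identifier (the values are examples):

```python
# Reproduces the path construction from the diff with example values.
warehouse_location = "file:///tmp/warehouse"
identifier = ("default", "my_table")

location = f'{warehouse_location}/{"/".join(identifier)}'
metadata_location = f'{location}/metadata/metadata.json'

print(metadata_location)
# file:///tmp/warehouse/default/my_table/metadata/metadata.json
```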
HonahX left a comment:
Overall LGTM! Thanks for updating this to a formal implementation and adding the doc.
I just have one more comment about create_table
```diff
 def text(self, response: str) -> None:
-    Console().print(response)
+    Console(soft_wrap=True).print(response)
```
Some test_console.py outputs are too long and end up with an extra `\n` in the middle of the string, causing tests to fail.
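A sketch of the difference (assuming the `rich` library, which provides `Console`): without `soft_wrap=True`, output longer than the console width is folded, which inserts newlines mid-string.

```python
from io import StringIO
from rich.console import Console

long_line = "x" * 200  # longer than the 80-column console below

hard = StringIO()
Console(file=hard, width=80).print(long_line)  # folded at width 80

soft = StringIO()
Console(file=soft, width=80, soft_wrap=True).print(long_line)  # kept on one line

assert "\n" in hard.getvalue().rstrip("\n")       # extra newlines were inserted
assert "\n" not in soft.getvalue().rstrip("\n")   # string left intact
```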
```python
identifier=TEST_TABLE_IDENTIFIER,
schema=pyarrow_schema_simple_without_ids,
location=TEST_TABLE_LOCATION,
partition_spec=TEST_TABLE_PARTITION_SPEC,
```
@syun64 FYI, I realized that the TEST_TABLE_PARTITION_SPEC here breaks this test.

```python
TEST_TABLE_PARTITION_SPEC = PartitionSpec(
    PartitionField(name="x", transform=IdentityTransform(), source_id=1, field_id=1000)
)
```

The partition field's source_id here is 1, but in `create_table` the schema's field IDs are all -1 due to `_convert_schema_if_needed`, so `assign_fresh_partition_spec_ids` fails.
pyiceberg/partitioning.py, lines 203 to 204 at 102e043:

```python
original_column_name = old_schema.find_column_name(field.source_id)
if original_column_name is None:
```
Hey @kevinjqliu, thank you for flagging this 😄 I think the -1 ID discrepancy is the symptom that makes the issue easy to understand, just as we decided in #305 (comment).
The root cause, I think, is that we are introducing a way for a schema without field IDs (a PyArrow schema) to be used as an input to `create_table`, while not supporting the same for partition_spec and sort_order (PartitionField and SortField both require field IDs as inputs).
So I think we should update both `assign_fresh_partition_spec_ids` and `assign_fresh_sort_order_ids` to support field lookup by name.
@Fokko - does that sound like a good way to resolve this issue?
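A minimal self-contained sketch of the mismatch described above (all names are stand-ins for the real pyiceberg ones):

```python
# After the assumed _convert_schema_if_needed step, every field gets the
# placeholder id -1, so a spec that references source_id=1 no longer resolves.
schema_ids_to_names = {-1: "x"}  # field id -> column name

def find_column_name(source_id: int):
    # Mirrors the lookup-by-id pattern in the partitioning snippet above.
    return schema_ids_to_names.get(source_id)

# The partition field's source_id is 1, but that id no longer exists:
assert find_column_name(1) is None
```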
Created #338 to track this issue
I agree with you @syun64 that having to look up the IDs when creating tables is not ideal. Probably that API has to be extended at some point.
But for the metadata (and also for how Iceberg internally tracks columns, since names can change while IDs cannot), we need to track by ID. I doubt that assigning -1 was the best idea, because it gives you a table that you cannot work with. Thanks for creating the issue, and let's continue there.
Sounds good @Fokko 👍 and thanks again for flagging this @kevinjqliu !
Fokko left a comment:
Looks good 👍
Co-authored-by: Fokko Driesprong <[email protected]>
Thanks for the suggestions, @Fokko
@kevinjqliu, this was likely answered offline and I suppose there was a reason to continue working here. I am also curious whether this catalog still makes sense with in-memory SQLite?
We agreed not to move this implementation to production. See #289 (comment).
Fokko left a comment:
Thanks for working on this @kevinjqliu, this is a great improvement 👍
Issue #293
Improve the InMemory Catalog implementation.
In this PR:
- `create_table` and `_commit_table` functions (default warehouse location: `/tmp/warehouse`)
- Updated `test_base.py` and `test_console.py` to write to a temporary file location on the local file system using `tmp_path` from pytest
- Fixed `test_commit_table` from `tests/catalog/test_base.py`, issue described in "`schema_id` not incremented during schema evolution" #290