Skip to content

Rest Catalog: catalog.name should not be part of namespace #742

@c-thiel

Description

@c-thiel

Apache Iceberg version

main (development)

Please describe the bug 🐞

This is a harder one:

I am currently unhappy with the way pyiceberg handles the RestCatalogs name property.
There is one probably undisputed bug here:

return {"namespace": NAMESPACE_SEPARATOR.join(identifier.namespace.root[1:]), "table": identifier.name}

In the previous line we cut the first segment of the namespace (presumably because this is the name) - even if the warehouse name is None. If there warehouse name is None, it isn't even added to the identifier namespace:

identifier=(self.name,) + identifier_tuple if self.name else identifier_tuple,

(notice the if self.name)

Thus, when name is None or "", the first bit of the namespace identifier is removed in L356 (first snippet) wrongly.
If the catalog has a name, we could argue that removing it is correct.

However there is a second Problem:
I would argue that the RestCatalogs name should NEVER be prepended to a Tables namespace identifier in the first place.
And pyiceberg has this opinion too for 50% ;) - unfortunately the namespace provided in the url and in some requests diverge.
With the script at the bottom of this issue I receive the following request to the endpoint POST http://localhost:8080/catalog/v1/2cb6e8f0-0b7e-11ef-a4fb-c3ca318d8520/namespaces/my_namespace/tables/my_table (so namespace "my_namespace"):

{
    "identifier": {
        "namespace": [
            "my_catalog_name",
            "my_namespace"
        ],
        "name": "my_table"
    },
    "requirements": [
        ...
    ],
    "updates": [
        ...
    ]
}

Notice how the namespace provided in the body is different from the namespace in the URL. The URL is not prefixed (due to the [:1] that the path magic function does as seen above), while the body contains a prefixed namespace.

My solution would be to get rid of the name as part of the namespace altogether, thus removing the name prefix here:

identifier=(self.name,) + identifier_tuple if self.name else identifier_tuple,

And consequently also the remove the [1:] here:

return {"namespace": NAMESPACE_SEPARATOR.join(identifier.namespace.root[1:]), "table": identifier.name}

Any thoughts welcome!
My baseline is that at least the URL and the body should match.

My code:

import pandas as pd
import pyarrow as pa
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="my_catalog_name",
    uri="http://localhost:8080/catalog/",
    warehouse="test",
    # For now we need a dummy token even for unauthenticated catalogs.
    # This is a requirement of pyiceberg.
    token="dummy",
)

# Lets work with the catalog
namespace = ("my_namespace",)
if namespace not in catalog.list_namespaces():
    catalog.create_namespace(namespace)

df = pd.DataFrame(
    [[1, 1.2, "foo"], [2, 2.2, "bar"]], columns=["my_ints", "my_floats", "strings"]
)
tbl = pa.Table.from_pandas(df)
table = catalog.create_table((*namespace, "my_table"), schema=tbl.schema)
table.overwrite(tbl) # <-- This is the important part

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions