-
Notifications
You must be signed in to change notification settings - Fork 393
Description
Apache Iceberg version
main (development)
Please describe the bug 🐞
This is a harder one:
I am currently unhappy with the way pyiceberg handles the RestCatalogs name property.
There is one probably undisputed bug here:
iceberg-python/pyiceberg/catalog/rest.py
Line 356 in 20b7b53
| return {"namespace": NAMESPACE_SEPARATOR.join(identifier.namespace.root[1:]), "table": identifier.name} |
In the previous line we cut the first segment of the namespace (presumably because this is the name) - even if the warehouse name is None. If there warehouse name is None, it isn't even added to the identifier namespace:
iceberg-python/pyiceberg/catalog/rest.py
Line 464 in 20b7b53
| identifier=(self.name,) + identifier_tuple if self.name else identifier_tuple, |
(notice the
if self.name)
Thus, when name is None or "", the first bit of the namespace identifier is removed in L356 (first snippet) wrongly.
If the catalog has a name, we could argue that removing it is correct.
However there is a second Problem:
I would argue that the RestCatalogs name should NEVER be prepended to a Tables namespace identifier in the first place.
And pyiceberg has this opinion too for 50% ;) - unfortunately the namespace provided in the url and in some requests diverge.
With the script at the bottom of this issue I receive the following request to the endpoint POST http://localhost:8080/catalog/v1/2cb6e8f0-0b7e-11ef-a4fb-c3ca318d8520/namespaces/my_namespace/tables/my_table (so namespace "my_namespace"):
{
"identifier": {
"namespace": [
"my_catalog_name",
"my_namespace"
],
"name": "my_table"
},
"requirements": [
...
],
"updates": [
...
]
}Notice how the namespace provided in the body is different from the namespace in the URL. The URL is not prefixed (due to the [:1] that the path magic function does as seen above), while the body contains a prefixed namespace.
My solution would be to get rid of the name as part of the namespace altogether, thus removing the name prefix here:
iceberg-python/pyiceberg/catalog/rest.py
Line 464 in 20b7b53
| identifier=(self.name,) + identifier_tuple if self.name else identifier_tuple, |
And consequently also the remove the [1:] here:
iceberg-python/pyiceberg/catalog/rest.py
Line 356 in 20b7b53
| return {"namespace": NAMESPACE_SEPARATOR.join(identifier.namespace.root[1:]), "table": identifier.name} |
Any thoughts welcome!
My baseline is that at least the URL and the body should match.
My code:
import pandas as pd
import pyarrow as pa
from pyiceberg.catalog.rest import RestCatalog
catalog = RestCatalog(
name="my_catalog_name",
uri="http://localhost:8080/catalog/",
warehouse="test",
# For now we need a dummy token even for unauthenticated catalogs.
# This is a requirement of pyiceberg.
token="dummy",
)
# Lets work with the catalog
namespace = ("my_namespace",)
if namespace not in catalog.list_namespaces():
catalog.create_namespace(namespace)
df = pd.DataFrame(
[[1, 1.2, "foo"], [2, 2.2, "bar"]], columns=["my_ints", "my_floats", "strings"]
)
tbl = pa.Table.from_pandas(df)
table = catalog.create_table((*namespace, "my_table"), schema=tbl.schema)
table.overwrite(tbl) # <-- This is the important part