Description
Question
Currently, the HiveCatalog's _commit_table workflow looks like:
- load current table metadata via load_table
- construct updated metadata
- lock the hive table
- alter the hive table
- unlock the hive table
Suppose there are two processes, A and B, trying to commit changes to the same Iceberg table. The code execution may happen in the following order:
- process A loads the current table metadata
- process A constructs the updated metadata
- process B starts and finishes the whole _commit_table
- process A locks the hive table
- process A alters the hive table
- process A unlocks the hive table
In this scenario, both processes commit successfully, because process B releases the lock before A tries to acquire it. But process A built its updated metadata from the state it loaded before B's commit, so if alter_table does not perform a transactional check, the changes made by process B will be overwritten.
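To make the interleaving concrete, here is a minimal sketch that reproduces it deterministically. It is a toy in-memory model, not PyIceberg or Hive code: `ToyCatalog`, `commit_lock_after_load`, and the `before_lock` hook are all hypothetical names, and a `threading.Lock` stands in for the metastore lock. The hook lets process B run a whole commit in the gap between A constructing its metadata and A taking the lock.

```python
import threading


class ToyCatalog:
    """Hypothetical in-memory stand-in for a Hive table.

    Its alter_table overwrites blindly, modelling a metastore
    without a transactional check.
    """

    def __init__(self):
        self.lock = threading.Lock()
        self.metadata = {"snapshots": []}

    def load_table(self):
        # Return a copy so callers work on their own view of the metadata.
        return {"snapshots": list(self.metadata["snapshots"])}

    def alter_table(self, new_metadata):
        # No comparison against the metadata currently stored.
        self.metadata = new_metadata


def commit_lock_after_load(catalog, snapshot, before_lock=lambda: None):
    base = catalog.load_table()                              # load current metadata
    updated = {"snapshots": base["snapshots"] + [snapshot]}  # construct updated metadata
    before_lock()                                            # hypothetical hook: another process may run here
    with catalog.lock:                                       # lock ... unlock
        catalog.alter_table(updated)                         # alter


catalog = ToyCatalog()
# Process A commits; process B performs a complete commit just before A locks.
commit_lock_after_load(
    catalog, "A", before_lock=lambda: commit_lock_after_load(catalog, "B")
)
print(catalog.metadata["snapshots"])  # ['A']  (B's snapshot is lost)
```

Both calls return without error, yet only A's snapshot remains, which is the lost update described above.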
Since in Python we do not know which Hive version we are connecting to, I wonder whether we should update the code to lock the table before loading the current table metadata, as the Java implementation does.
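A rough sketch of the lock-before-load ordering, again on a toy in-memory model rather than the real HiveCatalog (`ToyCatalog` and `commit_lock_before_load` are hypothetical names, and a `threading.Lock` stands in for the metastore lock). Because the load, the metadata construction, and the alter all happen while the lock is held, a second committer cannot slip in between the load and the alter.

```python
import threading


class ToyCatalog:
    """Hypothetical in-memory stand-in for a Hive table with a blind alter_table."""

    def __init__(self):
        self.lock = threading.Lock()
        self.metadata = {"snapshots": []}

    def load_table(self):
        # Return a copy so callers work on their own view of the metadata.
        return {"snapshots": list(self.metadata["snapshots"])}

    def alter_table(self, new_metadata):
        # Still no transactional check; correctness relies on the lock alone.
        self.metadata = new_metadata


def commit_lock_before_load(catalog, snapshot):
    with catalog.lock:                                           # lock first
        base = catalog.load_table()                              # load under the lock
        updated = {"snapshots": base["snapshots"] + [snapshot]}  # construct updated metadata
        catalog.alter_table(updated)                             # alter, then unlock on exit


catalog = ToyCatalog()
threads = [
    threading.Thread(target=commit_lock_before_load, args=(catalog, name))
    for name in ("A", "B")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The commit order may vary between runs, but neither commit is lost.
print(sorted(catalog.metadata["snapshots"]))  # ['A', 'B']
```

The trade-off in this sketch is that the lock is held for longer, since it now also covers the metadata load.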
BTW, there seem to be some consistency issues with https://issues.apache.org/jira/browse/HIVE-26882 as well, and there is an open fix for them: apache/hive#5129
Please correct me if I misunderstand something here. Thanks!
cc: @Fokko