Concern about possible consistency issue in HiveCatalog's _commit_table #588

@HonahX

Question

Currently, the HiveCatalog's _commit_table workflow looks like:

  1. load current table metadata via load_table
  2. construct updated metadata
  3. lock the hive table
  4. alter the hive table
  5. unlock the hive table

Suppose two processes, A and B, try to commit changes to the same Iceberg table. The code may then happen to execute in the following order:

  1. process A loads the current table metadata
  2. process A constructs the updated metadata
  3. process B starts and finishes the whole _commit_table
  4. process A locks the hive table
  5. process A alters the hive table
  6. process A unlocks the hive table

In this specific scenario, both processes successfully commit their changes because process B releases the lock before A tries to acquire it. But if alter_table does not support a transactional check, process A's commit, which was built on metadata loaded before B committed, will overwrite the changes made by process B.
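To make the interleaving concrete, here is a minimal sketch that replays it against an in-memory stand-in. `FakeMetastore`, `commit_load_then_lock`, and the `before_lock` hook are hypothetical names invented for this illustration; this is not PyIceberg's actual code, and it assumes an alter_table that overwrites unconditionally.

```python
import threading


class FakeMetastore:
    """Hypothetical in-memory stand-in for a metastore whose
    alter_table performs no transactional check."""

    def __init__(self):
        self.lock = threading.Lock()
        self.metadata = ()  # stands in for the table metadata

    def load_table(self):
        return self.metadata

    def alter_table(self, new_metadata):
        # Unconditional overwrite: nothing verifies that the caller
        # built new_metadata on top of the current state.
        self.metadata = new_metadata


def commit_load_then_lock(store, change, before_lock=lambda: None):
    base = store.load_table()        # 1. load current table metadata
    updated = base + (change,)       # 2. construct updated metadata
    before_lock()                    # hook used to replay the interleaving
    with store.lock:                 # 3. lock the table
        store.alter_table(updated)   # 4. alter the table
    # 5. the lock is released when the with-block exits


store = FakeMetastore()
# Process B runs its whole commit after A has loaded and built its
# metadata, but before A takes the lock.
commit_load_then_lock(
    store, "A", before_lock=lambda: commit_load_then_lock(store, "B")
)
lost_update_result = store.metadata
print(lost_update_result)  # ('A',)
```

Both calls return without error, yet only A's change is left in the final metadata.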

Since in Python we do not know which Hive version we are connecting to, I wonder if we need to update the code to lock the table before loading the current table metadata, as the Java implementation does.
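A rough sketch of the lock-before-load ordering, using a hypothetical in-memory stand-in (`FakeMetastore`, `commit_lock_then_load`, and the `after_lock` hook are made up for illustration and are not the PyIceberg or Hive API). It assumes the lock is exclusive and is held across both the load and the alter:

```python
import threading


class FakeMetastore:
    """Hypothetical in-memory stand-in for a metastore whose
    alter_table performs no transactional check."""

    def __init__(self):
        self.lock = threading.Lock()
        self.metadata = ()

    def load_table(self):
        return self.metadata

    def alter_table(self, new_metadata):
        self.metadata = new_metadata


def commit_lock_then_load(store, change, after_lock=lambda: None):
    with store.lock:                        # lock first
        after_lock()                        # hook: start a competing commit
        base = store.load_table()           # load while holding the lock
        store.alter_table(base + (change,)) # alter while holding the lock
    # the lock is released when the with-block exits


store = FakeMetastore()
process_b = threading.Thread(target=commit_lock_then_load, args=(store, "B"))
# B is started while A already holds the lock, so B has to wait and
# only loads the metadata after A's commit is visible.
commit_lock_then_load(store, "A", after_lock=process_b.start)
process_b.join()
serialized_result = store.metadata
print(serialized_result)  # ('A', 'B')
```

Because the load happens inside the locked section, B cannot read stale metadata, and both changes survive even without a transactional alter_table.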

BTW, it seems there are some consistency issues with https://issues.apache.org/jira/browse/HIVE-26882 as well, and there is an open fix for that: apache/hive#5129

Please correct me if I misunderstand something here. Thanks!

cc: @Fokko
