Hive catalog: Add retry logic for hive locking #701

frankliee · 2024-05-05T14:13:57Z

In the current Hive catalog implementation, the lock is attempted only once.
If another task already holds a lock on the target table, the attempt inevitably fails.

Since Iceberg-Java already implements retry in org.apache.iceberg.hive.MetastoreLock, we can use it as a reference to improve Iceberg-python.
This PR adds retry and wait logic to the Hive catalog with the following modifications.

add _wait_for_lock on hive catalog
add WaitingForLockException
add a new unit test in which two tasks lock the same table
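
A minimal sketch of the wait-and-retry idea described above, using only the standard library. Only the name WaitingForLockException comes from the PR description. The function wait_for_lock, the check_lock callable, the string states, and the exact backoff rule are hypothetical stand-ins, not the PR's actual code or the Hive metastore API. The default numbers mirror the DEFAULT_LOCK_CHECK_* constants in the first version of the diff.

```python
import time
from typing import Callable


class WaitingForLockException(Exception):
    """Raised when the lock is still in the waiting state after all checks."""


def wait_for_lock(
    check_lock: Callable[[], str],
    retries: int = 5,
    min_wait: float = 2.0,
    max_wait: float = 30.0,
    multiplier: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll check_lock() until it stops reporting "WAITING".

    The wait between checks starts at min_wait, grows by multiplier after
    each check, and is capped at max_wait (an assumed backoff rule).
    """
    wait = min_wait
    for attempt in range(retries):
        state = check_lock()
        if state != "WAITING":
            return state
        # Do not sleep after the final check; fail right away instead.
        if attempt < retries - 1:
            sleep(wait)
            wait = min(wait * multiplier, max_wait)
    raise WaitingForLockException(f"lock still waiting after {retries} checks")
```

Passing sleep as a parameter is a design choice for this sketch: a caller or a test can substitute a no-op and observe the requested waits without actually blocking.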

kevinjqliu

Thanks for the contribution! I added a few comments on TableProperties and testing

kevinjqliu · 2024-05-05T18:32:18Z

pyiceberg/catalog/hive.py

+DEFAULT_LOCK_CHECK_MIN_WAIT_TIME = 2
+DEFAULT_LOCK_CHECK_MAX_WAIT_TIME = 30
+DEFAULT_LOCK_CHECK_RETRIES = 5
+DEFAULT_LOCK_CHECK_MULTIPLIER = 2


wdyt about grouping these configs into TableProperties, along with their default values?

iceberg-python/pyiceberg/table/__init__.py

Line 200 in 7bd5d9e

class TableProperties:

Hi @kevinjqliu. I don't think these should be grouped in TableProperties. TableProperties controls the behavior of a specific table, while CatalogProperties controls the behavior of the catalog instance. In this case, these properties control the behavior of HiveCatalog's lock and thus should be classified as CatalogProperties.

Currently, the convention is to put each catalog's properties in their own files. In this case, they can be in hive.py.
Does this sound good to you? :)

In Iceberg-Java, these properties are in TableProperties, like this:

iceberg.hive.lock-check-min-wait-ms=xxx
iceberg.hive.lock-check-max-wait-ms=xxx
iceberg.hive.lock-timeout-ms=xxx

So would it be better to unify these settings?

In Java, they are in the HadoopConfiguration. The Hadoop configuration for a Hive catalog is set at the catalog level:
https://github.com/apache/iceberg/blob/817a5e1be1616af77329965ac3742c14ca3ae116/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L630

Metastore parses these properties from the Hadoop configuration.

I call them "CatalogProperties" since pyiceberg does not have a Hadoop configuration. Sorry if that creates any confusion.

The only hive-lock-related config in TableProperties is engine.hive.lock-enabled, which can enable or disable the lock for a single table. But that is for https://issues.apache.org/jira/browse/HIVE-26882; see the "warn" section at the end of https://iceberg.apache.org/docs/nightly/configuration/#hadoop-configuration. I don't think we need that in pyiceberg now.

WDYT?

You are right, I see that this setting comes from HadoopConfiguration, so "CatalogProperties" seems more suitable.

https://github.com/apache/hive/blob/1b0e9d9758a0f28c4baf7b1895bf96bf11252f73/iceberg/iceberg-catalog/src/main/java/org/apache/iceberg/hive/MetastoreLock.java#L62

pyiceberg/catalog/hive.py

kevinjqliu · 2024-05-05T18:43:42Z

tests/integration/test_reads.py

+
+        def another_task() -> None:
+            lock1: LockResponse = open_client.lock(session_catalog_hive._create_lock_request(database_name, table_name))
+            time.sleep(5)


nit: time.sleep in a test is typically an anti-pattern; this will add at least 5 seconds to every future run of the test suite.

It might be easier to mock the lock and check_lock functions instead of relying on the timing of the function calls.

Also, maybe add a test case for when _wait_for_lock fails to acquire the lock after retrying.

https://stackoverflow.com/questions/47906671/python-retry-with-tenacity-disable-wait-for-unittest

This might be helpful to override the waiting behavior in retry
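
One way to follow this suggestion with only the standard library is sketched below: replace the client with a unittest.mock.MagicMock whose check_lock returns a scripted sequence of states, and patch time.sleep so that no real waiting happens. The acquire helper and the string states are hypothetical stand-ins for the code under test, not pyiceberg's implementation.

```python
import time
from unittest import mock


def acquire(client, retries: int = 3, wait: float = 5.0) -> bool:
    """Hypothetical code under test: poll client.check_lock() with a fixed wait."""
    client.lock()
    for _ in range(retries):
        if client.check_lock() == "ACQUIRED":
            return True
        time.sleep(wait)
    return False


def run_mocked_checks() -> tuple:
    """Exercise the success and failure paths without any real sleeping."""
    with mock.patch("time.sleep") as fake_sleep:
        # Success path: two waiting responses, then the lock is granted.
        client = mock.MagicMock()
        client.check_lock.side_effect = ["WAITING", "WAITING", "ACQUIRED"]
        acquired = acquire(client)
        sleeps_on_success = fake_sleep.call_count

        # Failure path: the lock never leaves the waiting state.
        stuck = mock.MagicMock()
        stuck.check_lock.return_value = "WAITING"
        gave_up = not acquire(stuck)
    return acquired, sleeps_on_success, gave_up, stuck.check_lock.call_count
```

Because time.sleep is patched, the failure path covers the "retries exhausted" case in milliseconds rather than seconds.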

I have added a new unit test, test_hive_wait_for_lock, in test_hive.py that uses mocked lock and check_lock.

I still kept an integration test based on a real Hive metastore to simulate real-world cases.
To reduce the test latency, I use a finer-grained sleep time instead.

WDYT?

HonahX

Thanks @frankliee for working on this and @kevinjqliu for reviewing!

HonahX · 2024-05-07T03:29:41Z

pyiceberg/catalog/hive.py

+        acquire_lock_timeout = (
+            properties.get(TableProperties.HIVE_ACQUIRE_LOCK_TIMEOUT_MS, TableProperties.HIVE_ACQUIRE_LOCK_TIMEOUT_MS_DEFAULT)
+        ) / 1000.0
+        lock_check_min_wait_time = (
+            properties.get(TableProperties.HIVE_LOCK_CHECK_MIN_WAIT_MS, TableProperties.HIVE_LOCK_CHECK_MIN_WAIT_MS_DEFAULT)
+        ) / 1000.0
+        lock_check_max_wait_time = (
+            properties.get(TableProperties.HIVE_LOCK_CHECK_MAX_WAIT_MS, TableProperties.HIVE_LOCK_CHECK_MAX_WAIT_MS_DEFAULT)
+        ) / 1000.0


We could use PropertyUtil.property_as_int to get the values. This utility method raises an error with a message when it fails to parse the property.

BTW: You would probably need # type: ignore at the end of each PropertyUtil.property_as_int call to pass the linter.

I have added property_as_float to allow fine-grained setting.
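
For illustration, a float-parsing helper along these lines could look like the sketch below. This is a hypothetical standalone function written for this thread, not the actual pyiceberg implementation; the property key in the usage note is taken from a later revision of the diff.

```python
from typing import Dict, Optional


def property_as_float(
    properties: Dict[str, str], name: str, default: Optional[float] = None
) -> Optional[float]:
    """Return properties[name] parsed as a float, or default when the key is absent."""
    value = properties.get(name)
    if value is None:
        return default
    try:
        return float(value)
    except ValueError as e:
        # Surface the offending key and value instead of a bare conversion error.
        raise ValueError(f"Could not parse property {name!r}: {value!r} is not a number") from e
```

With this sketch, property_as_float({"lock_check_min_wait_time": "0.1"}, "lock_check_min_wait_time", 2.0) parses the string into a float, while an empty dict falls back to the default of 2.0.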

HonahX · 2024-05-07T04:00:40Z

pyiceberg/catalog/hive.py

+DEFAULT_LOCK_CHECK_MIN_WAIT_TIME = 2
+DEFAULT_LOCK_CHECK_MAX_WAIT_TIME = 30
+DEFAULT_LOCK_CHECK_RETRIES = 5
+DEFAULT_LOCK_CHECK_MULTIPLIER = 2


HonahX

LGTM!

Fokko · 2024-05-08T21:13:26Z

pyiceberg/catalog/hive.py

+LOCK_CHECK_MIN_WAIT_TIME = "lock_check_min_wait_time"
+LOCK_CHECK_MAX_WAIT_TIME = "lock_check_max_wait_time"
+LOCK_CHECK_RETRIES = "lock_check_retries"
+DEFAULT_LOCK_CHECK_MIN_WAIT_TIME = 2
+DEFAULT_LOCK_CHECK_MAX_WAIT_TIME = 30
+DEFAULT_LOCK_CHECK_RETRIES = 10


Should we align these properties with Java:
https://iceberg.apache.org/docs/nightly/configuration/#write-properties

A good idea, I have updated these default values.

HonahX · 2024-05-15T07:13:14Z

Looks like all the review comments are addressed. I'll merge this. Thanks everyone!

kevinjqliu reviewed May 5, 2024

HonahX reviewed May 7, 2024

frankliee added 2 commits May 7, 2024 16:38

lock with retry

e688139

fix comment

92e96fa

HonahX approved these changes May 7, 2024

Fokko reviewed May 8, 2024

fix comment2

2bcfb02

HonahX merged commit 6d52325 into apache:main May 15, 2024
