Skip to content

Conversation

@frankliee
Copy link
Contributor

@frankliee frankliee commented May 5, 2024

In the current hive catalog implementation, locking will only be executed once.
If another task has already locked the target table, the locking will fail inevitably.

Since Iceberg-Java has implemented retry on org.apache.iceberg.hive.MetastoreLock, we could refer this to improve Iceberg-python.
This PR adds retry and wait logic for hive catalog with the following modifications.

  1. add _wait_for_lock on hive catalog
  2. add WaitingForLockException
  3. add a new UT that two tasks are locking the same table.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! I added a few comments on TableProperties and testing

Comment on lines 125 to 128
DEFAULT_LOCK_CHECK_MIN_WAIT_TIME = 2
DEFAULT_LOCK_CHECK_MAX_WAIT_TIME = 30
DEFAULT_LOCK_CHECK_RETRIES = 5
DEFAULT_LOCK_CHECK_MULTIPLIER = 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wdyt about grouping these configs into TableProperties, along with their default value

class TableProperties:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kevinjqliu. I don't think these should be grouped in TableProperties. TableProperties controls the behavior of a specifc table while CatalogProperties controls the behavior of the catalog instance. In this case, these properties controls the behavior of HiveCatalog's Lock and thus should be classified as CatalogProperties.

Currently, the convention is to put each catalog's properties in their own files. In this case, they can be in hive.py.
Does this sound good to you? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Iceberg-Java, these props are in TableProperties like that.

iceberg.hive.lock-check-min-wait-ms=xxx
iceberg.hive.lock-check-max-wait-ms=xxx
iceberg.hive.lock-timeout-ms=xxx

So, it is better to unify this setting?

Copy link
Contributor

@HonahX HonahX May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In java, they are in the HadoopConfiguration. The hadoop configuration for a hive catalog is set at the catalog-level:
https://github.com/apache/iceberg/blob/817a5e1be1616af77329965ac3742c14ca3ae116/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L630

Metastore parse these properties from the hadoop configuration

I call them "CatalogProperties" since pyiceberg does not have hadoop configuration. Sorry if that creates any confusion.

The only hive-lock-related config in TableProperties is engine.hive.lock-enabled which can enable/disable lock for a single table. But that is for https://issues.apache.org/jira/browse/HIVE-26882, see the "warn" section in the end of https://iceberg.apache.org/docs/nightly/configuration/#hadoop-configuration. I don't think we need that in pyiceberg now.

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I have see this setting comes from HadoopConfiguration, "CatalogProperties" seems more suitable.

https://github.com/apache/hive/blob/1b0e9d9758a0f28c4baf7b1895bf96bf11252f73/iceberg/iceberg-catalog/src/main/java/org/apache/iceberg/hive/MetastoreLock.java#L62


def another_task() -> None:
lock1: LockResponse = open_client.lock(session_catalog_hive._create_lock_request(database_name, table_name))
time.sleep(5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: time.sleep in test is typically an anti-pattern, this will add at least 5 seconds to the test suite in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be easier to mock lock and check_lock functions instead of relying on the timing of the function calls

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also maybe add a test case for when _wait_for_lock failed to acquire locks after retry

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://stackoverflow.com/questions/47906671/python-retry-with-tenacity-disable-wait-for-unittest

This might be helpful to override the waiting behavior in retry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a new unit test test_hive_wait_for_lock in test_hive.py that uses mocked lock and check_lock.

But I still keep an integration test based on the real hive metastore to simulate real-world cases.
In order to reduce the test latency, I use fine-grained sleep time instead.

WDYT?

Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @frankliee for working on this and @kevinjqliu for reviewing!

Comment on lines 377 to 385
acquire_lock_timeout = (
properties.get(TableProperties.HIVE_ACQUIRE_LOCK_TIMEOUT_MS, TableProperties.HIVE_ACQUIRE_LOCK_TIMEOUT_MS_DEFAULT)
) / 1000.0
lock_check_min_wait_time = (
properties.get(TableProperties.HIVE_LOCK_CHECK_MIN_WAIT_MS, TableProperties.HIVE_LOCK_CHECK_MIN_WAIT_MS_DEFAULT)
) / 1000.0
lock_check_max_wait_time = (
properties.get(TableProperties.HIVE_LOCK_CHECK_MAX_WAIT_MS, TableProperties.HIVE_LOCK_CHECK_MAX_WAIT_MS_DEFAULT)
) / 1000.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use PropertyUtil.property_as_int to get the values. This utl methods throws error message when failing to parse the property.

BTW: You would properly need #type: ignore at the end of each PropertyUti.property_as_int to pass the linter.

Copy link
Contributor Author

@frankliee frankliee May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added property_as_float to allow fine-grained setting.

Comment on lines 125 to 128
DEFAULT_LOCK_CHECK_MIN_WAIT_TIME = 2
DEFAULT_LOCK_CHECK_MAX_WAIT_TIME = 30
DEFAULT_LOCK_CHECK_RETRIES = 5
DEFAULT_LOCK_CHECK_MULTIPLIER = 2
Copy link
Contributor

@HonahX HonahX May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In java, they are in the HadoopConfiguration. The hadoop configuration for a hive catalog is set at the catalog-level:
https://github.com/apache/iceberg/blob/817a5e1be1616af77329965ac3742c14ca3ae116/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L630

Metastore parse these properties from the hadoop configuration

I call them "CatalogProperties" since pyiceberg does not have hadoop configuration. Sorry if that creates any confusion.

The only hive-lock-related config in TableProperties is engine.hive.lock-enabled which can enable/disable lock for a single table. But that is for https://issues.apache.org/jira/browse/HIVE-26882, see the "warn" section in the end of https://iceberg.apache.org/docs/nightly/configuration/#hadoop-configuration. I don't think we need that in pyiceberg now.

WDYT?

Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment on lines 125 to 130
LOCK_CHECK_MIN_WAIT_TIME = "lock_check_min_wait_time"
LOCK_CHECK_MAX_WAIT_TIME = "lock_check_max_wait_time"
LOCK_CHECK_RETRIES = "lock_check_retries"
DEFAULT_LOCK_CHECK_MIN_WAIT_TIME = 2
DEFAULT_LOCK_CHECK_MAX_WAIT_TIME = 30
DEFAULT_LOCK_CHECK_RETRIES = 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we align these properties with Java:

image https://iceberg.apache.org/docs/nightly/configuration/#write-properties

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good idea, I have updated these default values.

@HonahX
Copy link
Contributor

HonahX commented May 15, 2024

Looks like all the review comments are addressed. I'll merge this. Thanks everyone!

@HonahX HonahX merged commit 6d52325 into apache:main May 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants