Hive catalog: Add retry logic for hive locking #701
Conversation
kevinjqliu
left a comment
Thanks for the contribution! I added a few comments on TableProperties and testing
pyiceberg/catalog/hive.py
Outdated
```python
DEFAULT_LOCK_CHECK_MIN_WAIT_TIME = 2
DEFAULT_LOCK_CHECK_MAX_WAIT_TIME = 30
DEFAULT_LOCK_CHECK_RETRIES = 5
DEFAULT_LOCK_CHECK_MULTIPLIER = 2
```
wdyt about grouping these configs into TableProperties, along with their default values?
iceberg-python/pyiceberg/table/__init__.py (line 200 in 7bd5d9e):

```python
class TableProperties:
```
Hi @kevinjqliu. I don't think these should be grouped in TableProperties. TableProperties controls the behavior of a specific table, while CatalogProperties controls the behavior of the catalog instance. In this case, these properties control the behavior of HiveCatalog's lock and thus should be classified as CatalogProperties.
Currently, the convention is to put each catalog's properties in its own file; in this case, they can live in hive.py.
Does this sound good to you? :)
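For illustration, catalog-level properties like these would be supplied when the catalog is loaded rather than set on any single table; a minimal sketch (the lock_check_* keys and values below are illustrative, not necessarily the final names defined by this PR):

```python
from pyiceberg.catalog import load_catalog

# Catalog-level configuration: lock tuning sits next to the other HiveCatalog
# settings instead of in table properties (lock_check_* keys are illustrative).
catalog = load_catalog(
    "hive",
    **{
        "type": "hive",
        "uri": "thrift://localhost:9083",
        "lock_check_min_wait_time": "0.1",
        "lock_check_max_wait_time": "60",
        "lock_check_retries": "4",
    },
)
```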
In Iceberg-Java, these props are in TableProperties, like this:
iceberg.hive.lock-check-min-wait-ms=xxx
iceberg.hive.lock-check-max-wait-ms=xxx
iceberg.hive.lock-timeout-ms=xxx
So, would it be better to unify these settings?
In Java, they are in the HadoopConfiguration. The hadoop configuration for a hive catalog is set at the catalog level:
https://github.com/apache/iceberg/blob/817a5e1be1616af77329965ac3742c14ca3ae116/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L630
The metastore parses these properties from the hadoop configuration.
I call them "CatalogProperties" since pyiceberg does not have a hadoop configuration. Sorry if that creates any confusion.
The only hive-lock-related config in TableProperties is engine.hive.lock-enabled, which can enable/disable the lock for a single table. But that is for https://issues.apache.org/jira/browse/HIVE-26882; see the "warn" section at the end of https://iceberg.apache.org/docs/nightly/configuration/#hadoop-configuration. I don't think we need that in pyiceberg now.
WDYT?
You are right. I see now that this setting comes from the HadoopConfiguration; "CatalogProperties" seems more suitable.
tests/integration/test_reads.py
Outdated
```python
def another_task() -> None:
    lock1: LockResponse = open_client.lock(session_catalog_hive._create_lock_request(database_name, table_name))
    time.sleep(5)
```
nit: time.sleep in tests is typically an anti-pattern; this will add at least 5 seconds to the test suite going forward.
It might be easier to mock the lock and check_lock functions instead of relying on the timing of the function calls.
Also, maybe add a test case for when _wait_for_lock fails to acquire the lock after retrying.
https://stackoverflow.com/questions/47906671/python-retry-with-tenacity-disable-wait-for-unittest
This might be helpful to override the waiting behavior in retry
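For reference, a minimal self-contained sketch of that idea, assuming the retried function is wrapped with tenacity's @retry decorator (the actual _wait_for_lock in this PR may build its retry policy differently):

```python
import tenacity


@tenacity.retry(
    wait=tenacity.wait_exponential(multiplier=2, min=2, max=30),
    stop=tenacity.stop_after_attempt(5),
)
def wait_for_lock() -> None:
    ...


# In a unit test, drop the waiting so the retries run instantly.
wait_for_lock.retry.wait = tenacity.wait_none()  # type: ignore[attr-defined]
```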
I have added a new unit test, test_hive_wait_for_lock, in test_hive.py that uses mocked lock and check_lock.
But I still keep an integration test against the real hive metastore to simulate real-world cases.
To reduce test latency, it uses a fine-grained sleep time instead.
WDYT?
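A rough sketch of the mocked-client approach, using a hypothetical polling helper rather than the real HiveCatalog internals (the real test exercises _wait_for_lock with the thrift response types):

```python
from unittest.mock import MagicMock


def poll_until_acquired(client, lock_id: int, max_checks: int) -> bool:
    # Hypothetical helper standing in for the check_lock loop inside _wait_for_lock.
    for _ in range(max_checks):
        if client.check_lock(lock_id).state == "ACQUIRED":
            return True
    return False


def test_lock_acquired_after_two_waiting_responses() -> None:
    client = MagicMock()
    # First two checks report WAITING, the third reports ACQUIRED.
    client.check_lock.side_effect = [
        MagicMock(state="WAITING"),
        MagicMock(state="WAITING"),
        MagicMock(state="ACQUIRED"),
    ]
    assert poll_until_acquired(client, lock_id=42, max_checks=5)
    assert client.check_lock.call_count == 3
```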
HonahX
left a comment
Thanks @frankliee for working on this and @kevinjqliu for reviewing!
pyiceberg/catalog/hive.py
Outdated
```python
acquire_lock_timeout = (
    properties.get(TableProperties.HIVE_ACQUIRE_LOCK_TIMEOUT_MS, TableProperties.HIVE_ACQUIRE_LOCK_TIMEOUT_MS_DEFAULT)
) / 1000.0
lock_check_min_wait_time = (
    properties.get(TableProperties.HIVE_LOCK_CHECK_MIN_WAIT_MS, TableProperties.HIVE_LOCK_CHECK_MIN_WAIT_MS_DEFAULT)
) / 1000.0
lock_check_max_wait_time = (
    properties.get(TableProperties.HIVE_LOCK_CHECK_MAX_WAIT_MS, TableProperties.HIVE_LOCK_CHECK_MAX_WAIT_MS_DEFAULT)
) / 1000.0
```
We could use PropertyUtil.property_as_int to get the values. This util method raises a clear error message when it fails to parse the property.
BTW: you would probably need # type: ignore at the end of each PropertyUtil.property_as_int call to pass the linter.
I have added property_as_float to allow fine-grained settings.
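A minimal sketch of how such a helper would be used; the PropertyUtil import path is an assumption based on where it lived at the time of this PR, and the key name and default are illustrative:

```python
from pyiceberg.table import PropertyUtil  # assumed location at the time of this PR

LOCK_CHECK_MIN_WAIT_TIME = "lock_check_min_wait_time"  # illustrative key
DEFAULT_LOCK_CHECK_MIN_WAIT_TIME = 2.0  # seconds

properties = {LOCK_CHECK_MIN_WAIT_TIME: "0.25"}
min_wait = PropertyUtil.property_as_float(properties, LOCK_CHECK_MIN_WAIT_TIME, DEFAULT_LOCK_CHECK_MIN_WAIT_TIME)
assert min_wait == 0.25
```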
HonahX
left a comment
LGTM!
pyiceberg/catalog/hive.py
Outdated
```python
LOCK_CHECK_MIN_WAIT_TIME = "lock_check_min_wait_time"
LOCK_CHECK_MAX_WAIT_TIME = "lock_check_max_wait_time"
LOCK_CHECK_RETRIES = "lock_check_retries"
DEFAULT_LOCK_CHECK_MIN_WAIT_TIME = 2
DEFAULT_LOCK_CHECK_MAX_WAIT_TIME = 30
DEFAULT_LOCK_CHECK_RETRIES = 10
```
A good idea, I have updated these default values.
Looks like all the review comments are addressed. I'll merge this. Thanks everyone!

In the current hive catalog implementation, locking is only attempted once.
If another task has already locked the target table, the locking will inevitably fail.
Since Iceberg-Java has implemented retry on org.apache.iceberg.hive.MetastoreLock, we can refer to it to improve Iceberg-python.

This PR adds retry and wait logic for the hive catalog, with the following modifications:
- Add retry and wait logic to _wait_for_lock on the hive catalog
- Add a WaitingForLockException
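A minimal sketch of the retry shape described above, using tenacity; the values are hard-coded here for illustration (the merged code derives them from the catalog properties), and the real lock state comes from the thrift LockState enum rather than a string:

```python
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


class WaitingForLockException(Exception):
    """Raised while the Hive lock is still in the WAITING state."""


@retry(
    retry=retry_if_exception_type(WaitingForLockException),
    wait=wait_exponential(multiplier=2, min=2, max=30),
    stop=stop_after_attempt(4),
    reraise=True,
)
def check_lock(client, lock_id: int):
    # Ask the metastore for the lock state; keep retrying while it is still WAITING.
    response = client.check_lock(lock_id)
    if response.state == "WAITING":
        raise WaitingForLockException(f"Lock {lock_id} is still waiting")
    return response
```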