Skip to content

Conversation

@wzhfy
Copy link
Contributor

@wzhfy wzhfy commented Oct 30, 2017

What changes were proposed in this pull request?

Before Spark 2.x, synchronization for metastore access was protected at line229 in ClientWrapper (now it's at line203 in HiveClientWrapper ).

After Spark 2.x, HiveExternalCatalog was introduced by SPARK-13080, where an extra level of synchronization was added at line95. That is, now we have two levels of synchronization: one is HiveExternalCatalog and the other is IsolatedClientLoader in HiveClientImpl.

But since both HiveExternalCatalog and IsolatedClientLoader are shared among all spark sessions, the extra level of synchronization in HiveExternalCatalog is redundant, thus can be removed.

How was this patch tested?

Manual test and existing tests.

@SparkQA
Copy link

SparkQA commented Oct 30, 2017

Test build #83202 has finished for PR 19605 at commit 072b27d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor Author

wzhfy commented Oct 30, 2017

cc @cloud-fan @rxin @gatorsmile

* `clientLoader` in the `retryLocked` method.
*/
private def withClient[T](body: => T): T = synchronized {
private def withClient[T](body: => T): T = {
Copy link
Member

@gatorsmile gatorsmile Oct 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you check the callers of withClient, you can find many callers conduct multiple client-related operations in a single body. Removing this lock might cause some concurrency issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then can we remove the synchronization of clientLoader in HiveClientImpl? If we synchronize withClient, then there's no need to sync each operation in body.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but please make sure we only use the hive client here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check whether all the Hive client calls are through the Hive External Catalog. For example, addJar is an exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through all methods in HiveClient having synchronization (except addJar):

  • getState is used only in test.
  • setOut, setInfo and setError are only used in SparkSQLEnv.init().
  • all other methods are called through HiveExternalCatalog.

So it seems addJar is the only exception.

To make addJar also go throught HiveExternalCatalog, we can pass externalCatalog instead of client at line46 in HiveSessionStateBuilder. But I don't know why we need to call newSession() at line45, where a new HiveClientImpl instance is created, with the same class loader and Hive client.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last year, we had a discussion about whether addJar should be moved to HiveExternalCatalog. See #14883

@gatorsmile
Copy link
Member

gatorsmile commented Oct 30, 2017

@gatorsmile
Copy link
Member

@wzhfy Let us close this PR first. We can restart the discussion when we exposing ExternalCatalog APIs. Thanks!

@gatorsmile gatorsmile closed this Dec 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants