Conversation

@LantaoJin (Contributor)

What changes were proposed in this pull request?

The new Spark ThriftServer SparkGetTablesOperation implemented in #22794 does a catalog.getTableMetadata request for every table. This can get very slow for large schemas (~50ms per table with an external Hive metastore).
Hive ThriftServer GetTablesOperation uses HiveMetastoreClient.getTableObjectsByName to get table information in bulk, but we don't expose that through our APIs that go through Hive -> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> SessionCatalog.

If we added and exposed getTableObjectsByName through our catalog APIs, we could resolve that performance problem in SparkGetTablesOperation.
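To make the shape of the change concrete, here is a minimal, self-contained sketch of the per-table vs. bulk lookup pattern. The toy MetastoreClient, its data, and the round-trip counter are illustrations only, not Spark code; the real change threads getTableObjectsByName through HiveClientImpl -> HiveExternalCatalog -> SessionCatalog as described above.

```scala
// Toy model: each client method call stands in for one metastore RPC.
case class CatalogTable(database: String, name: String)

class MetastoreClient(tables: Map[String, CatalogTable]) {
  var roundTrips = 0 // counts simulated metastore RPCs

  // Per-table lookup: one RPC per name (the slow path in SparkGetTablesOperation).
  def getTable(name: String): Option[CatalogTable] = {
    roundTrips += 1
    tables.get(name)
  }

  // Bulk lookup: one RPC for the whole batch; missing names are skipped.
  def getTableObjectsByName(names: Seq[String]): Seq[CatalogTable] = {
    roundTrips += 1
    names.flatMap(tables.get)
  }
}

val names = Seq("t1", "t2", "t3")
val client = new MetastoreClient(
  names.map(n => n -> CatalogTable("default", n)).toMap)

names.flatMap(client.getTable)       // three RPCs, one per table
client.getTableObjectsByName(names)  // a single RPC for all three
println(client.roundTrips)           // prints 4
```

With an external Hive metastore costing ~50ms per round trip, collapsing N per-table calls into one bulk call is where the win in SparkGetTablesOperation comes from.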

How was this patch tested?

Added unit tests.

@LantaoJin (Contributor Author)

Hi @wangyum @gatorsmile, could you take a look when you have a chance?

@juliuszsompolski (Contributor)

Thank you @LantaoJin for doing that!
I looked through the code and it looks good to me, but I am not really familiar with the catalog code, so I will refrain from a formal review.

Member

What if the names come from different databases? For example:

Seq("db1.table1", "db2.table1")

Contributor Author

I considered this case. Could it happen in real-world use cases? Good catch, though; it should be handled in the code. How about throwing an exception?

Member

Is this change necessary?

Contributor Author

This view was not cleaned up when the UT failed in my environment, so I think adding this makes the test case more stable.

Member

Weird. If the view is not dropped, why do the tables not exist in your environment?

@wangyum (Member)

wangyum commented Jun 3, 2019

@LantaoJin Could you post some benchmark result?

Member

Could you submit a refactor PR first? That would reduce the code changes made by this PR.

@LantaoJin (Contributor Author) commented Jun 3, 2019

I would like to. Do I need to create a new issue?

Member

Nope. You can reuse this issue.

Contributor Author

See #24803

Member

This is just a special case. We need more tests to ensure that all the Hive metastore versions can support it correctly. Negative cases are also needed. :-)

Contributor Author

Sorry, what do you mean? Hasn't it been covered from 0.12 to 3.1?

Member

This only covers a single table. How about multiple tables? Or a mix with non-existent tables? With illegal table names? Or an empty seq? And so on. :-)

@gatorsmile (Member)

Thank you for your fast work! I will review the details after you address the above comments.

@gatorsmile (Member)

ok to test

@SparkQA

SparkQA commented Jun 3, 2019

Test build #106115 has finished for PR 24774 at commit 4f97fdd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 5, 2019

Test build #106184 has finished for PR 24774 at commit 7d69a50.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 5, 2019

Test build #106199 has finished for PR 24774 at commit 801a007.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Let us include the input name list in the error message.

@LantaoJin (Contributor Author)

LantaoJin commented Jun 6, 2019

After rebasing on master, the commit history contained many unrelated commits. I had to create a new local branch, cherry-pick the commits from the old local branch, and then force-push it to the old remote branch to clean them up.

@LantaoJin (Contributor Author)

Looks good now

@gatorsmile (Member) left a comment

Thanks for your work!

Member

one same database -> the same database.

val dbs = names.map(_.database.getOrElse(getCurrentDatabase))
if (dbs.distinct.size != 1) {
val tables = names.map(name => formatTableName(name.table))
dbs.zip(tables).map { case (d, t) => QualifiedTableName(d, t)}
Member

?

dbs.zip(tables).map { case (d, t) => QualifiedTableName(d, t)}
throw new AnalysisException(
s"Only the tables/views belong to one same database can be retrieved. Querying " +
s"tables/views are ${dbs.zip(tables).map { case (d, t) => QualifiedTableName(d, t)}}"
Member

This should be replaced by a variable name.
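Putting these review suggestions together, the checked lookup might end up looking roughly like the following sketch. QualifiedTableName and AnalysisException are simplified stand-ins for the corresponding Spark classes, and the exact message wording is an assumption.

```scala
// Simplified stand-ins for the corresponding Spark classes.
case class QualifiedTableName(database: String, table: String)
class AnalysisException(message: String) extends Exception(message)

// Rejects a lookup that spans more than one database, with the duplicated
// zip/map expression pulled out into a single value reused by the message.
def requireSingleDatabase(dbs: Seq[String], tables: Seq[String]): Unit = {
  val qualifiedNames = dbs.zip(tables).map { case (d, t) => QualifiedTableName(d, t) }
  if (dbs.distinct.size != 1) {
    throw new AnalysisException(
      "Only the tables/views belonging to the same database can be retrieved. " +
        s"Querying tables/views are $qualifiedNames")
  }
}

requireSingleDatabase(Seq("db1", "db1"), Seq("t1", "t2"))  // passes silently
```

This addresses both points above: mixed-database input like Seq("db1.table1", "db2.table1") fails fast, and the error message names the offending tables without repeating the expression.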

}

test("get tables by name when some tables do not exists") {
assert(newBasicCatalog().getTablesByName("db2", Seq("tbl1", "tblnotexist"))
Member

See validNameFormat. Add a test case where the seq of table names contains an invalid name.

Contributor Author

added


/** Returns the metadata for the specified table or None if it doesn't exist. */
def getTableOption(dbName: String, tableName: String): Option[CatalogTable]

def getTablesByName(dbName: String, tableNames: Seq[String]): Seq[CatalogTable]
Member

Add a comment to describe the function?

Contributor Author

added
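For example, the added description might read along these lines. The Scaladoc wording and the stub trait below are assumptions for illustration, not text taken from the PR.

```scala
// Minimal stand-in for Spark's CatalogTable.
case class CatalogTable(database: String, name: String)

trait HiveClient {
  /** Returns the metadata for the specified table, or None if it doesn't exist. */
  def getTableOption(dbName: String, tableName: String): Option[CatalogTable]

  /**
   * Returns the metadata of the tables named in `tableNames` from database
   * `dbName`, fetched in one bulk metastore call. Names that do not resolve
   * to an existing table are dropped from the result rather than failing.
   */
  def getTablesByName(dbName: String, tableNames: Seq[String]): Seq[CatalogTable]
}
```

Documenting the skip-missing-names behavior matters because callers cannot assume the result has the same length as the input.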

}

test(s"$version: getTablesByName when multiple tables") {
assert(client.getTablesByName("default", Seq("src", "temporary"))
Member

Also try the invalid names here.

Contributor Author

added

// of type "array<string>". This happens when the table is created using
// an earlier version of Hive.
if (classOf[MetadataTypedColumnsetSerDe].getName
== tTable.getSd.getSerdeInfo.getSerializationLib
Member

4-space indent

The == needs to be moved up to line 1139.

if (!(HiveTableType.VIRTUAL_VIEW.toString == tTable.getTableType)) {
// Fix the non-printable chars
val parameters: JMap[String, String] = tTable.getSd.getParameters
val sf: String = parameters.get(serdeConstants.SERIALIZATION_FORMAT)

@SparkQA

SparkQA commented Jun 6, 2019

Test build #106225 has finished for PR 24774 at commit fb7760c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 6, 2019

Test build #106228 has finished for PR 24774 at commit 873644d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member)

wangyum commented Jun 10, 2019

I did a simple benchmark in our production environment (the default database has 1626 tables):

cat <<'EOF' > SPARK-27899.scala
def benchmark(func: () => Unit): Long = {
  val start = System.currentTimeMillis()
  for(i <- 0 until 2) { func() }
  val end = System.currentTimeMillis()
  end - start
}

def default(): Unit = {
  val list = new java.util.ArrayList[Array[AnyRef]]()
  val catalog = spark.sessionState.catalog
  catalog.listTables("default").foreach { tableIdentifier =>
    val catalogTable = catalog.getTableMetadata(tableIdentifier)
    val rowData = Array[AnyRef](
      "",
      catalogTable.database,
      catalogTable.identifier.table,
      catalogTable.tableType,
      catalogTable.comment.getOrElse(""))
    list.add(rowData)
  }
}

def spark_27899(): Unit = {
  val list = new java.util.ArrayList[Array[AnyRef]]()
  val catalog = spark.sessionState.catalog
  catalog.getTablesByName(catalog.listTables("default")).foreach { catalogTable =>
    val rowData = Array[AnyRef](
      "",
      catalogTable.database,
      catalogTable.identifier.table,
      catalogTable.tableType,
      catalogTable.comment.getOrElse(""))
    list.add(rowData)
  }
}

val defaultTimeToken = benchmark(() => default)
val spark27899TimeToken = benchmark(() => spark_27899)
println(s"Default time token: $defaultTimeToken")
println(s"SPARK-27899 time token: $spark27899TimeToken")
EOF

Benchmark result:

Default time token: 317983
SPARK-27899 time token: 58977

@LantaoJin (Contributor Author)

#24774 (comment)
It seems it was my IDE's problem. Cleaned and gone. I will remove this "drop view".

@SparkQA

SparkQA commented Jun 10, 2019

Test build #106349 has finished for PR 24774 at commit d233146.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

LGTM

Thanks! Merged to master. cc @juliuszsompolski

emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
## What changes were proposed in this pull request?

This is a part of apache#24774, to reduce the code changes made by that.

## How was this patch tested?

Existing UTs.

Closes apache#24803 from LantaoJin/SPARK-27899_refactor.

Authored-by: LantaoJin <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
…ilable in ExternalCatalog/SessionCatalog API

## What changes were proposed in this pull request?

The new Spark ThriftServer SparkGetTablesOperation implemented in apache#22794 does a catalog.getTableMetadata request for every table. This can get very slow for large schemas (~50ms per table with an external Hive metastore).
Hive ThriftServer GetTablesOperation uses HiveMetastoreClient.getTableObjectsByName to get table information in bulk, but we don't expose that through our APIs that go through Hive -> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> SessionCatalog.

If we added and exposed getTableObjectsByName through our catalog APIs, we could resolve that performance problem in SparkGetTablesOperation.

## How was this patch tested?

Add UT

Closes apache#24774 from LantaoJin/SPARK-27899.

Authored-by: LantaoJin <[email protected]>
Signed-off-by: gatorsmile <[email protected]>