Conversation

@LantaoJin (Contributor)

What changes were proposed in this pull request?

The new Spark ThriftServer SparkGetTablesOperation implemented in #22794 does a catalog.getTableMetadata request for every table. This can get very slow for large schemas (~50ms per table with an external Hive metastore).
Hive ThriftServer GetTablesOperation uses HiveMetastoreClient.getTableObjectsByName to get table information in bulk, but we don't expose that through our APIs that go through Hive -> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> SessionCatalog.

If we added and exposed getTableObjectsByName through our catalog APIs, we could resolve that performance problem in SparkGetTablesOperation.
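To make the shape of the change concrete, here is a minimal, self-contained sketch of the per-table vs. bulk lookup pattern. The toy MetastoreClient, its data, and the round-trip counter are illustrations only, not Spark code; the real change threads getTableObjectsByName through HiveClientImpl -> HiveExternalCatalog -> SessionCatalog as described above.

```scala
// Toy model: each client method call stands in for one metastore RPC.
case class CatalogTable(database: String, name: String)

class MetastoreClient(tables: Map[String, CatalogTable]) {
  var roundTrips = 0 // counts simulated metastore RPCs

  // Per-table lookup: one RPC per name (the slow path in SparkGetTablesOperation).
  def getTable(name: String): Option[CatalogTable] = {
    roundTrips += 1
    tables.get(name)
  }

  // Bulk lookup: one RPC for the whole batch; missing names are skipped.
  def getTableObjectsByName(names: Seq[String]): Seq[CatalogTable] = {
    roundTrips += 1
    names.flatMap(tables.get)
  }
}

val names = Seq("t1", "t2", "t3")
val client = new MetastoreClient(
  names.map(n => n -> CatalogTable("default", n)).toMap)

names.flatMap(client.getTable)       // three RPCs, one per table
client.getTableObjectsByName(names)  // a single RPC for all three
println(client.roundTrips)           // prints 4
```

With an external Hive metastore costing ~50ms per round trip, collapsing N per-table calls into one bulk call is where the win in SparkGetTablesOperation comes from.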

How was this patch tested?

Added unit tests.

@LantaoJin (Contributor Author)

Hi @wangyum @gatorsmile, could you take a look when you have a chance?

@juliuszsompolski (Contributor)

Thank you @LantaoJin for doing that!
I looked through the code and it looks good to me, but I am not really familiar with the catalog code, so I will refrain from a formal review.

Member

What if the names come from different databases? For example:

Seq("db1.table1", "db2.table1")

Contributor Author

I considered this case. Could it happen in real-world use cases? Good catch, though; it should be handled in the code. How about throwing an exception?

Member

Is this change necessary?

Contributor Author

This view was not cleaned up when the UT failed in my environment, so I think adding this makes the test case more stable.

Member

Weird. If the view is not dropped, why do the tables not exist in your environment?

@wangyum (Member)

wangyum commented Jun 3, 2019

@LantaoJin Could you post some benchmark result?

Member

Could you submit a refactor PR first? That would reduce the code changes made by this PR.

@LantaoJin (Contributor Author) commented Jun 3, 2019

I would like to. Do I need to create a new issue?

Member

Nope. You can reuse this issue.

Contributor Author

See #24803

Member

This is just a special case. We need more tests to ensure that all the Hive metastore versions can support it correctly. Negative cases are also needed. :-)

Contributor Author

Sorry, what do you mean? Hasn't it been covered from 0.12 to 3.1?

Member

This only covers a single table. How about multiple tables? Or a mix with non-existent tables? With illegal table names? Or an empty seq? And so on. :-)

@gatorsmile (Member)

Thank you for your fast work! I will review the details after you address the above comments.

@gatorsmile (Member)

ok to test

@SparkQA

SparkQA commented Jun 3, 2019

Test build #106115 has finished for PR 24774 at commit 4f97fdd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 5, 2019

Test build #106184 has finished for PR 24774 at commit 7d69a50.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 5, 2019

Test build #106199 has finished for PR 24774 at commit 801a007.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Let us include the input name list in the error message.

@LantaoJin (Contributor Author)

LantaoJin commented Jun 6, 2019

After rebasing on master, the commit history contained many unrelated commits. I had to create a new local branch, cherry-pick the commits from the old local branch, and then force-push it to the old remote branch to clean them up.

@LantaoJin (Contributor Author)

Looks good now

@gatorsmile (Member) left a comment

Thanks for your work!

Member

one same database -> the same database.

val dbs = names.map(_.database.getOrElse(getCurrentDatabase))
if (dbs.distinct.size != 1) {
val tables = names.map(name => formatTableName(name.table))
dbs.zip(tables).map { case (d, t) => QualifiedTableName(d, t)}
Member

?

dbs.zip(tables).map { case (d, t) => QualifiedTableName(d, t)}
throw new AnalysisException(
s"Only the tables/views belong to one same database can be retrieved. Querying " +
s"tables/views are ${dbs.zip(tables).map { case (d, t) => QualifiedTableName(d, t)}}"
Member

This should be replaced by a variable name.
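Putting these review suggestions together, the checked lookup might end up looking roughly like the following sketch. QualifiedTableName and AnalysisException are simplified stand-ins for the corresponding Spark classes, and the exact message wording is an assumption.

```scala
// Simplified stand-ins for the corresponding Spark classes.
case class QualifiedTableName(database: String, table: String)
class AnalysisException(message: String) extends Exception(message)

// Rejects a lookup that spans more than one database, with the duplicated
// zip/map expression pulled out into a single value reused by the message.
def requireSingleDatabase(dbs: Seq[String], tables: Seq[String]): Unit = {
  val qualifiedNames = dbs.zip(tables).map { case (d, t) => QualifiedTableName(d, t) }
  if (dbs.distinct.size != 1) {
    throw new AnalysisException(
      "Only the tables/views belonging to the same database can be retrieved. " +
        s"Querying tables/views are $qualifiedNames")
  }
}

requireSingleDatabase(Seq("db1", "db1"), Seq("t1", "t2"))  // passes silently
```

This addresses both points above: mixed-database input like Seq("db1.table1", "db2.table1") fails fast, and the error message names the offending tables without repeating the expression.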

}

test("get tables by name when some tables do not exists") {
assert(newBasicCatalog().getTablesByName("db2", Seq("tbl1", "tblnotexist"))
Member

See validNameFormat. Add a test case where the seq of table names contains an invalid name.

Contributor Author

added


/** Returns the metadata for the specified table or None if it doesn't exist. */
def getTableOption(dbName: String, tableName: String): Option[CatalogTable]

def getTablesByName(dbName: String, tableNames: Seq[String]): Seq[CatalogTable]
Member

Add a comment to describe the function?

Contributor Author

added
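For example, the added description might read along these lines. The Scaladoc wording and the stub trait below are assumptions for illustration, not text taken from the PR.

```scala
// Minimal stand-in for Spark's CatalogTable.
case class CatalogTable(database: String, name: String)

trait HiveClient {
  /** Returns the metadata for the specified table, or None if it doesn't exist. */
  def getTableOption(dbName: String, tableName: String): Option[CatalogTable]

  /**
   * Returns the metadata of the tables named in `tableNames` from database
   * `dbName`, fetched in one bulk metastore call. Names that do not resolve
   * to an existing table are dropped from the result rather than failing.
   */
  def getTablesByName(dbName: String, tableNames: Seq[String]): Seq[CatalogTable]
}
```

Documenting the skip-missing-names behavior matters because callers cannot assume the result has the same length as the input.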

}

test(s"$version: getTablesByName when multiple tables") {
assert(client.getTablesByName("default", Seq("src", "temporary"))
Member

Also try the invalid names here.

Contributor Author

added

// of type "array<string>". This happens when the table is created using
// an earlier version of Hive.
if (classOf[MetadataTypedColumnsetSerDe].getName
== tTable.getSd.getSerdeInfo.getSerializationLib
Member

4-space indent

The == needs to be moved up to line 1139.

if (!(HiveTableType.VIRTUAL_VIEW.toString == tTable.getTableType)) {
// Fix the non-printable chars
val parameters: JMap[String, String] = tTable.getSd.getParameters
val sf: String = parameters.get(serdeConstants.SERIALIZATION_FORMAT)

@SparkQA

SparkQA commented Jun 6, 2019

Test build #106225 has finished for PR 24774 at commit fb7760c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 6, 2019

Test build #106228 has finished for PR 24774 at commit 873644d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member)

wangyum commented Jun 10, 2019

I did a simple benchmark in our production environment (the default database has 1626 tables):

cat <<'EOF' > SPARK-27899.scala
def benchmark(func: () => Unit): Long = {
  val start = System.currentTimeMillis()
  for(i <- 0 until 2) { func() }
  val end = System.currentTimeMillis()
  end - start
}

def default(): Unit = {
  val list = new java.util.ArrayList[Array[AnyRef]]()
  val catalog = spark.sessionState.catalog
  catalog.listTables("default").foreach { tableIdentifier =>
    val catalogTable = catalog.getTableMetadata(tableIdentifier)
    val rowData = Array[AnyRef](
      "",
      catalogTable.database,
      catalogTable.identifier.table,
      catalogTable.tableType,
      catalogTable.comment.getOrElse(""))
    list.add(rowData)
  }
}

def spark_27899(): Unit = {
  val list = new java.util.ArrayList[Array[AnyRef]]()
  val catalog = spark.sessionState.catalog
  catalog.getTablesByName(catalog.listTables("default")).foreach { catalogTable =>
    val rowData = Array[AnyRef](
      "",
      catalogTable.database,
      catalogTable.identifier.table,
      catalogTable.tableType,
      catalogTable.comment.getOrElse(""))
    list.add(rowData)
  }
}

val defaultTimeToken = benchmark(() => default)
val spark27899TimeToken = benchmark(() => spark_27899)
println(s"Default time token: $defaultTimeToken")
println(s"SPARK-27899 time token: $spark27899TimeToken")
EOF

Benchmark result:

Default time token: 317983
SPARK-27899 time token: 58977

@LantaoJin (Contributor Author)

#24774 (comment)
It seems it was my IDE's problem. Cleaned and gone. I will remove this "drop view".

@SparkQA

SparkQA commented Jun 10, 2019

Test build #106349 has finished for PR 24774 at commit d233146.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

LGTM

Thanks! Merged to master. cc @juliuszsompolski

emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
## What changes were proposed in this pull request?

This is a part of apache#24774, to reduce the code changes made by that.

## How was this patch tested?

Existing UTs.

Closes apache#24803 from LantaoJin/SPARK-27899_refactor.

Authored-by: LantaoJin <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
…ilable in ExternalCatalog/SessionCatalog API

## What changes were proposed in this pull request?

The new Spark ThriftServer SparkGetTablesOperation implemented in apache#22794 does a catalog.getTableMetadata request for every table. This can get very slow for large schemas (~50ms per table with an external Hive metastore).
Hive ThriftServer GetTablesOperation uses HiveMetastoreClient.getTableObjectsByName to get table information in bulk, but we don't expose that through our APIs that go through Hive -> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> SessionCatalog.

If we added and exposed getTableObjectsByName through our catalog APIs, we could resolve that performance problem in SparkGetTablesOperation.

## How was this patch tested?

Add UT

Closes apache#24774 from LantaoJin/SPARK-27899.

Authored-by: LantaoJin <[email protected]>
Signed-off-by: gatorsmile <[email protected]>