
Conversation

@wangyum (Member) commented Oct 22, 2018

What changes were proposed in this pull request?

SQL client tools such as DBeaver use GetTablesOperation in their Navigator to obtain table names.

This request should be served by metadataHive, but it currently goes through executionHive.

This PR implements Spark's own GetTablesOperation so that the request is served by metadataHive.
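
For context, the Navigator request boils down to a standard JDBC metadata call against the Thrift server. A minimal sketch of such a call (the URL and credentials are placeholders, and the Hive JDBC driver is assumed to be on the classpath):

```scala
import java.sql.DriverManager

// Placeholder connection details; adjust host, port, database and credentials.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
try {
  // This is essentially what DBeaver's Navigator issues: DatabaseMetaData.getTables,
  // which the Thrift server answers via its GetTablesOperation.
  val tables = conn.getMetaData.getTables(null, "%", "%", null)
  while (tables.next()) {
    println(s"${tables.getString("TABLE_SCHEM")}.${tables.getString("TABLE_NAME")}")
  }
} finally {
  conn.close()
}
```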

How was this patch tested?

Unit tests and manual tests.


@wangyum wangyum changed the title [WIP][SPARK-24570] Fix SQL client tools cannot show tables [WIP][SPARK-24570][SQL] Fix SQL client tools cannot show tables Oct 22, 2018
SparkQA commented Oct 22, 2018

Test build #97807 has finished for PR 22794 at commit 9f528bc.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 22, 2018

Test build #97810 has finished for PR 22794 at commit 9f528bc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91 (Contributor) left a comment

IIUC, this means we are returning the tables by querying the Hive metastore directly, which may return partial/incomplete/incorrect information (consider, for instance, temporary views). Am I missing something?
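
A minimal illustration of the concern (the view name below is made up): a temporary view exists only in Spark's session catalog, so answering GetTables straight from the metastore would omit it.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// The temporary view lives only in Spark's session catalog, never in the
// Hive metastore, so a metastore-only GetTables would not list it.
spark.range(10).createOrReplaceTempView("tmp_numbers")
spark.sql("SHOW TABLES").show()  // includes tmp_numbers with isTemporary = true
```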

SparkQA commented Oct 22, 2018

Test build #97839 has finished for PR 22794 at commit 9f528bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 23, 2018

Test build #97912 has finished for PR 22794 at commit 2939178.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 24, 2018

Test build #97968 has finished for PR 22794 at commit 8449cdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum changed the title [WIP][SPARK-24570][SQL] Fix SQL client tools cannot show tables [SPARK-24570][SQL] Implement Spark own GetTablesOperation to fix SQL client tools cannot show tables Oct 24, 2018
@wangyum (Member, Author) commented Oct 24, 2018

Thanks @mgaido91. It now uses sqlContext.sessionState.catalog to obtain table names.
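
Roughly, the operation now walks Spark's own SessionCatalog instead of asking Hive directly. A minimal sketch of that lookup (sessionState is an internal, unstable API, and the "*" wildcards mirror the converted JDBC "%" patterns):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val catalog = spark.sessionState.catalog  // Spark's SessionCatalog

// "*" corresponds to the JDBC "%" pattern after convertIdentifierPattern.
catalog.listDatabases("*").foreach { db =>
  catalog.listTables(db, "*").foreach { ident =>
    // Unlike a raw metastore query, this also surfaces temporary views.
    println(s"$db.${ident.table}")
  }
}
```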


SparkQA commented Oct 24, 2018

Test build #97980 has finished for PR 22794 at commit e9a2a93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

cc @srinathshankar

import org.apache.spark.sql.catalyst.catalog.CatalogTableType._
import org.apache.spark.sql.catalyst.catalog.SessionCatalog

private[hive] class SparkGetTablesOperation(
@gatorsmile (Member) commented Oct 24, 2018

/**
 * @param schemaName database name, null or a concrete database name
 * @param tableName table name pattern
 * @param tableTypes list of allowed table types
 * ................
 */

Member Author:

Fixed.

}
}

test("SPARK-24196: SQL client(DBeaver) can't show tables") {
Member:

Create a SparkMetadataOperationSuite?

Member Author:

Fixed.

schemaName: String,
tableName: String,
tableTypes: JList[String])
(sqlContext: SQLContext, sessionToActivePool: JMap[SessionHandle, String])
Member:

Just wondering why we need sessionToActivePool?

Member:

Just add sqlContext as a class parameter.

Member Author:

Fixed.

SparkQA commented Oct 25, 2018

Test build #98002 has finished for PR 22794 at commit 80a8d21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91 (Contributor):

@wangyum the point of my previous comment is that the current approach is not good IMHO. Now you have fixed it for tables, but the same is true for functions: currently you'd be returning Hive's functions, not Spark's, right? IMHO you should override all of them...

@wangyum (Member, Author) commented Oct 25, 2018

@mgaido91 You are right. But maybe overriding only newExecuteStatementOperation, newGetSchemasOperation, newGetTablesOperation, newGetTableTypesOperation, newGetColumnsOperation and newGetFunctionsOperation is enough.
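
For the GetTables case, the override pattern in SparkSQLOperationManager looks roughly like the sketch below (abridged and illustrative; the constructor order and the registration details should be treated as assumptions rather than the final code):

```scala
// Abridged sketch of SparkSQLOperationManager.newGetTablesOperation.
override def newGetTablesOperation(
    parentSession: HiveSession,
    catalogName: String,
    schemaName: String,
    tableName: String,
    tableTypes: JList[String]): MetadataOperation = synchronized {
  val sqlContext = sessionToContexts.get(parentSession.getSessionHandle)
  require(sqlContext != null,
    s"Session handle: ${parentSession.getSessionHandle} has not been initialized or had already closed.")
  val operation = new SparkGetTablesOperation(
    sqlContext, parentSession, catalogName, schemaName, tableName, tableTypes)
  // Register the operation with the parent manager. In Spark this map is
  // obtained from Hive's OperationManager via reflection; shown here directly
  // only for illustration.
  handleToOperation.put(operation.getHandle, operation)
  operation
}
```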

@mgaido91 (Contributor):

I am not sure what Hive 1.2 exposes, but we might have more; it needs to be checked. Anyway, yes, those have to be overridden for sure. When I referred to the current approach, I meant the one described in the PR description, which seems to be outdated though. Could you please update it? Thanks.

}
}

test("Spark's own GetTablesOperation(SparkGetTablesOperation)") {
Member Author:

# Conflicts:
#	sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/SparkSQLOperationManager.scala
#	sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/SparkMetadataOperationSuite.scala
SparkQA commented Jan 8, 2019

Test build #100921 has finished for PR 22794 at commit 4f5cd27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

tableTypes: JList[String]): MetadataOperation = synchronized {
val sqlContext = sessionToContexts.get(parentSession.getSessionHandle)
require(sqlContext != null, s"Session handle: ${parentSession.getSessionHandle} has not been" +
s" initialized or had already closed.")
Contributor:

nit: remove the `s` prefix from the second string literal (it contains no interpolation).
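
That is, the call would become:

```scala
require(sqlContext != null, s"Session handle: ${parentSession.getSessionHandle} has not been" +
  " initialized or had already closed.")
```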

SparkQA commented Jan 8, 2019

Test build #100922 has finished for PR 22794 at commit fb7e0a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kk17 commented Jan 30, 2019

Just curious: is there any plan to merge this pull request?

@dongjoon-hyun (Member):

Retest this please.

SparkQA commented Feb 10, 2019

Test build #102135 has finished for PR 22794 at commit fb7e0a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tooptoop4 (Contributor):

merge please

@wangyum (Member, Author) commented Feb 18, 2019

Retest this please.

SparkQA commented Feb 18, 2019

Test build #102444 has finished for PR 22794 at commit fb7e0a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) left a comment

LGTM

Thanks! Merged to master.

@asfgit asfgit closed this in 7f53116 Feb 18, 2019
@tooptoop4 (Contributor):

This change caused errors in the Simba Spark JDBC driver; I can't run any normal queries in SQuirreL SQL anymore.

@gatorsmile (Member):

What is the error?

@wangyum (Member, Author) commented Feb 26, 2019

@tooptoop4 I think it works. Could you provide more info?

@tooptoop4 (Contributor):

Try it with SSL enabled in the JDBC connect string / Thrift server. I will send the stacktrace today.

@tooptoop4 (Contributor):

When I put in a spark-hive-thriftserver_2.11-3.0.0.jar cherry-picked without this PR, I don't get an error; with this PR I get the error below when running a simple `select * from sch.tbl limit 5` query, and the schemas don't show either. I am on SQuirreL 3.9.0 (Windows 10), and my Spark Thrift Server has TLS enabled with a JKS keystore and LDAP authentication. The driver is com.simba.spark.jdbc41.Driver from SimbaSparkJDBC41-1.1.7.1009.

java.sql.SQLException: [Simba]SparkJDBCDriver Error getting schema information: Metadata Initialization Error.
at com.simba.spark.hivecommon.dataengine.metadata.HiveJDBCCatalogSchemaOnlyMetadataSource.initSchemas(Unknown Source)
at com.simba.spark.hivecommon.dataengine.metadata.HiveJDBCCatalogSchemaOnlyMetadataSource.(Unknown Source)
at com.simba.spark.hivecommon.dataengine.HiveJDBCDataEngine.makeNewMetadataSource(Unknown Source)
at com.simba.spark.dsi.dataengine.impl.DSIDataEngine.makeNewMetadataResult(Unknown Source)
at com.simba.spark.hivecommon.dataengine.HiveJDBCDataEngine.makeNewMetadataResult(Unknown Source)
at com.simba.spark.jdbc.jdbc41.S41DatabaseMetaData.createMetaDataResult(Unknown Source)
at com.simba.spark.jdbc.common.SDatabaseMetaData.getSchemas(Unknown Source)
at net.sourceforge.squirrel_sql.fw.sql.SQLDatabaseMetaData.getSchemas(SQLDatabaseMetaData.java:278)
at net.sourceforge.squirrel_sql.client.session.SessionManager.areAllSchemasAllowed(SessionManager.java:639)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfoCache.getAllSchemaLoadInfos(SchemaInfoCache.java:129)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfoCache.getMatchingSchemaLoadInfos(SchemaInfoCache.java:163)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfoCache.getMatchingSchemaLoadInfos(SchemaInfoCache.java:156)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfo.privateLoadUDTs(SchemaInfo.java:664)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfo.loadUDTs(SchemaInfo.java:449)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfo.privateLoadAll(SchemaInfo.java:348)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfo.initialLoad(SchemaInfo.java:179)
at net.sourceforge.squirrel_sql.client.session.Session$1.run(Session.java:233)
at net.sourceforge.squirrel_sql.fw.util.TaskExecuter.run(TaskExecuter.java:82)
Caused by: com.simba.spark.support.exceptions.GeneralException: [Simba]SparkJDBCDriver Error getting schema information: Metadata Initialization Error.
... 18 more
Caused by: com.simba.spark.support.exceptions.GeneralException: HIVE_METADATA_SCHEMA_ERR
at com.simba.spark.hivecommon.api.ExtendedHS2Client.getSchemas(Unknown Source)
... 18 more
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection closed by remote host
at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:161)
at org.apache.thrift.transport.TSaslTransport.flush(TSaslTransport.java:471)
at org.apache.thrift.transport.TSaslClientTransport.flush(TSaslClientTransport.java:37)
at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:65)
at org.apache.hive.service.cli.thrift.TCLIService$Client.send_GetSchemas(TCLIService.java:291)
at com.simba.spark.hivecommon.api.HS2ClientWrapper.send_GetSchemas(Unknown Source)
at org.apache.hive.service.cli.thrift.TCLIService$Client.GetSchemas(TCLIService.java:283)
at com.simba.spark.hivecommon.api.HS2ClientWrapper.GetSchemas(Unknown Source)
at com.simba.spark.hivecommon.api.ExtendedHS2Client.getSchemas(Unknown Source)
at com.simba.spark.hivecommon.dataengine.metadata.HiveJDBCCatalogSchemaOnlyMetadataSource.initSchemas(Unknown Source)
at com.simba.spark.hivecommon.dataengine.metadata.HiveJDBCCatalogSchemaOnlyMetadataSource.(Unknown Source)
at com.simba.spark.hivecommon.dataengine.HiveJDBCDataEngine.makeNewMetadataSource(Unknown Source)
at com.simba.spark.dsi.dataengine.impl.DSIDataEngine.makeNewMetadataResult(Unknown Source)
at com.simba.spark.hivecommon.dataengine.HiveJDBCDataEngine.makeNewMetadataResult(Unknown Source)
at com.simba.spark.jdbc.jdbc41.S41DatabaseMetaData.createMetaDataResult(Unknown Source)
at com.simba.spark.jdbc.common.SDatabaseMetaData.getSchemas(Unknown Source)
at net.sourceforge.squirrel_sql.fw.sql.SQLDatabaseMetaData.getSchemas(SQLDatabaseMetaData.java:278)
at net.sourceforge.squirrel_sql.client.session.SessionManager.areAllSchemasAllowed(SessionManager.java:639)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfoCache.getAllSchemaLoadInfos(SchemaInfoCache.java:129)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfoCache.getMatchingSchemaLoadInfos(SchemaInfoCache.java:163)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfoCache.getMatchingSchemaLoadInfos(SchemaInfoCache.java:156)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfo.privateLoadUDTs(SchemaInfo.java:664)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfo.loadUDTs(SchemaInfo.java:449)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfo.privateLoadAll(SchemaInfo.java:348)
at net.sourceforge.squirrel_sql.client.session.schemainfo.SchemaInfo.initialLoad(SchemaInfo.java:179)
at net.sourceforge.squirrel_sql.client.session.Session$1.run(Session.java:233)
at net.sourceforge.squirrel_sql.fw.util.TaskExecuter.run(TaskExecuter.java:82)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Connection closed by remote host
at sun.security.ssl.SSLSocketImpl.checkWrite(Unknown Source)
at sun.security.ssl.AppOutputStream.write(Unknown Source)
at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
at java.io.BufferedOutputStream.flush(Unknown Source)
at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:159)
... 27 more

@HyukjinKwon (Member):

Can you please describe the explicit steps, with screenshots, one by one? Since this involves a third-party driver, let's be very sure about the reproducer and the output. Otherwise, no one can verify it.

@mgaido91 (Contributor):

We should check which version of Thrift the Simba driver is expecting. Spark is using a very old version of Hive's Thrift protocol; a mismatch could be the problem.

@wangyum (Member, Author) commented Mar 1, 2019

@tooptoop4 Could you give me your email address? I will contact you offline.

val tablePattern = convertIdentifierPattern(tableName, true)
matchingDbs.foreach { dbName =>
  catalog.listTables(dbName, tablePattern).foreach { tableIdentifier =>
    val catalogTable = catalog.getTableMetadata(tableIdentifier)
Contributor:

This can be very slow for big schemas. Calling getTableMetadata on every table will trigger 3 separate database calls to the metastore (requireDbExists, requireTableExists, and getTable) taking ~tens of ms for every table. So it can be tens of seconds for schemas with hundreds of tables.

The underlying Hive Thriftserver GetTables uses the MetastoreClient.getTableObjectsByName call (https://hive.apache.org/javadocs/r2.1.1/api/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.html#getTableObjectsByName-java.lang.String-java.util.List-) to bulk-list the tables, but we don't expose that through our SessionCatalog / ExternalCatalog / HiveClientImpl.

Would it be possible to thread that bulk getTableObjectsByName operation through our catalog APIs, to be able to retrieve the tables efficiently here? @wangyum @gatorsmile - what do you think?
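
A rough sketch of the difference being discussed. The per-table variant mirrors the merged code above; getTablesByName stands in for whatever bulk API would be exposed, so its name and signature are assumptions (SPARK-27899 later added such an API):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val catalog = spark.sessionState.catalog
val dbName = "default"   // illustrative database and pattern
val tablePattern = "*"

// Per-table lookup, as merged: each getTableMetadata costs ~3 metastore calls
// (requireDbExists, requireTableExists, getTable).
val perTable = catalog.listTables(dbName, tablePattern)
  .map(ident => catalog.getTableMetadata(ident))

// Bulk lookup, as suggested: one round-trip for the whole batch, backed by
// HiveMetaStoreClient.getTableObjectsByName. The Spark-side method name here
// is an assumption.
val idents = catalog.listTables(dbName, tablePattern)
val bulk = catalog.getTablesByName(idents)
```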

Member:

@LantaoJin @wangyum Could either of you submit a PR to resolve the issue raised by @juliuszsompolski?

Contributor:

OK, I will take this issue.

pull bot pushed a commit to Pandinosaurus/spark that referenced this pull request Jun 11, 2019
…ilable in ExternalCatalog/SessionCatalog API

## What changes were proposed in this pull request?

The new Spark ThriftServer SparkGetTablesOperation implemented in apache#22794 does a catalog.getTableMetadata request for every table. This can get very slow for large schemas (~50ms per table with an external Hive metastore).
Hive ThriftServer GetTablesOperation uses HiveMetastoreClient.getTableObjectsByName to get table information in bulk, but we don't expose that through our APIs that go through Hive -> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> SessionCatalog.

If we added and exposed getTableObjectsByName through our catalog APIs, we could resolve that performance problem in SparkGetTablesOperation.

## How was this patch tested?

Add UT

Closes apache#24774 from LantaoJin/SPARK-27899.

Authored-by: LantaoJin <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
…ilable in ExternalCatalog/SessionCatalog API

