[SPARK-24570][SQL] Implement Spark own GetTablesOperation to fix SQL client tools cannot show tables #22794
Conversation
Test build #97807 has finished for PR 22794 at commit

Test build #97810 has finished for PR 22794 at commit
mgaido91 left a comment
IIUC, this means we are returning the tables by querying the Hive metastore directly, which may return partial/incomplete/incorrect information (consider, for instance, temporary views). Am I missing something?
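To make the concern concrete, here is a minimal, hypothetical sketch (not code from this PR, and assuming Hive support is on the classpath): a temporary view is registered only in Spark's session catalog, so listing tables straight from the Hive metastore would never return it.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical illustration of the temporary-view concern; names and setup
// are assumptions, not code from this PR.
object TempViewVisibility {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("temp-view-visibility")
      .enableHiveSupport()
      .getOrCreate()

    // The temporary view lives only in Spark's session catalog.
    spark.range(10).createOrReplaceTempView("my_temp_view")

    // Spark's catalog API lists it alongside metastore tables...
    spark.catalog.listTables().show()

    // ...whereas a GetTables operation that queries the Hive metastore
    // directly (executionHive) would not see it.
    spark.stop()
  }
}
```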
Test build #97839 has finished for PR 22794 at commit

Test build #97912 has finished for PR 22794 at commit

Test build #97968 has finished for PR 22794 at commit
Thanks @mgaido91. Changed to
Test build #97980 has finished for PR 22794 at commit
import org.apache.spark.sql.catalyst.catalog.CatalogTableType._
import org.apache.spark.sql.catalyst.catalog.SessionCatalog

private[hive] class SparkGetTablesOperation(
/**
* @param schemaName database name, null or a concrete database name
* @param tableName table name pattern
* @param tableTypes list of allowed table types
* ................
*/
Fixed.
  }
}

test("SPARK-24196: SQL client(DBeaver) can't show tables") {
Create a SparkMetadataOperationSuite?
Fixed.
    schemaName: String,
    tableName: String,
    tableTypes: JList[String])
   (sqlContext: SQLContext, sessionToActivePool: JMap[SessionHandle, String])
Just wondering why we need sessionToActivePool?
Just add sqlContext to the class parameters.
Fixed.
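For readers following along, a rough sketch of the direction discussed above (not necessarily the exact merged signature): drop sessionToActivePool and take sqlContext as an ordinary constructor parameter.

```scala
// Sketch only; parameter order and exact types are assumptions.
private[hive] class SparkGetTablesOperation(
    sqlContext: SQLContext,
    parentSession: HiveSession,
    catalogName: String,
    schemaName: String,
    tableName: String,
    tableTypes: JList[String])
  extends GetTablesOperation(parentSession, catalogName, schemaName, tableName, tableTypes) {
  // ...
}
```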
Test build #98002 has finished for PR 22794 at commit
@wangyum the point I was making in the previous comment is that the current approach is not good IMHO. Now you fixed it for the tables, but the same is true for functions: currently you'd be returning Hive's functions, not Spark's, right? You should override all of them IMHO...
@mgaido91 You are right. But maybe we only override
I am not sure what Hive 1.2 exposes, but we might have more; it needs to be checked. Anyway, yes, those have to be overridden for sure. When I referred to the current approach, I meant the one described in the PR description, which seems to be outdated though. Could you please update it? Thanks.
  }
}

test("Spark's own GetTablesOperation(SparkGetTablesOperation)") {
This test mimics HiveDatabaseMetaData.getTables().
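A hedged sketch of what such a test exercises from the client side: the plain JDBC DatabaseMetaData.getTables call that DBeaver's Navigator issues. The connection URL and output handling here are assumptions for illustration, not the suite's actual code.

```scala
import java.sql.DriverManager

// Assumes a Thrift server listening on localhost:10000; the URL is illustrative.
object GetTablesViaJdbc {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    try {
      // catalog = null, schema pattern = "%", table pattern = "%", all table types
      val rs = conn.getMetaData.getTables(null, "%", "%", null)
      while (rs.next()) {
        println(s"${rs.getString("TABLE_SCHEM")}.${rs.getString("TABLE_NAME")}" +
          s" (${rs.getString("TABLE_TYPE")})")
      }
    } finally {
      conn.close()
    }
  }
}
```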
# Conflicts:
#   sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/SparkSQLOperationManager.scala
#   sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/SparkMetadataOperationSuite.scala
Test build #100921 has finished for PR 22794 at commit
    tableTypes: JList[String]): MetadataOperation = synchronized {
    val sqlContext = sessionToContexts.get(parentSession.getSessionHandle)
    require(sqlContext != null, s"Session handle: ${parentSession.getSessionHandle} has not been" +
      s" initialized or had already closed.")
nit: remove s
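For clarity, the nit refers to the second string literal: it contains no interpolation, so the s prefix is redundant. A small before/after sketch:

```scala
// Before: the second literal needlessly uses the s interpolator.
require(sqlContext != null, s"Session handle: ${parentSession.getSessionHandle} has not been" +
  s" initialized or had already closed.")

// After: keep interpolation only where it is actually used.
require(sqlContext != null, s"Session handle: ${parentSession.getSessionHandle} has not been" +
  " initialized or had already closed.")
```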
Test build #100922 has finished for PR 22794 at commit
Just curious, is there any plan for merging this pull request?
Retest this please.

Test build #102135 has finished for PR 22794 at commit
merge please

Retest this please.

Test build #102444 has finished for PR 22794 at commit
gatorsmile left a comment
LGTM
Thanks! Merged to master.
This change caused errors in the Simba Spark JDBC driver; I can't run any normal queries in SquirrelSQL anymore.
What is the error?
@tooptoop4 I think it works. Could you provide more info?
Try with SSL enabled in the JDBC connect string / Thrift server. Will send the stack trace today.
When I put in spark-hive-thriftserver_2.11-3.0.0.jar cherry-picked without this PR, I don't get an error; with this PR I get the error below when doing a simple select * from sch.tbl limit 5 query, and the schemas don't show either. I am on SQuirreL 3.9.0 (Windows 10) and my Spark Thrift server has TLS enabled with a JKS keystore and LDAP authentication. com.simba.spark.jdbc41.Driver from SimbaSparkJDBC41-1.1.7.1009 is being used. java.sql.SQLException: [Simba]SparkJDBCDriver Error getting schema information: Metadata Initialization Error.
Can you please describe explicit steps with screenshots, one by one? Since it involves a third-party driver, let's be very sure about the reproducer and the output. Otherwise, no one can verify.
We should check which version of Thrift the Simba driver is expecting. Spark is using a very old version of Hive's Thrift protocol; a mismatch could be the problem.
@tooptoop4 Could you give me an email? I will contact you offline.
    val tablePattern = convertIdentifierPattern(tableName, true)
    matchingDbs.foreach { dbName =>
      catalog.listTables(dbName, tablePattern).foreach { tableIdentifier =>
        val catalogTable = catalog.getTableMetadata(tableIdentifier)
This can be very slow for big schemas. Calling getTableMetadata on every table will trigger 3 separate database calls to the metastore (requireDbExists, requireTableExists, and getTable) taking ~tens of ms for every table. So it can be tens of seconds for schemas with hundreds of tables.
The underlying Hive Thriftserver GetTables uses MetastoreClient.getTableObjectsByName (https://hive.apache.org/javadocs/r2.1.1/api/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.html#getTableObjectsByName-java.lang.String-java.util.List-) call to bulk-list the tables, but we don't expose that through our SessionCatalog / ExternalCatalog / HiveClientImpl
Would it be possible to thread that bulk getTableObjectsByName operation through our catalog APIs, to be able to retrieve the tables efficiently here? @wangyum @gatorsmile - what do you think?
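A rough sketch of what that could look like inside the loop above, assuming a bulk lookup (here called getTablesByName, the name SPARK-27899 later settled on) were exposed on the catalog; treat the exact API as an assumption:

```scala
// Sketch: one bulk metadata fetch per database instead of one per table.
matchingDbs.foreach { dbName =>
  val tableIdentifiers = catalog.listTables(dbName, tablePattern)
  // Bulk call backed by Hive's getTableObjectsByName, avoiding the per-table
  // requireDbExists/requireTableExists/getTable round trips to the metastore.
  catalog.getTablesByName(tableIdentifiers).foreach { catalogTable =>
    // add one row to the operation's result set per table here
  }
}
```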
I raised https://issues.apache.org/jira/browse/SPARK-27899 with this idea.
@LantaoJin @wangyum Could either of you submit a PR to resolve the issue raised by @juliuszsompolski ?
OK, I will take this issue.
…ilable in ExternalCatalog/SessionCatalog API

## What changes were proposed in this pull request?

The new Spark ThriftServer SparkGetTablesOperation implemented in apache#22794 does a catalog.getTableMetadata request for every table. This can get very slow for large schemas (~50ms per table with an external Hive metastore). Hive ThriftServer GetTablesOperation uses HiveMetastoreClient.getTableObjectsByName to get table information in bulk, but we don't expose that through our APIs that go through Hive -> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> SessionCatalog.

If we added and exposed getTableObjectsByName through our catalog APIs, we could resolve that performance problem in SparkGetTablesOperation.

## How was this patch tested?

Add UT

Closes apache#24774 from LantaoJin/SPARK-27899.

Authored-by: LantaoJin <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
…client tools cannot show tables

SQL client tools' (e.g. [DBeaver](https://dbeaver.io/)) Navigator uses [`GetTablesOperation`](https://github.com/apache/spark/blob/a7444570764b0a08b7e908dc7931744f9dbdf3c6/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/GetTablesOperation.java) to obtain table names. We should use [`metadataHive`](https://github.com/apache/spark/blob/95d172da2b370ff6257bfd6fcd102ac553f6f6af/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L52-L53), but it uses [`executionHive`](https://github.com/apache/spark/blob/24f5bbd770033dacdea62555488bfffb61665279/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L93-L95).

This PR implements Spark's own `GetTablesOperation` to use `metadataHive`.

Unit test and manual tests:




Closes apache#22794 from wangyum/SPARK-24570.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
What changes were proposed in this pull request?

SQL client tools' (e.g. DBeaver) Navigator uses GetTablesOperation to obtain table names. We should use metadataHive, but it uses executionHive. This PR implements Spark's own GetTablesOperation to use metadataHive.

How was this patch tested?

Unit test and manual tests.
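Putting the pieces of the discussion together, a simplified, hypothetical sketch of the approach (a GetTablesOperation subclass that answers from Spark's own SessionCatalog rather than executionHive); it omits result-set handling and authorization checks and is not the merged code:

```scala
import java.util.{List => JList}

import org.apache.hive.service.cli.operation.GetTablesOperation
import org.apache.hive.service.cli.session.HiveSession

import org.apache.spark.sql.SQLContext

// Hypothetical sketch; the merged implementation differs in details.
private[hive] class SparkGetTablesOperation(
    sqlContext: SQLContext,
    parentSession: HiveSession,
    catalogName: String,
    schemaName: String,
    tableName: String,
    tableTypes: JList[String])
  extends GetTablesOperation(parentSession, catalogName, schemaName, tableName, tableTypes) {

  override def runInternal(): Unit = {
    // Answer from Spark's session catalog instead of querying Hive directly.
    val catalog = sqlContext.sessionState.catalog
    val schemaPattern = convertSchemaPattern(schemaName)
    val tablePattern = convertIdentifierPattern(tableName, true)
    catalog.listDatabases(schemaPattern).foreach { dbName =>
      catalog.listTables(dbName, tablePattern).foreach { tableIdentifier =>
        // Look up each table's metadata and append one row
        // (database, table name, table type, comment) to the result set.
        val catalogTable = catalog.getTableMetadata(tableIdentifier)
        // addToRowSet(...) would go here in the real operation.
      }
    }
  }
}
```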