[SPARK-24570][SQL] Implement Spark own GetTablesOperation to fix SQL client tools cannot show tables #22794
Conversation
Test build #97807 has finished for PR 22794 at commit

Test build #97810 has finished for PR 22794 at commit
mgaido91 left a comment
IIUC, this means we are returning the tables by querying the Hive metastore directly, which may return partial/incomplete/incorrect information (consider, for instance, temporary views). Am I missing something?
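To make the concern concrete, here is a minimal, hypothetical sketch (not code from this PR, and assuming Hive support is on the classpath): a temporary view is registered only in Spark's session catalog, so listing tables straight from the Hive metastore would never return it.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical illustration of the temporary-view concern; names and setup
// are assumptions, not code from this PR.
object TempViewVisibility {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("temp-view-visibility")
      .enableHiveSupport()
      .getOrCreate()

    // The temporary view lives only in Spark's session catalog.
    spark.range(10).createOrReplaceTempView("my_temp_view")

    // Spark's catalog API lists it alongside metastore tables...
    spark.catalog.listTables().show()

    // ...whereas a GetTables operation that queries the Hive metastore
    // directly (executionHive) would not see it.
    spark.stop()
  }
}
```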
Test build #97839 has finished for PR 22794 at commit

Test build #97912 has finished for PR 22794 at commit

Test build #97968 has finished for PR 22794 at commit
Thanks @mgaido91. Changed to
Test build #97980 has finished for PR 22794 at commit
import org.apache.spark.sql.catalyst.catalog.CatalogTableType._
import org.apache.spark.sql.catalyst.catalog.SessionCatalog

private[hive] class SparkGetTablesOperation(
/**
* @param schemaName database name, null or a concrete database name
* @param tableName table name pattern
* @param tableTypes list of allowed table types
* ................
*/
Fixed.
  }
}

test("SPARK-24196: SQL client(DBeaver) can't show tables") {
Create a SparkMetadataOperationSuite?
Fixed.
    schemaName: String,
    tableName: String,
    tableTypes: JList[String])
   (sqlContext: SQLContext, sessionToActivePool: JMap[SessionHandle, String])
Just wondering why we need sessionToActivePool?
Just add sqlContext to the class parameters.
Fixed.
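For readers following along, a rough sketch of the direction discussed above (not necessarily the exact merged signature): drop sessionToActivePool and take sqlContext as an ordinary constructor parameter.

```scala
// Sketch only; parameter order and exact types are assumptions.
private[hive] class SparkGetTablesOperation(
    sqlContext: SQLContext,
    parentSession: HiveSession,
    catalogName: String,
    schemaName: String,
    tableName: String,
    tableTypes: JList[String])
  extends GetTablesOperation(parentSession, catalogName, schemaName, tableName, tableTypes) {
  // ...
}
```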
Test build #98002 has finished for PR 22794 at commit
@wangyum the point I was making in the previous comment is that the current approach is not good IMHO. Now you fixed it for the tables, but the same is true for functions: currently you'd be returning Hive's functions, not Spark's, right? You should override all of them IMHO...
@mgaido91 You are right. But maybe we only override
I am not sure what Hive 1.2 exposes, but we might have more; it needs to be checked. Anyway, yes, those have to be overridden for sure. When I referred to the current approach, I meant the one described in the PR description, which seems to be outdated though. Could you please update it? Thanks.
  }
}

test("Spark's own GetTablesOperation(SparkGetTablesOperation)") {
This test mimics HiveDatabaseMetaData.getTables().
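A hedged sketch of what such a test exercises from the client side: the plain JDBC DatabaseMetaData.getTables call that DBeaver's Navigator issues. The connection URL and output handling here are assumptions for illustration, not the suite's actual code.

```scala
import java.sql.DriverManager

// Assumes a Thrift server listening on localhost:10000; the URL is illustrative.
object GetTablesViaJdbc {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    try {
      // catalog = null, schema pattern = "%", table pattern = "%", all table types
      val rs = conn.getMetaData.getTables(null, "%", "%", null)
      while (rs.next()) {
        println(s"${rs.getString("TABLE_SCHEM")}.${rs.getString("TABLE_NAME")}" +
          s" (${rs.getString("TABLE_TYPE")})")
      }
    } finally {
      conn.close()
    }
  }
}
```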
# Conflicts:
#   sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/SparkSQLOperationManager.scala
#   sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/SparkMetadataOperationSuite.scala
Test build #100921 has finished for PR 22794 at commit
    tableTypes: JList[String]): MetadataOperation = synchronized {
    val sqlContext = sessionToContexts.get(parentSession.getSessionHandle)
    require(sqlContext != null, s"Session handle: ${parentSession.getSessionHandle} has not been" +
      s" initialized or had already closed.")
nit: remove s
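For clarity, the nit refers to the second string literal: it contains no interpolation, so the s prefix is redundant. A small before/after sketch:

```scala
// Before: the second literal needlessly uses the s interpolator.
require(sqlContext != null, s"Session handle: ${parentSession.getSessionHandle} has not been" +
  s" initialized or had already closed.")

// After: keep interpolation only where it is actually used.
require(sqlContext != null, s"Session handle: ${parentSession.getSessionHandle} has not been" +
  " initialized or had already closed.")
```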
Test build #100922 has finished for PR 22794 at commit
Just curious, is there any plan for merging this pull request?
Retest this please.

Test build #102135 has finished for PR 22794 at commit
merge please

Retest this please.

Test build #102444 has finished for PR 22794 at commit
gatorsmile left a comment
LGTM
Thanks! Merged to master.
This change caused errors in the Simba Spark JDBC driver; I can't run any normal queries in SquirrelSQL anymore.
What is the error?
@tooptoop4 I think it works. Could you provide more info?
Try with SSL enabled in the JDBC connect string / Thrift server. Will send the stack trace today.
When I put in spark-hive-thriftserver_2.11-3.0.0.jar cherry-picked without this PR, I don't get an error; with this PR I get the error below when doing a simple select * from sch.tbl limit 5 query, and the schemas don't show either. I am on SQuirreL 3.9.0 (Windows 10) and my Spark Thrift server has TLS enabled with a JKS keystore and LDAP authentication. com.simba.spark.jdbc41.Driver from SimbaSparkJDBC41-1.1.7.1009 is being used. java.sql.SQLException: [Simba]SparkJDBCDriver Error getting schema information: Metadata Initialization Error.
Can you please describe explicit steps with screenshots, one by one? Since it involves a third-party driver, let's be very sure about the reproducer and the output. Otherwise, no one can verify.
We should check which version of Thrift the Simba driver is expecting. Spark is using a very old version of Hive's Thrift protocol; a mismatch could be the problem.
@tooptoop4 Could you give me an email? I will contact you offline.
    val tablePattern = convertIdentifierPattern(tableName, true)
    matchingDbs.foreach { dbName =>
      catalog.listTables(dbName, tablePattern).foreach { tableIdentifier =>
        val catalogTable = catalog.getTableMetadata(tableIdentifier)
This can be very slow for big schemas. Calling getTableMetadata on every table will trigger 3 separate database calls to the metastore (requireDbExists, requireTableExists, and getTable) taking ~tens of ms for every table. So it can be tens of seconds for schemas with hundreds of tables.
The underlying Hive Thriftserver GetTables uses MetastoreClient.getTableObjectsByName (https://hive.apache.org/javadocs/r2.1.1/api/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.html#getTableObjectsByName-java.lang.String-java.util.List-) call to bulk-list the tables, but we don't expose that through our SessionCatalog / ExternalCatalog / HiveClientImpl
Would it be possible to thread that bulk getTableObjectsByName operation through our catalog APIs, to be able to retrieve the tables efficiently here? @wangyum @gatorsmile - what do you think?
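A rough sketch of what that could look like inside the loop above, assuming a bulk lookup (here called getTablesByName, the name SPARK-27899 later settled on) were exposed on the catalog; treat the exact API as an assumption:

```scala
// Sketch: one bulk metadata fetch per database instead of one per table.
matchingDbs.foreach { dbName =>
  val tableIdentifiers = catalog.listTables(dbName, tablePattern)
  // Bulk call backed by Hive's getTableObjectsByName, avoiding the per-table
  // requireDbExists/requireTableExists/getTable round trips to the metastore.
  catalog.getTablesByName(tableIdentifiers).foreach { catalogTable =>
    // add one row to the operation's result set per table here
  }
}
```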
I raised https://issues.apache.org/jira/browse/SPARK-27899 with this idea.
@LantaoJin @wangyum Could either of you submit a PR to resolve the issue raised by @juliuszsompolski ?
OK, I will take this issue.
…ilable in ExternalCatalog/SessionCatalog API

## What changes were proposed in this pull request?

The new Spark ThriftServer SparkGetTablesOperation implemented in apache#22794 does a catalog.getTableMetadata request for every table. This can get very slow for large schemas (~50ms per table with an external Hive metastore). Hive ThriftServer GetTablesOperation uses HiveMetastoreClient.getTableObjectsByName to get table information in bulk, but we don't expose that through our APIs that go through Hive -> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> SessionCatalog.

If we added and exposed getTableObjectsByName through our catalog APIs, we could resolve that performance problem in SparkGetTablesOperation.

## How was this patch tested?

Add UT

Closes apache#24774 from LantaoJin/SPARK-27899.

Authored-by: LantaoJin <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
…client tools cannot show tables

SQL client tools' (e.g. [DBeaver](https://dbeaver.io/)) Navigator uses [`GetTablesOperation`](https://github.com/apache/spark/blob/a7444570764b0a08b7e908dc7931744f9dbdf3c6/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/GetTablesOperation.java) to obtain table names. We should use [`metadataHive`](https://github.com/apache/spark/blob/95d172da2b370ff6257bfd6fcd102ac553f6f6af/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L52-L53), but it uses [`executionHive`](https://github.com/apache/spark/blob/24f5bbd770033dacdea62555488bfffb61665279/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L93-L95).

This PR implements Spark's own `GetTablesOperation` to use `metadataHive`.

Unit test and manual tests:




Closes apache#22794 from wangyum/SPARK-24570.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
What changes were proposed in this pull request?

SQL client tools' (e.g. DBeaver) Navigator uses GetTablesOperation to obtain table names. We should use metadataHive, but it uses executionHive. This PR implements Spark's own GetTablesOperation to use metadataHive.

How was this patch tested?

Unit test and manual tests.
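Putting the pieces of the discussion together, a simplified, hypothetical sketch of the approach (a GetTablesOperation subclass that answers from Spark's own SessionCatalog rather than executionHive); it omits result-set handling and authorization checks and is not the merged code:

```scala
import java.util.{List => JList}

import org.apache.hive.service.cli.operation.GetTablesOperation
import org.apache.hive.service.cli.session.HiveSession

import org.apache.spark.sql.SQLContext

// Hypothetical sketch; the merged implementation differs in details.
private[hive] class SparkGetTablesOperation(
    sqlContext: SQLContext,
    parentSession: HiveSession,
    catalogName: String,
    schemaName: String,
    tableName: String,
    tableTypes: JList[String])
  extends GetTablesOperation(parentSession, catalogName, schemaName, tableName, tableTypes) {

  override def runInternal(): Unit = {
    // Answer from Spark's session catalog instead of querying Hive directly.
    val catalog = sqlContext.sessionState.catalog
    val schemaPattern = convertSchemaPattern(schemaName)
    val tablePattern = convertIdentifierPattern(tableName, true)
    catalog.listDatabases(schemaPattern).foreach { dbName =>
      catalog.listTables(dbName, tablePattern).foreach { tableIdentifier =>
        // Look up each table's metadata and append one row
        // (database, table name, table type, comment) to the result set.
        val catalogTable = catalog.getTableMetadata(tableIdentifier)
        // addToRowSet(...) would go here in the real operation.
      }
    }
  }
}
```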