[SPARK-13080] [SQL] Implement new Catalog API using Hive #11293
Conversation
This required converting o.a.s.sql.catalyst.catalog.Table to its counterpart in o.a.s.sql.hive.client.HiveTable, which in turn required making o.a.s.sql.hive.client.TableType an enum, because we need to create one of these from its name.
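For illustration, a minimal Scala sketch of the name-to-enum pattern this describes (the constant names below are assumptions for the example, not Hive's actual values):

```scala
// Sketch only: an enum-like table type constructible from a name string.
sealed abstract class TableType(val name: String)

object TableType {
  case object ManagedTable extends TableType("MANAGED_TABLE")
  case object ExternalTable extends TableType("EXTERNAL_TABLE")
  case object VirtualView extends TableType("VIRTUAL_VIEW")

  private val values = Seq(ManagedTable, ExternalTable, VirtualView)

  // "Create one of these from name", failing loudly on unknown input.
  def fromName(name: String): TableType =
    values.find(_.name == name).getOrElse(
      throw new IllegalArgumentException(s"Unknown table type: $name"))
}
```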
Currently there's the catalog table, the Spark table used in the hive module, and the Hive table. To avoid converting back and forth between these table representations, we kill the intermediate one, which is the one currently used throughout HiveClient and friends.
Instead, this commit introduces CatalogTableType that serves the same purpose. This adds some type-safety and keeps the code clean.
The operation doesn't support renaming anyway, so it doesn't make sense to pass in a name AND a CatalogDatabase that always has the same name.
We used to pass CatalogTableType#toString into HiveTable, which fails later when Hive extracts the Java enum value from the string. This was the cause of test failures in a few test suites:
- InsertIntoHiveTableSuite
- MultiDatabaseSuite
- ParquetMetastoreSuite
- ...
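This is the classic Java enum lookup pitfall; a small self-contained illustration using a JDK enum rather than Hive's:

```scala
object EnumLookupDemo {
  def main(args: Array[String]): Unit = {
    // Enum.valueOf only accepts the exact constant name...
    println(java.util.concurrent.TimeUnit.valueOf("SECONDS")) // fine
    // ...so any other toString output blows up at lookup time:
    // throws IllegalArgumentException, "Seconds" is not a constant name.
    println(java.util.concurrent.TimeUnit.valueOf("Seconds"))
  }
}
```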
Blatant programming mistake. This was caught by hive.execution.SQLQuerySuite.
When we create views using HiveQl we pass in null data types because we can't specify these types until later. This caused an NPE downstream.
This fixes a failing test in HiveCompatibilitySuite, where Spark was ignoring the character limit in varchar but Hive respected it. The issue was that we were converting Hive types to and from Spark DataType, and in the process losing the limit information. Instead of doing this conversion, we simply encode the data type as a string so we don't lose any information. This means less type-safety, but the real fix is outside the scope of this patch.
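A hedged sketch of the information loss being described, using a deliberately simplified stand-in for the real Hive-to-Spark type conversion:

```scala
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

object VarcharLossDemo {
  // Simplified stand-in for the conversion: Spark's DataType has no varchar
  // carrying a length, so the "(10)" has nowhere to go.
  def toSparkType(hiveType: String): DataType = hiveType match {
    case s if s.startsWith("varchar") => StringType // limit dropped here
    case "int"                        => IntegerType
    case _                            => StringType // simplified fallback
  }

  def main(args: Array[String]): Unit = {
    println(toSparkType("varchar(10)").simpleString) // prints "string"
    // Converting back can only yield "string", never "varchar(10)";
    // keeping the raw type string in the catalog avoids the loss.
  }
}
```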
I missed one place where the data type was still a DataType, but not a string.
This suite extends the existing CatalogTestCases. Many tests needed to be modified significantly for Hive to work. Even after many hours spent trying to make this work, there is still one that doesn't pass for some reason. In particular, I was not able to call "alterPartitions" on an existing Hive table as of this commit. That test is temporarily ignored for now. The rest of the tests added in this commit should pass.
It turns out that you need to run "USE my_database" before "ALTER TABLE my_table PARTITION ..." (HIVE-2742). Geez.
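A minimal sketch of that workaround, with `runSql` as a hypothetical hook for executing one HiveQL statement:

```scala
object AlterPartitionWorkaround {
  // Switch into the table's database first; per HIVE-2742, an
  // ALTER TABLE ... PARTITION statement fails without the USE.
  def alterPartition(runSql: String => Unit, db: String, alterDdl: String): Unit = {
    runSql(s"USE $db")
    runSql(alterDdl)
  }
}
```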
I simply brought #11189 up to date and resolved some code review issues so we can merge this quickly and unblock some other work.
LGTM pending tests
Thanks, @davies also suggested offline to rename all
Test build #2559 has finished for PR 11293 at commit
Alright let's discuss the renaming. I initially just used Table, but I think both could work (Table or CatalogTable), because we might have many "tables".
Going to merge this in master.
Test build #51644 has finished for PR 11293 at commit
Yeah that's why I renamed it in the first place. As of this patch though there are no more classes that are called just `Table`.
I didn't review the core parts of this PR yet, hopefully @rxin has done that.
 * Run some code involving `client` in a [[synchronized]] block and wrap certain
 * exceptions thrown in the process in [[AnalysisException]].
 */
private def withClient[T](body: => T): T = synchronized {
cc @andrewor14 why does this one need to be synchronized?
shouldn't all methods of the catalog be synchronized?
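For readers following this thread, a simplified sketch of what a wrapper like this does (Spark's actual method wraps certain Hive exceptions in `AnalysisException`; a generic rethrow stands in for that mapping here):

```scala
class HiveCatalogSketch(client: AnyRef) {
  // Simplified sketch, not Spark's implementation: serialize access to the
  // shared Hive client (hence synchronized) and translate client failures
  // into a single exception type the caller expects.
  private def withClient[T](body: => T): T = synchronized {
    try body catch {
      case e: Exception =>
        throw new RuntimeException(s"Hive client call failed: ${e.getMessage}", e)
    }
  }
}
```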
    db: String,
    table: String,
    specs: Seq[Catalog.TablePartitionSpec]): Unit = withHiveState {
  // TODO: figure out how to drop multiple partitions in one call
This TODO still exists in the source code. Actually, the Hive MetaStore Client API has a method for deleting multiple partitions:
public List<Partition> dropPartitions(String dbName,
                                      String tblName,
                                      List<ObjectPair<Integer,byte[]>> partExprs,
                                      boolean deleteData,
                                      boolean ifExists) throws NoSuchObjectException,
                                                               MetaException,
                                                               org.apache.thrift.TException
Potentially, we could optimize it in OSS or in ... cc @cloud-fan
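For context, a hedged sketch of the one-call-per-partition approach the TODO refers to (the wiring and spec shape here are assumptions; partition values must be passed in partition-column order):

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.metastore.IMetaStoreClient

object DropPartitionsSketch {
  // Without the batched dropPartitions call quoted above, each partition
  // spec costs a separate metastore round trip.
  def dropOneByOne(
      client: IMetaStoreClient,
      db: String,
      table: String,
      specs: Seq[Seq[String]]): Unit = {
    specs.foreach { partValues =>
      // One RPC per partition; the batched API could collapse these.
      client.dropPartition(db, table, partValues.asJava, true /* deleteData */)
    }
  }
}
```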
What changes were proposed in this pull request?
This is a step towards merging `SQLContext` and `HiveContext`. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation.

Where should I start reviewing?

The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor.

Why is this patch so big?

I had to refactor `HiveClient` to remove an intermediate representation of databases, tables, partitions etc. After this refactor, `CatalogTable` converts directly to and from `HiveTable` (etc.). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to `HiveTable`, which is messy.

The new class hierarchy is as follows:
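A sketch of the relationships described above (the original listing is not preserved here, so the names come from the surrounding text and the structure is assumed):

```scala
// Reconstructed sketch, not the PR's verbatim hierarchy listing.
trait Catalog                                          // internal Catalog API (#10982, #11069)
class InMemoryCatalog extends Catalog                  // existing in-memory implementation
trait HiveClient                                       // existing interface to Hive
class HiveClientImpl extends HiveClient                // where most of the new logic is
class HiveCatalog(client: HiveClient) extends Catalog  // introduced by this patch
```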
Note that, as of this patch, none of these classes are currently used anywhere yet. This will come in the future before the Spark 2.0 release.
How was this patch tested?
All existing unit tests, and a new `HiveCatalogSuite` that extends `CatalogTestCases`.