Conversation

@rxin (Contributor) commented Feb 21, 2016

What changes were proposed in this pull request?

This is a step towards merging SQLContext and HiveContext. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using HiveClient, an existing interface to Hive. It also extends HiveClient with additional calls to Hive that are needed to complete the catalog implementation.

Where should I start reviewing? The new catalog introduced is HiveCatalog. This class is relatively simple because it just calls HiveClientImpl, where most of the new logic is. I would not start with HiveClient, HiveQl, or HiveMetastoreCatalog, which are modified mainly because of a refactor.

Why is this patch so big? I had to refactor HiveClient to remove an intermediate representation of databases, tables, partitions, etc. After this refactor, CatalogTable converts directly to and from HiveTable (and likewise for the other entities). Otherwise we would have to first convert CatalogTable to the intermediate representation and then convert that to HiveTable, which is messy.
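
As a rough illustration of the single-step conversion this refactor enables (a hedged sketch with stand-in types and field lists; the real conversions live in HiveClientImpl and carry many more fields):

object TableConversions {
  // Stand-ins for the real CatalogTable and HiveTable classes.
  case class CatalogTable(name: String, tableType: String, schema: Seq[(String, String)])
  case class HiveTable(name: String, tableType: String, columns: Seq[(String, String)])

  // One hop in each direction, with no intermediate representation in between.
  def toHiveTable(t: CatalogTable): HiveTable =
    HiveTable(t.name, t.tableType, t.schema)

  def fromHiveTable(t: HiveTable): CatalogTable =
    CatalogTable(t.name, t.tableType, t.columns)
}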

The new class hierarchy is as follows:

org.apache.spark.sql.catalyst.catalog.Catalog
  - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
  - org.apache.spark.sql.hive.HiveCatalog
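
For orientation, a minimal sketch of the shape this hierarchy takes (method names and signatures here are illustrative, not the actual Catalog API from #10982/#11069):

trait Catalog {
  def createDatabase(db: String, ignoreIfExists: Boolean): Unit
  def dropTable(db: String, table: String, ignoreIfNotExists: Boolean): Unit
}

// The in-memory implementation backs tests; HiveCatalog instead delegates
// each call to HiveClientImpl.
class InMemoryCatalog extends Catalog {
  import scala.collection.mutable
  private val dbs = mutable.Map.empty[String, mutable.Set[String]]

  override def createDatabase(db: String, ignoreIfExists: Boolean): Unit = {
    if (dbs.contains(db)) {
      if (!ignoreIfExists) sys.error(s"Database $db already exists")
    } else {
      dbs(db) = mutable.Set.empty[String]
    }
  }

  override def dropTable(db: String, table: String, ignoreIfNotExists: Boolean): Unit = {
    val removed = dbs.get(db).exists(_.remove(table))
    if (!removed && !ignoreIfNotExists) sys.error(s"Table $db.$table does not exist")
  }
}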

Note that, as of this patch, none of these classes are used anywhere yet. That will change before the Spark 2.0 release.

How was this patch tested?

All existing unit tests, plus a new HiveCatalogSuite that extends CatalogTestCases.

Andrew Or added 30 commits February 10, 2016 13:16
This required converting o.a.s.sql.catalyst.catalog.Table to its
counterpart in o.a.s.sql.hive.client.HiveTable, which in turn
required making o.a.s.sql.hive.client.TableType an enum because
we need to create one of these from a name.
Currently there's the catalog table, the Spark table used in the
hive module, and the Hive table. To avoid converting back and
forth between these table representations, we kill the
intermediate one, which is the one currently used throughout
HiveClient and friends. Instead, this commit introduces
CatalogTableType, which serves the same purpose. This adds some
type-safety and keeps the code clean (see the sketch below).
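
A hedged sketch of what such a type-safe CatalogTableType can look like, including creation from a name, which is what the earlier TableType enum was needed for (the constants are illustrative):

sealed abstract class CatalogTableType(val name: String)

object CatalogTableType {
  case object MANAGED_TABLE extends CatalogTableType("MANAGED_TABLE")
  case object EXTERNAL_TABLE extends CatalogTableType("EXTERNAL_TABLE")
  case object VIRTUAL_VIEW extends CatalogTableType("VIRTUAL_VIEW")

  private val all = Seq(MANAGED_TABLE, EXTERNAL_TABLE, VIRTUAL_VIEW)

  // Recover a value from its name, e.g. when reading back from the metastore.
  def fromString(name: String): CatalogTableType =
    all.find(_.name == name).getOrElse(
      throw new IllegalArgumentException(s"Unknown table type: $name"))
}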
The operation doesn't support renaming anyway, so it doesn't
make sense to pass in a name AND a CatalogDatabase that always
has the same name.
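
In other words, a hypothetical before/after of the signature change (names are assumed for illustration):

case class CatalogDatabase(name: String, description: String, locationUri: String)

// Before: the name parameter must always equal db.name, since the
// operation cannot rename.
trait Before { def alterDatabase(name: String, db: CatalogDatabase): Unit }

// After: the database object itself carries the key.
trait After { def alterDatabase(db: CatalogDatabase): Unit }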
We used to pass CatalogTableType#toString into HiveTable, which
fails later when Hive extracts the Java enum value from the
string (see the sketch after the list below). This was the cause
of test failures in a few test suites:

- InsertIntoHiveTableSuite
- MultiDatabaseSuite
- ParquetMetastoreSuite
- ...
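
A hedged sketch of the fix, reusing the CatalogTableType sketch above: Hive eventually calls TableType.valueOf on the stored string, so the value written into HiveTable must match a Java enum constant exactly (the explicit mapping below is illustrative):

import org.apache.hadoop.hive.metastore.TableType

object HiveTableTypeMapping {
  // Map explicitly onto Hive's Java enum names rather than relying on
  // whatever CatalogTableType#toString happens to produce.
  def toHiveTableType(t: CatalogTableType): String = t match {
    case CatalogTableType.MANAGED_TABLE  => TableType.MANAGED_TABLE.name
    case CatalogTableType.EXTERNAL_TABLE => TableType.EXTERNAL_TABLE.name
    case CatalogTableType.VIRTUAL_VIEW   => TableType.VIRTUAL_VIEW.name
  }
}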
Blatant programming mistake. This was caught by
hive.execution.SQLQuerySuite.
When we create views using HiveQl we pass in null data types
because we can't specify these types until later. This caused
an NPE downstream.
This fixes a failing test in HiveCompatibilitySuite, where Spark
was ignoring the character limit in varchar but Hive respected it.
The issue was that we were converting Hive types to and from
Spark DataType, and in the process losing the limit information.

Instead of doing this conversion, we simply encode the data type
as a string so we don't lose any information. This means less
type-safety, but the real fix is outside the scope of this patch.
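
A hedged illustration of the loss (the column class and field names are assumptions for this sketch): round-tripping "varchar(10)" through Spark's DataType collapses it to plain StringType, while storing the raw Hive type string keeps the limit.

// Hypothetical column representation: the Hive type is kept verbatim as a
// string ("varchar(10)") instead of being parsed into a Spark DataType,
// which would come back as StringType and drop the length.
case class CatalogColumn(
    name: String,
    dataType: String,
    nullable: Boolean = true)

val col = CatalogColumn("name", "varchar(10)")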
I missed one place where the data type was still a DataType, but
not a string.
This suite extends the existing CatalogTestCases. Many tests
needed to be modified significantly for Hive to work. Even after
many hours spent trying to make this work, there is still one
test that doesn't pass for some reason. In particular, I was not
able to call "alterPartitions" on an existing Hive table as of
this commit. That test is temporarily ignored for now. The rest
of the tests added in this commit should pass.
Andrew Or and others added 6 commits February 18, 2016 13:06
It turns out that you need to run "USE my_database" before
"ALTER TABLE my_table PARTITION ..." (HIVE-2742). Geez.
This was caused by cb288da, an attempt to clean up some duplicate
code. It turns out that HiveClient and HiveClientImpl cannot both
refer to Hive classes due to some classloader issues. Surprise...

This commit reverts part of the changes introduced in cb288da.
[SPARK-13080] [SQL] Implement new Catalog API using Hive
@rxin (Contributor, Author) commented Feb 21, 2016

I simply brought #11189 up to date and resolved some code review issues so we can merge this quickly and unblock some other work.

@hvanhovell (Contributor)

LGTM pending tests

@andrewor14 (Contributor)

Thanks. @davies also suggested offline that we rename CatalogTable and related classes to just Table. We can do that separately after this patch gets merged.

@SparkQA commented Feb 21, 2016

Test build #2559 has finished for PR 11293 at commit 6703aa5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class NoSuchItemException extends Exception

@rxin (Contributor, Author) commented Feb 21, 2016

Alright, let's discuss the renaming. I initially just used Table, but I think either could work (Table or CatalogTable), because we might have many "tables".

@rxin (Contributor, Author) commented Feb 21, 2016

Going to merge this in master.

@asfgit closed this in 6c3832b Feb 21, 2016
@SparkQA commented Feb 21, 2016

Test build #51644 has finished for PR 11293 at commit 6703aa5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class NoSuchItemException extends Exception

@andrewor14 (Contributor)

> because we might have many "tables".

Yeah, that's why I renamed it in the first place. As of this patch, though, there are no classes called just Table anymore.

@davies (Contributor) commented Feb 22, 2016

I haven't reviewed the core parts of this PR yet; hopefully @rxin has done that.

* Run some code involving `client` in a [[synchronized]] block and wrap certain
* exceptions thrown in the process in [[AnalysisException]].
*/
private def withClient[T](body: => T): T = synchronized {
@rxin (Contributor, Author)

cc @andrewor14 why does this one need to be synchronized?

(Contributor)

shouldn't all methods of the catalog be synchronized?
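
A minimal sketch of the pattern under discussion, with stand-ins for the Spark types so it is self-contained (the real NoSuchItemException is the abstract class listed in the test output above, and AnalysisException is Spark's):

// Stand-ins for the real Spark exception types.
class NoSuchItemException(msg: String) extends Exception(msg)
class AnalysisException(msg: String) extends Exception(msg)

trait CatalogGuard {
  // Serialize all calls into the shared Hive client and surface Hive's
  // "not found" failures as analysis errors.
  protected def withClient[T](body: => T): T = synchronized {
    try body catch {
      case e: NoSuchItemException => throw new AnalysisException(e.getMessage)
    }
  }
}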

db: String,
table: String,
specs: Seq[Catalog.TablePartitionSpec]): Unit = withHiveState {
// TODO: figure out how to drop multiple partitions in one call
(Member)

This TODO still exists in the source code. Actually, the Hive metastore client API has a method for dropping multiple partitions:

public List<Partition> dropPartitions(String dbName,
                                      String tblName,
                                      List<ObjectPair<Integer,byte[]>> partExprs,
                                      boolean deleteData,
                                      boolean ifExists) throws NoSuchObjectException,
                                      MetaException,
                                      org.apache.thrift.TException

Potentially, we could optimize it in OSS or in ... cc @cloud-fan
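
For context, a hedged sketch of the per-partition loop the TODO implies today; the dropPartitions call quoted above could collapse these into a single metastore round trip (the helper below is hypothetical, and the ordering of partition values is glossed over):

import scala.collection.JavaConverters._

def dropPartitionsOneByOne(
    hive: org.apache.hadoop.hive.ql.metadata.Hive,
    db: String,
    table: String,
    specs: Seq[Map[String, String]]): Unit = {
  // One metastore call per partition spec (column -> value).
  specs.foreach { spec =>
    hive.dropPartition(db, table, spec.values.toList.asJava, true /* deleteData */)
  }
}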
