Conversation

@rxin (Contributor) commented Feb 21, 2016

What changes were proposed in this pull request?

This is a step towards merging SQLContext and HiveContext. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using HiveClient, an existing interface to Hive. It also extends HiveClient with additional calls to Hive that are needed to complete the catalog implementation.

Where should I start reviewing? The new catalog introduced is HiveCatalog. This class is relatively simple because it just calls HiveClientImpl, where most of the new logic is. I would not start with HiveClient, HiveQl, or HiveMetastoreCatalog, which are modified mainly because of a refactor.

Why is this patch so big? I had to refactor HiveClient to remove an intermediate representation of databases, tables, partitions, etc. After this refactor, CatalogTable converts directly to and from HiveTable (and likewise for the other entities). Otherwise we would have to first convert CatalogTable to the intermediate representation and then convert that to HiveTable, which is messy.
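
As a rough illustration of the single-step conversion this refactor enables (a hedged sketch with stand-in types and field lists; the real conversions live in HiveClientImpl and carry many more fields):

object TableConversions {
  // Stand-ins for the real CatalogTable and HiveTable classes.
  case class CatalogTable(name: String, tableType: String, schema: Seq[(String, String)])
  case class HiveTable(name: String, tableType: String, columns: Seq[(String, String)])

  // One hop in each direction, with no intermediate representation in between.
  def toHiveTable(t: CatalogTable): HiveTable =
    HiveTable(t.name, t.tableType, t.schema)

  def fromHiveTable(t: HiveTable): CatalogTable =
    CatalogTable(t.name, t.tableType, t.columns)
}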

The new class hierarchy is as follows:

org.apache.spark.sql.catalyst.catalog.Catalog
  - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
  - org.apache.spark.sql.hive.HiveCatalog
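
For orientation, a minimal sketch of the shape this hierarchy takes (method names and signatures here are illustrative, not the actual Catalog API from #10982/#11069):

trait Catalog {
  def createDatabase(db: String, ignoreIfExists: Boolean): Unit
  def dropTable(db: String, table: String, ignoreIfNotExists: Boolean): Unit
}

// The in-memory implementation backs tests; HiveCatalog instead delegates
// each call to HiveClientImpl.
class InMemoryCatalog extends Catalog {
  import scala.collection.mutable
  private val dbs = mutable.Map.empty[String, mutable.Set[String]]

  override def createDatabase(db: String, ignoreIfExists: Boolean): Unit = {
    if (dbs.contains(db)) {
      if (!ignoreIfExists) sys.error(s"Database $db already exists")
    } else {
      dbs(db) = mutable.Set.empty[String]
    }
  }

  override def dropTable(db: String, table: String, ignoreIfNotExists: Boolean): Unit = {
    val removed = dbs.get(db).exists(_.remove(table))
    if (!removed && !ignoreIfNotExists) sys.error(s"Table $db.$table does not exist")
  }
}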

Note that, as of this patch, none of these classes are used anywhere yet. That will change before the Spark 2.0 release.

How was this patch tested?

All existing unit tests, plus a new HiveCatalogSuite that extends CatalogTestCases.

Andrew Or added 30 commits February 10, 2016 13:16
This required converting o.a.s.sql.catalyst.catalog.Table to its
counterpart in o.a.s.sql.hive.client.HiveTable, which in turn
required making o.a.s.sql.hive.client.TableType an enum because
we need to create one of these from a name.
Currently there's the catalog table, the Spark table used in the
hive module, and the Hive table. To avoid converting back and
forth between these table representations, we kill the
intermediate one, which is the one currently used throughout
HiveClient and friends. Instead, this commit introduces
CatalogTableType, which serves the same purpose. This adds some
type-safety and keeps the code clean (see the sketch below).
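
A hedged sketch of what such a type-safe CatalogTableType can look like, including creation from a name, which is what the earlier TableType enum was needed for (the constants are illustrative):

sealed abstract class CatalogTableType(val name: String)

object CatalogTableType {
  case object MANAGED_TABLE extends CatalogTableType("MANAGED_TABLE")
  case object EXTERNAL_TABLE extends CatalogTableType("EXTERNAL_TABLE")
  case object VIRTUAL_VIEW extends CatalogTableType("VIRTUAL_VIEW")

  private val all = Seq(MANAGED_TABLE, EXTERNAL_TABLE, VIRTUAL_VIEW)

  // Recover a value from its name, e.g. when reading back from the metastore.
  def fromString(name: String): CatalogTableType =
    all.find(_.name == name).getOrElse(
      throw new IllegalArgumentException(s"Unknown table type: $name"))
}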
The operation doesn't support renaming anyway, so it doesn't
make sense to pass in a name AND a CatalogDatabase that always
has the same name.
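
In other words, a hypothetical before/after of the signature change (names are assumed for illustration):

case class CatalogDatabase(name: String, description: String, locationUri: String)

// Before: the name parameter must always equal db.name, since the
// operation cannot rename.
trait Before { def alterDatabase(name: String, db: CatalogDatabase): Unit }

// After: the database object itself carries the key.
trait After { def alterDatabase(db: CatalogDatabase): Unit }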
We used to pass CatalogTableType#toString into HiveTable, which
fails later when Hive extracts the Java enum value from the
string (see the sketch after the list below). This was the cause
of test failures in a few test suites:

- InsertIntoHiveTableSuite
- MultiDatabaseSuite
- ParquetMetastoreSuite
- ...
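
A hedged sketch of the fix, reusing the CatalogTableType sketch above: Hive eventually calls TableType.valueOf on the stored string, so the value written into HiveTable must match a Java enum constant exactly (the explicit mapping below is illustrative):

import org.apache.hadoop.hive.metastore.TableType

object HiveTableTypeMapping {
  // Map explicitly onto Hive's Java enum names rather than relying on
  // whatever CatalogTableType#toString happens to produce.
  def toHiveTableType(t: CatalogTableType): String = t match {
    case CatalogTableType.MANAGED_TABLE  => TableType.MANAGED_TABLE.name
    case CatalogTableType.EXTERNAL_TABLE => TableType.EXTERNAL_TABLE.name
    case CatalogTableType.VIRTUAL_VIEW   => TableType.VIRTUAL_VIEW.name
  }
}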
Blatant programming mistake. This was caught by
hive.execution.SQLQuerySuite.
When we create views using HiveQl we pass in null data types
because we can't specify these types until later. This caused
an NPE downstream.
This fixes a failing test in HiveCompatibilitySuite, where Spark
was ignoring the character limit in varchar but Hive respected it.
The issue was that we were converting Hive types to and from
Spark DataType, and in the process losing the limit information.

Instead of doing this conversion, we simply encode the data type
as a string so we don't lose any information. This means less
type-safety, but the real fix is outside the scope of this patch.
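
A hedged illustration of the loss (the column class and field names are assumptions for this sketch): round-tripping "varchar(10)" through Spark's DataType collapses it to plain StringType, while storing the raw Hive type string keeps the limit.

// Hypothetical column representation: the Hive type is kept verbatim as a
// string ("varchar(10)") instead of being parsed into a Spark DataType,
// which would come back as StringType and drop the length.
case class CatalogColumn(
    name: String,
    dataType: String,
    nullable: Boolean = true)

val col = CatalogColumn("name", "varchar(10)")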
I missed one place where the data type was still a DataType, but
not a string.
This suite extends the existing CatalogTestCases. Many tests
needed to be modified significantly for Hive to work. Even after
many hours spent trying to make this work, there is still one
test that doesn't pass for some reason. In particular, I was not
able to call "alterPartitions" on an existing Hive table as of
this commit. That test is temporarily ignored for now. The rest
of the tests added in this commit should pass.
Andrew Or and others added 6 commits February 18, 2016 13:06
It turns out that you need to run "USE my_database" before
"ALTER TABLE my_table PARTITION ..." (HIVE-2742). Geez.
This was caused by cb288da, an attempt to clean up some duplicate
code. It turns out that HiveClient and HiveClientImpl cannot both
refer to Hive classes due to some classloader issues. Surprise...

This commit reverts part of the changes introduced in cb288da.
[SPARK-13080] [SQL] Implement new Catalog API using Hive
@rxin (Contributor, Author) commented Feb 21, 2016

I simply brought #11189 up to date and resolved some code review issues so we can merge this quickly and unblock some other work.

@hvanhovell (Contributor)

LGTM pending tests

@andrewor14 (Contributor)

Thanks. @davies also suggested offline that we rename CatalogTable and related classes to just Table. We can do that separately after this patch gets merged.

@SparkQA commented Feb 21, 2016

Test build #2559 has finished for PR 11293 at commit 6703aa5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class NoSuchItemException extends Exception

@rxin (Contributor, Author) commented Feb 21, 2016

Alright, let's discuss the renaming. I initially just used Table, but I think either could work (Table or CatalogTable), because we might have many "tables".

@rxin (Contributor, Author) commented Feb 21, 2016

Going to merge this in master.

@asfgit closed this in 6c3832b Feb 21, 2016
@SparkQA commented Feb 21, 2016

Test build #51644 has finished for PR 11293 at commit 6703aa5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class NoSuchItemException extends Exception

@andrewor14 (Contributor)

> because we might have many "tables".

Yeah, that's why I renamed it in the first place. As of this patch, though, there are no classes called just Table anymore.

@davies (Contributor) commented Feb 22, 2016

I haven't reviewed the core parts of this PR yet; hopefully @rxin has done that.

* Run some code involving `client` in a [[synchronized]] block and wrap certain
* exceptions thrown in the process in [[AnalysisException]].
*/
private def withClient[T](body: => T): T = synchronized {
@rxin (Contributor, Author)

cc @andrewor14 why does this one need to be synchronized?

(Contributor)

shouldn't all methods of the catalog be synchronized?
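
A minimal sketch of the pattern under discussion, with stand-ins for the Spark types so it is self-contained (the real NoSuchItemException is the abstract class listed in the test output above, and AnalysisException is Spark's):

// Stand-ins for the real Spark exception types.
class NoSuchItemException(msg: String) extends Exception(msg)
class AnalysisException(msg: String) extends Exception(msg)

trait CatalogGuard {
  // Serialize all calls into the shared Hive client and surface Hive's
  // "not found" failures as analysis errors.
  protected def withClient[T](body: => T): T = synchronized {
    try body catch {
      case e: NoSuchItemException => throw new AnalysisException(e.getMessage)
    }
  }
}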

db: String,
table: String,
specs: Seq[Catalog.TablePartitionSpec]): Unit = withHiveState {
// TODO: figure out how to drop multiple partitions in one call
(Member)

This TODO still exists in the source code. Actually, the Hive metastore client API has a method for dropping multiple partitions:

public List<Partition> dropPartitions(String dbName,
                                      String tblName,
                                      List<ObjectPair<Integer,byte[]>> partExprs,
                                      boolean deleteData,
                                      boolean ifExists) throws NoSuchObjectException,
                                      MetaException,
                                      org.apache.thrift.TException

Potentially, we could optimize it in OSS or in ... cc @cloud-fan
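
For context, a hedged sketch of the per-partition loop the TODO implies today; the dropPartitions call quoted above could collapse these into a single metastore round trip (the helper below is hypothetical, and the ordering of partition values is glossed over):

import scala.collection.JavaConverters._

def dropPartitionsOneByOne(
    hive: org.apache.hadoop.hive.ql.metadata.Hive,
    db: String,
    table: String,
    specs: Seq[Map[String, String]]): Unit = {
  // One metastore call per partition spec (column -> value).
  specs.foreach { spec =>
    hive.dropPartition(db, table, spec.values.toList.asJava, true /* deleteData */)
  }
}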
