[SPARK-21617][SQL] Store correct metadata in Hive for altered DS table. #18824
Conversation
This change fixes two issues:

- When loading table metadata from Hive, restore the "provider" field of CatalogTable so DS tables can be identified.
- When altering a DS table in the Hive metastore, make sure not to alter the table's schema, since the DS table's schema is stored as a table property in those cases (see the sketch below).

Also added a new unit test for this issue, which fails without this change.
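To make the second point concrete, here is a minimal, self-contained sketch of the intended behavior. It is not the Spark code itself: `SimpleTable`, `SCHEMA_PROP`, and `alterSchema` are made-up names standing in for `CatalogTable` and the logic in `HiveExternalCatalog.alterTableSchema`.

```scala
// Hedged sketch only: a toy model of "keep the metastore schema untouched for DS tables
// and refresh the schema stored in table properties instead".
object AlterSchemaSketch {
  // Simplified stand-in for CatalogTable: a column list plus a property map.
  case class SimpleTable(schema: Seq[(String, String)], properties: Map[String, String])

  val PROVIDER_PROP = "spark.sql.sources.provider" // assumed marker property for DS tables
  val SCHEMA_PROP   = "spark.sql.sources.schema"   // assumed location of the real schema

  def isDataSourceTable(t: SimpleTable): Boolean =
    t.properties.get(PROVIDER_PROP).exists(_ != "hive")

  def alterSchema(raw: SimpleTable, newSchema: Seq[(String, String)]): SimpleTable = {
    val encoded = newSchema.map { case (name, dt) => s"$name:$dt" }.mkString(",")
    if (isDataSourceTable(raw)) {
      // DS table: leave the schema recorded in the metastore alone, update the property.
      raw.copy(properties = raw.properties + (SCHEMA_PROP -> encoded))
    } else {
      // Hive serde table: the metastore schema is the real schema, so update it directly.
      raw.copy(schema = newSchema)
    }
  }
}
```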
| "compatible way. Updating Hive metastore in Spark SQL specific format." | ||
| logWarning(warningMessage, e) | ||
| client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema)) | ||
| client.alterTable(updatedTable.copy(schema = tableToStore.partitionSchema)) |
This is the exception handling code I mentioned in the bug report which seems very suspicious. I had half a desire to just remove it, but maybe someone can explain to me why this code makes sense.
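For context, here is a self-contained sketch of the fallback pattern this handler appears to implement: try the Hive-compatible alter first, then retry with only the partition columns as the metastore schema. `Table`, `FakeClient`, and `alterWithFallback` are invented for illustration and are not the actual Spark classes.

```scala
import scala.util.control.NonFatal

// Hedged sketch: only the try/catch shape mirrors the handler shown in the diff above.
object FallbackSketch {
  case class Table(schema: Seq[String], partitionSchema: Seq[String])

  // Pretend client that rejects schemas the metastore cannot represent.
  class FakeClient(hiveCompatible: Boolean) {
    def alterTable(t: Table): Unit =
      if (!hiveCompatible && t.schema != t.partitionSchema)
        throw new IllegalArgumentException("unsupported schema")
  }

  def alterWithFallback(client: FakeClient, table: Table): Unit =
    try {
      client.alterTable(table)
    } catch {
      case NonFatal(e) =>
        println("Could not alter the table in a Hive compatible way; " +
          s"updating Hive metastore in Spark SQL specific format (${e.getMessage})")
        // Fall back to storing only the partition columns as the metastore schema.
        client.alterTable(table.copy(schema = table.partitionSchema))
    }
}
```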
I think this part is directly related to the logic which converts the table metadata to Spark SQL specific format:
spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
Lines 290 to 301 in f18b905
    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
      table.copy(
        // Hive only allows directory paths as location URIs while Spark SQL data source tables
        // also allow file paths. For non-hive-compatible format, we should not set location URI
        // to avoid hive metastore to throw exception.
        storage = table.storage.copy(
          locationUri = None,
          properties = storagePropsWithLocation),
        schema = table.partitionSchema,
        bucketSpec = None,
        properties = table.properties ++ tableProperties)
    }
Hmm, after I made some changes to the test, the whole test suite is failing (although running the tests individually works). I'll work on that, but the fix itself, other than the test, should be correct.
Test build #80182 has finished for PR 18824 at commit
    val properties = Option(h.getParameters).map(_.asScala.toMap).getOrElse(Map())

    val provider = properties.get(HiveExternalCatalog.DATASOURCE_PROVIDER)
      .orElse(Some(DDLUtils.HIVE_PROVIDER))
Previously we didn't store the provider for Hive serde tables. Some existing logic that decides whether a table retrieved from the metastore is a data source table may be broken by this change.
Oh, nvm. Looks like we access the key DATASOURCE_PROVIDER in table.properties for that purpose, so this should be safe. Anyway, we actually set the provider for CatalogTable later, when restoring the table read from the metastore. Maybe this is redundant.
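For illustration, a small self-contained sketch of the two ways a raw table read back from the metastore could be recognized as a data source table, as discussed above. `RawTable` is a toy stand-in for `CatalogTable`, and the property-key string is assumed here rather than taken from this PR.

```scala
// Hedged sketch: contrasts a check on the raw table properties with a check on the
// restored provider field.
object ProviderCheckSketch {
  case class RawTable(provider: Option[String], properties: Map[String, String])

  val DATASOURCE_PROVIDER = "spark.sql.sources.provider" // assumed value of the constant
  val HIVE_PROVIDER = "hive"

  // Works on the raw metastore table, before restoreTableMetadata has run.
  def isDataSourceTableByProps(t: RawTable): Boolean =
    t.properties.contains(DATASOURCE_PROVIDER)

  // Only works after the provider field has actually been restored/set.
  def isDataSourceTableByProvider(t: RawTable): Boolean =
    t.provider.exists(_ != HIVE_PROVIDER)
}
```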
Another concern is that we previously didn't restore the provider for a view; please refer to
spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
Line 682 in f18b905
    case None if table.tableType == VIEW =>
With this change, the provider would be set to HIVE_PROVIDER for views too.
> Maybe this is redundant.
This was definitely not redundant in my testing. The metadata loaded from the metastore in HiveExternalCatalog.alterTableSchema definitely did not have the provider set when I debugged this. In fact, the test I wrote fails if I remove this code (or comment out the line that sets "provider" a few lines below).
Perhaps some other part of the code sets it in a different code path, but this would make that part of the code redundant, not the other way around.
The restoring you mention is done in HiveExternalCatalog.restoreTableMetadata. Let me see if I can use that instead of making this change.
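For readers following along, a rough, self-contained sketch of the kind of dispatch restoreTableMetadata does on that property, anchored on the `case None if table.tableType == VIEW =>` line referenced above. Apart from that pattern, the types and behavior here are simplified stand-ins, not the real Spark internals.

```scala
// Hedged sketch of the restore-time dispatch on the DATASOURCE_PROVIDER property.
object RestoreSketch {
  sealed trait TableType
  case object VIEW extends TableType
  case object MANAGED extends TableType

  case class RawTable(
      tableType: TableType,
      properties: Map[String, String],
      provider: Option[String])

  val DATASOURCE_PROVIDER = "spark.sql.sources.provider" // assumed value of the constant
  val HIVE_PROVIDER = "hive"

  def restoreTableMetadata(table: RawTable): RawTable =
    table.properties.get(DATASOURCE_PROVIDER) match {
      case None if table.tableType == VIEW =>
        table                                        // views historically keep provider = None
      case None =>
        table.copy(provider = Some(HIVE_PROVIDER))   // Hive serde table
      case Some(provider) =>
        table.copy(provider = Some(provider))        // data source table
    }
}
```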
    // If it's a data source table, make sure the original schema is left unchanged; the
    // actual schema is recorded as a table property.
    val tableToStore = if (DDLUtils.isDatasourceTable(updatedTable)) {
      updatedTable.copy(schema = rawTable.schema)
We do support ALTER TABLE ADD COLUMN, which relies on alterTableSchema. Data source tables can be read by Hive when possible. Thus, I think we should not leave the schema unchanged.
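As a concrete example of the path being discussed, this is roughly how ALTER TABLE ADD COLUMNS reaches alterTableSchema from user code. The table name is made up, and the snippet assumes a Spark build with Hive support.

```scala
import org.apache.spark.sql.SparkSession

// Hedged example: exercises the ALTER TABLE ADD COLUMNS path on a data source table.
object AddColumnExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("alter-ds-table-example")
      .enableHiveSupport()
      .getOrCreate()

    // A data source (parquet) table tracked in the Hive metastore.
    spark.sql("CREATE TABLE IF NOT EXISTS demo_ds (id INT, name STRING) USING parquet")

    // ADD COLUMNS goes through ExternalCatalog.alterTableSchema, the method this PR touches.
    spark.sql("ALTER TABLE demo_ds ADD COLUMNS (extra STRING)")

    spark.table("demo_ds").printSchema()
    spark.stop()
  }
}
```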
Hmm, I see that this will break DS tables created with newHiveCompatibleMetastoreTable instead of newSparkSQLSpecificMetastoreTable.
For the former, the only thing I can see that could be used to identify the case is the presence of serde properties in the table metadata. That could replace the DDLUtils.isDatasourceTable(updatedTable) check to see whether the schema needs to be updated.
For the latter case, I see that newSparkSQLSpecificMetastoreTable stores the partition schema as the table's schema (which sort of explains the weird exception handling I saw). So this code is only correct if the partition schema cannot change. Where is the partition schema for a DS table defined? Is that under the user's control (or the data source implementation's)? Because if it can change, you can run into pretty much the same issue.
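Going back to the former (Hive-compatible) case, here is a small self-contained sketch of the serde-based check suggested above. `StorageFormat` and `TableSketch` are toy stand-ins for `CatalogStorageFormat`/`CatalogTable`, and the property key is assumed.

```scala
// Hedged sketch: only DS tables stored in Spark-SQL-specific format (no serde recorded)
// would keep the metastore schema unchanged; Hive-compatible DS tables would not.
object CompatibilityCheckSketch {
  case class StorageFormat(serde: Option[String], properties: Map[String, String])
  case class TableSketch(storage: StorageFormat, properties: Map[String, String])

  val DATASOURCE_PROVIDER = "spark.sql.sources.provider" // assumed marker property

  def isDataSourceTable(t: TableSketch): Boolean =
    t.properties.contains(DATASOURCE_PROVIDER)

  def shouldKeepMetastoreSchema(t: TableSketch): Boolean =
    isDataSourceTable(t) && t.storage.serde.isEmpty
}
```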
- Use the same code to translate between Spark and Hive tables when creating or altering the table.
- Fix the test so that it doesn't try to create a new SparkSession, which conflicts with TestHiveSingleton.
- Use 2.1's EnvironmentContext to disable auto-updating of stats for DS tables (see the sketch below).
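On the last point, a minimal sketch of what passing such an EnvironmentContext could look like; the constant name and the way Spark's Hive shim wires this into alterTable are assumptions, not code from this PR.

```scala
import org.apache.hadoop.hive.metastore.api.EnvironmentContext

// Hedged sketch: builds an EnvironmentContext asking Hive 2.1+ not to auto-update
// statistics during an alter-table call.
object StatsContextSketch {
  def doNotUpdateStatsContext(): EnvironmentContext = {
    val context = new EnvironmentContext()
    // "DO_NOT_UPDATE_STATS" mirrors Hive's StatsSetupConst.DO_NOT_UPDATE_STATS (assumed).
    context.putToProperties("DO_NOT_UPDATE_STATS", "true")
    context
  }
}
```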
I reworked the patch to try to merge the "create table" and "alter table" paths, so they both do the translation the same way. There are still some test failures but I wanted to get this up here for you guys to take a look while I fix those.
Test build #80217 has finished for PR 18824 at commit
FYI, I'm rebuilding the environment where I found the bug, to see why the code was failing even with the exception handler. I'll update the bug if necessary.
I updated the bug; let me close this for now while I figure out why that exception is happening.