Conversation

@vanzin (Contributor) commented Aug 2, 2017

This change fixes two issues:

  • when loading table metadata from Hive, restore the "provider" field of
    CatalogTable so DS tables can be identified.
  • when altering a DS table in the Hive metastore, make sure to not alter
    the table's schema, since the DS table's schema is stored as a table
    property in those cases.

Also added a new unit test for this issue which fails without this change.
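
A minimal sketch of the two fixes described above, in Scala against the Spark-internal CatalogTable type. The helper names are hypothetical, and the property-key string stands in for HiveExternalCatalog.DATASOURCE_PROVIDER; this is illustrative of the idea, not the exact patch:

import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Fix 1 (sketch): when reading a table back from the metastore, recover the provider
// from the table properties so data source (DS) tables can be identified.
def restoreProvider(table: CatalogTable): CatalogTable = {
  val provider = table.properties.get("spark.sql.sources.provider")  // DATASOURCE_PROVIDER
    .orElse(Some("hive"))                                            // DDLUtils.HIVE_PROVIDER
  table.copy(provider = provider)
}

// Fix 2 (sketch): when altering a DS table, keep whatever schema the metastore already
// has for it; the real schema is stored as a table property, not as the Hive schema.
def keepRawSchemaForDsTable(rawTable: CatalogTable, updatedTable: CatalogTable): CatalogTable = {
  if (updatedTable.provider.exists(_ != "hive")) {
    updatedTable.copy(schema = rawTable.schema)
  } else {
    updatedTable
  }
}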

"compatible way. Updating Hive metastore in Spark SQL specific format."
logWarning(warningMessage, e)
client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema))
client.alterTable(updatedTable.copy(schema = tableToStore.partitionSchema))
@vanzin (Contributor, Author):

This is the exception handling code I mentioned in the bug report which seems very suspicious. I had half a desire to just remove it, but maybe someone can explain to me why this code makes sense.
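
For context, a rough reconstruction of the pattern in question, based only on the excerpt above (the helper signature, names, and the caught exception type are assumptions): if writing the table with the full schema fails, the code logs a warning and retries with only the partition schema, i.e. the Spark-SQL-specific layout.

import scala.util.control.NonFatal
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Hypothetical shape of the fallback: try the full alter, and on failure retry with
// only the partition schema stored as the Hive-visible schema.
def alterWithFallback(
    alter: CatalogTable => Unit,
    updatedTable: CatalogTable,
    warn: (String, Throwable) => Unit): Unit = {
  try {
    alter(updatedTable)
  } catch {
    case NonFatal(e) =>
      warn("... Updating Hive metastore in Spark SQL specific format.", e)
      alter(updatedTable.copy(schema = updatedTable.partitionSchema))
  }
}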

Member:

I think this part is directly related to the logic which converts the table metadata to Spark SQL specific format:

def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
  table.copy(
    // Hive only allows directory paths as location URIs while Spark SQL data source tables
    // also allow file paths. For non-hive-compatible format, we should not set location URI
    // to avoid hive metastore to throw exception.
    storage = table.storage.copy(
      locationUri = None,
      properties = storagePropsWithLocation),
    schema = table.partitionSchema,
    bucketSpec = None,
    properties = table.properties ++ tableProperties)
}
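
For context on why only the partition schema is written as the Hive-visible schema: the full schema is serialized into table properties and restored from them on read. A rough, self-contained sketch of that encoding, assuming the spark.sql.sources.schema.* property keys used by HiveExternalCatalog (worth double-checking against the Spark version in use; the chunk size here is illustrative):

import org.apache.spark.sql.types.StructType

// Rough sketch: serialize the full schema as JSON and split it into chunks so no single
// table property value exceeds what the Hive metastore is willing to store.
def schemaToTableProperties(schema: StructType, chunkSize: Int = 4000): Map[String, String] = {
  val parts = schema.json.grouped(chunkSize).toSeq
  Map("spark.sql.sources.schema.numParts" -> parts.size.toString) ++
    parts.zipWithIndex.map { case (part, i) =>
      s"spark.sql.sources.schema.part.$i" -> part
    }
}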

@vanzin (Contributor, Author) commented Aug 3, 2017

Hmm, after I made some changes to the test, the whole test suite is failing (although running tests individually works). I'll work on that, but the fix itself, other than the test, should be correct.

@SparkQA commented Aug 3, 2017

Test build #80182 has finished for PR 18824 at commit aae3abd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val properties = Option(h.getParameters).map(_.asScala.toMap).getOrElse(Map())

val provider = properties.get(HiveExternalCatalog.DATASOURCE_PROVIDER)
  .orElse(Some(DDLUtils.HIVE_PROVIDER))
Member:

Previously we didn't store the provider for Hive serde tables. Some existing logic that decides whether a table retrieved from the metastore is a data source table may be broken by this change.

@viirya (Member), Aug 3, 2017:

Oh, never mind. Looks like we access the key DATASOURCE_PROVIDER in table.properties for that purpose, so this should be safe. Anyway, we will actually set the provider on CatalogTable later, when restoring the table read from the metastore, so maybe this is redundant.

Another concern is that we previously didn't restore the provider for a view; please refer to

. With this change, we will set the provider to HIVE_PROVIDER for views too.
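
To make the check being discussed concrete, a minimal sketch of how a DS table can be identified before and after the provider is restored (hypothetical helper names; the real logic lives in DDLUtils.isDatasourceTable and in the properties written at create time):

import java.util.Locale
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// After restore: a table is a data source table when its provider is set and not "hive".
def isDatasourceTable(table: CatalogTable): Boolean =
  table.provider.exists(_.toLowerCase(Locale.ROOT) != "hive")

// Before restore: only the raw table properties written at create time carry the provider.
def rawTableHasProvider(rawTable: CatalogTable): Boolean =
  rawTable.properties.contains("spark.sql.sources.provider")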

@vanzin (Contributor, Author), Aug 3, 2017:

Maybe this is redundant.

This was definitely not redundant in my testing. The metadata loaded from the metastore in HiveExternalCatalog.alterTableSchema definitely did not have the provider set when I debugged this. In fact, the test I wrote fails if I remove this code (or comment out the line that sets "provider" a few lines below).

Perhaps some other part of the code sets it in a different code path, but this would make that part of the code redundant, not the other way around.

@vanzin (Contributor, Author):

The restoring you mention is done in HiveExternalCatalog.restoreTableMetadata. Let me see if I can use that instead of making this change.
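
A rough sketch of what that alternative could look like, with the restore step passed in as a function so the example stays self-contained (in Spark itself that step would be HiveExternalCatalog.restoreTableMetadata; everything else here is hypothetical):

import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.types.StructType

// Restore the metadata first so the provider is populated, then decide how the new
// schema should be written back: for DS tables the Hive-visible schema stays as-is.
def schemaForMetastore(
    rawTable: CatalogTable,
    newSchema: StructType,
    restore: CatalogTable => CatalogTable): StructType = {
  val restored = restore(rawTable)
  if (restored.provider.exists(_ != "hive")) rawTable.schema else newSchema
}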

// If it's a data source table, make sure the original schema is left unchanged; the
// actual schema is recorded as a table property.
val tableToStore = if (DDLUtils.isDatasourceTable(updatedTable)) {
  updatedTable.copy(schema = rawTable.schema)
Member:

We do support ALTER TABLE ADD COLUMN, which relies on alterTableSchema. Data source tables can be read by Hive when possible. Thus, I think we should not leave the schema unchanged.
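
For reference, a small repro of the path being discussed, assuming a Hive-enabled SparkSession (the table and column names are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("alter-table-add-columns-repro")
  .enableHiveSupport()
  .getOrCreate()

// ALTER TABLE ... ADD COLUMNS on a data source table goes through alterTableSchema
// in the external catalog, which is the code path this discussion is about.
spark.sql("CREATE TABLE t (a INT) USING parquet")
spark.sql("ALTER TABLE t ADD COLUMNS (b STRING)")
spark.table("t").printSchema()  // the new column should be visible after the alter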

Member:

I just checked the JIRA description. This sounds like a bug we need to resolve. It is a little bit complex to fix; we need to follow what we did for create table. cc @xwu0226, please help @vanzin address this issue.

@vanzin (Contributor, Author):

Hmm, I see that this will break DS tables created with newHiveCompatibleMetastoreTable instead of newSparkSQLSpecificMetastoreTable.

For the former, the only thing I can see that could be used to identify the case is the presence of serde properties in the table metadata. That could replace the DDLUtils.isDatasourceTable(updatedTable) check to see whether the schema needs to be updated.

For the latter case, I see that newSparkSQLSpecificMetastoreTable stores the partition schema as the table's schema (which sort of explains the weird exception handling I saw). So this code is only correct if the partition schema cannot change. Where is the partition schema for a DS table defined? Is it under the control of the user (or the data source implementation)? Because if it can change, you can run into pretty much the same issue.
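
A minimal sketch of the serde-based check suggested above (hypothetical; whether serde presence reliably separates the Hive-compatible layout from the Spark-SQL-specific one is exactly the open question here):

import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Hypothetical: tables written via newHiveCompatibleMetastoreTable keep their serde and
// real schema in the metastore, so only the Spark-SQL-specific layout needs the
// "leave the stored schema alone" treatment.
def schemaToWriteBack(rawTable: CatalogTable, updatedTable: CatalogTable) = {
  val hiveCompatibleLayout = rawTable.storage.serde.isDefined
  if (hiveCompatibleLayout) updatedTable.schema else rawTable.schema
}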

- Use the same code to translate between Spark and Hive tables when
  creating or altering the table.
- Fix the test so that it doesn't try to create a new SparkSession,
  which conflicts with TestHiveSingleton.
- Use 2.1's EnvironmentContext to disable auto updating of stats
  for DS tables.
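
Regarding the last bullet above: a rough sketch of the Hive 2.1 mechanism it refers to, where an EnvironmentContext property asks the metastore not to recompute basic stats during an alter call (the constant normally comes from Hive's StatsSetupConst; the string is inlined here and the commented call is illustrative):

import org.apache.hadoop.hive.metastore.api.EnvironmentContext

val context = new EnvironmentContext()
context.putToProperties("DO_NOT_UPDATE_STATS", "true")
// hive.alterTable(dbName, tableName, hiveTable, context)  // passed via the metastore client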
@vanzin (Contributor, Author) commented Aug 3, 2017

I reworked the patch to try to merge the "create table" and "alter table" paths, so they both do the translation the same way.

There are still some test failures, but I wanted to get this up here for you guys to take a look while I fix those.
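
Conceptually (this is not the actual patch), merging the paths as described above means both funnel through one translation step, roughly:

import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Conceptual sketch: one Spark-to-Hive translation function, reused by both paths, so DS
// tables are always encoded for the metastore the same way. The encode function stands
// in for the newHiveCompatibleMetastoreTable / newSparkSQLSpecificMetastoreTable logic.
def translateForMetastore(
    table: CatalogTable,
    encodeDataSourceTable: CatalogTable => CatalogTable): CatalogTable = {
  if (table.provider.exists(_ != "hive")) encodeDataSourceTable(table) else table
}

// create path (sketch): client.createTable(translateForMetastore(table, encode), ignoreIfExists)
// alter path (sketch):  client.alterTable(translateForMetastore(updatedTable, encode))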

@SparkQA commented Aug 3, 2017

Test build #80217 has finished for PR 18824 at commit cc7cd95.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Aug 3, 2017

FYI, I'm rebuilding the environment where I found the bug, to see why the code was failing even with the exception handler. I'll update the bug if necessary.

@vanzin (Contributor, Author) commented Aug 3, 2017

I updated the bug; let me close this for now while I figure out why that exception is happening.
