[SPARK-21617][SQL] Store correct metadata in Hive for altered DS table. #18824
Conversation
This change fixes two issues:

- When loading table metadata from Hive, restore the "provider" field of CatalogTable so DS tables can be identified.
- When altering a DS table in the Hive metastore, make sure not to alter the table's schema, since the DS table's schema is stored as a table property in those cases (see the sketch below).

Also added a new unit test for this issue, which fails without this change.
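To make the second point concrete, here is a minimal, self-contained sketch of the intended behavior. It is not the Spark code itself: `SimpleTable`, `SCHEMA_PROP`, and `alterSchema` are made-up names standing in for `CatalogTable` and the logic in `HiveExternalCatalog.alterTableSchema`.

```scala
// Hedged sketch only: a toy model of "keep the metastore schema untouched for DS tables
// and refresh the schema stored in table properties instead".
object AlterSchemaSketch {
  // Simplified stand-in for CatalogTable: a column list plus a property map.
  case class SimpleTable(schema: Seq[(String, String)], properties: Map[String, String])

  val PROVIDER_PROP = "spark.sql.sources.provider" // assumed marker property for DS tables
  val SCHEMA_PROP   = "spark.sql.sources.schema"   // assumed location of the real schema

  def isDataSourceTable(t: SimpleTable): Boolean =
    t.properties.get(PROVIDER_PROP).exists(_ != "hive")

  def alterSchema(raw: SimpleTable, newSchema: Seq[(String, String)]): SimpleTable = {
    val encoded = newSchema.map { case (name, dt) => s"$name:$dt" }.mkString(",")
    if (isDataSourceTable(raw)) {
      // DS table: leave the schema recorded in the metastore alone, update the property.
      raw.copy(properties = raw.properties + (SCHEMA_PROP -> encoded))
    } else {
      // Hive serde table: the metastore schema is the real schema, so update it directly.
      raw.copy(schema = newSchema)
    }
  }
}
```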
| "compatible way. Updating Hive metastore in Spark SQL specific format." | ||
| logWarning(warningMessage, e) | ||
| client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema)) | ||
| client.alterTable(updatedTable.copy(schema = tableToStore.partitionSchema)) |
This is the exception handling code I mentioned in the bug report which seems very suspicious. I had half a desire to just remove it, but maybe someone can explain to me why this code makes sense.
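For context, here is a self-contained sketch of the fallback pattern this handler appears to implement: try the Hive-compatible alter first, then retry with only the partition columns as the metastore schema. `Table`, `FakeClient`, and `alterWithFallback` are invented for illustration and are not the actual Spark classes.

```scala
import scala.util.control.NonFatal

// Hedged sketch: only the try/catch shape mirrors the handler shown in the diff above.
object FallbackSketch {
  case class Table(schema: Seq[String], partitionSchema: Seq[String])

  // Pretend client that rejects schemas the metastore cannot represent.
  class FakeClient(hiveCompatible: Boolean) {
    def alterTable(t: Table): Unit =
      if (!hiveCompatible && t.schema != t.partitionSchema)
        throw new IllegalArgumentException("unsupported schema")
  }

  def alterWithFallback(client: FakeClient, table: Table): Unit =
    try {
      client.alterTable(table)
    } catch {
      case NonFatal(e) =>
        println("Could not alter the table in a Hive compatible way; " +
          s"updating Hive metastore in Spark SQL specific format (${e.getMessage})")
        // Fall back to storing only the partition columns as the metastore schema.
        client.alterTable(table.copy(schema = table.partitionSchema))
    }
}
```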
I think this part is directly related to the logic which converts the table metadata to Spark SQL specific format:
spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
Lines 290 to 301 in f18b905
    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
      table.copy(
        // Hive only allows directory paths as location URIs while Spark SQL data source tables
        // also allow file paths. For non-hive-compatible format, we should not set location URI
        // to avoid hive metastore to throw exception.
        storage = table.storage.copy(
          locationUri = None,
          properties = storagePropsWithLocation),
        schema = table.partitionSchema,
        bucketSpec = None,
        properties = table.properties ++ tableProperties)
    }
Hmm, after I made some changes to the test, the whole test suite is failing (although running the tests individually works). I'll work on that, but the fix itself, other than the test, should be correct.
Test build #80182 has finished for PR 18824 at commit
    val properties = Option(h.getParameters).map(_.asScala.toMap).getOrElse(Map())

    val provider = properties.get(HiveExternalCatalog.DATASOURCE_PROVIDER)
      .orElse(Some(DDLUtils.HIVE_PROVIDER))
Previously we didn't store the provider for Hive serde tables. Some existing logic that decides whether a table retrieved from the metastore is a data source table may be broken by this change.
Oh, nvm. Looks like we access the key DATASOURCE_PROVIDER in table.properties for that purpose, so this should be safe. Anyway, we actually set the provider for CatalogTable later, when restoring the table read from the metastore. Maybe this is redundant.
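For illustration, a small self-contained sketch of the two ways a raw table read back from the metastore could be recognized as a data source table, as discussed above. `RawTable` is a toy stand-in for `CatalogTable`, and the property-key string is assumed here rather than taken from this PR.

```scala
// Hedged sketch: contrasts a check on the raw table properties with a check on the
// restored provider field.
object ProviderCheckSketch {
  case class RawTable(provider: Option[String], properties: Map[String, String])

  val DATASOURCE_PROVIDER = "spark.sql.sources.provider" // assumed value of the constant
  val HIVE_PROVIDER = "hive"

  // Works on the raw metastore table, before restoreTableMetadata has run.
  def isDataSourceTableByProps(t: RawTable): Boolean =
    t.properties.contains(DATASOURCE_PROVIDER)

  // Only works after the provider field has actually been restored/set.
  def isDataSourceTableByProvider(t: RawTable): Boolean =
    t.provider.exists(_ != HIVE_PROVIDER)
}
```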
Another concern is that we previously didn't restore the provider for a view; please refer to
spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
Line 682 in f18b905
    case None if table.tableType == VIEW =>
With this change, the provider would be set to HIVE_PROVIDER for views too.
> Maybe this is redundant.
This was definitely not redundant in my testing. The metadata loaded from the metastore in HiveExternalCatalog.alterTableSchema definitely did not have the provider set when I debugged this. In fact, the test I wrote fails if I remove this code (or comment out the line that sets "provider" a few lines below).
Perhaps some other part of the code sets it in a different code path, but this would make that part of the code redundant, not the other way around.
The restoring you mention is done in HiveExternalCatalog.restoreTableMetadata. Let me see if I can use that instead of making this change.
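For readers following along, a rough, self-contained sketch of the kind of dispatch restoreTableMetadata does on that property, anchored on the `case None if table.tableType == VIEW =>` line referenced above. Apart from that pattern, the types and behavior here are simplified stand-ins, not the real Spark internals.

```scala
// Hedged sketch of the restore-time dispatch on the DATASOURCE_PROVIDER property.
object RestoreSketch {
  sealed trait TableType
  case object VIEW extends TableType
  case object MANAGED extends TableType

  case class RawTable(
      tableType: TableType,
      properties: Map[String, String],
      provider: Option[String])

  val DATASOURCE_PROVIDER = "spark.sql.sources.provider" // assumed value of the constant
  val HIVE_PROVIDER = "hive"

  def restoreTableMetadata(table: RawTable): RawTable =
    table.properties.get(DATASOURCE_PROVIDER) match {
      case None if table.tableType == VIEW =>
        table                                        // views historically keep provider = None
      case None =>
        table.copy(provider = Some(HIVE_PROVIDER))   // Hive serde table
      case Some(provider) =>
        table.copy(provider = Some(provider))        // data source table
    }
}
```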
    // If it's a data source table, make sure the original schema is left unchanged; the
    // actual schema is recorded as a table property.
    val tableToStore = if (DDLUtils.isDatasourceTable(updatedTable)) {
      updatedTable.copy(schema = rawTable.schema)
We do support ALTER TABLE ADD COLUMN, which relies on alterTableSchema. Data source tables can be read by Hive when possible. Thus, I think we should not leave the schema unchanged.
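As a concrete example of the path being discussed, this is roughly how ALTER TABLE ADD COLUMNS reaches alterTableSchema from user code. The table name is made up, and the snippet assumes a Spark build with Hive support.

```scala
import org.apache.spark.sql.SparkSession

// Hedged example: exercises the ALTER TABLE ADD COLUMNS path on a data source table.
object AddColumnExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("alter-ds-table-example")
      .enableHiveSupport()
      .getOrCreate()

    // A data source (parquet) table tracked in the Hive metastore.
    spark.sql("CREATE TABLE IF NOT EXISTS demo_ds (id INT, name STRING) USING parquet")

    // ADD COLUMNS goes through ExternalCatalog.alterTableSchema, the method this PR touches.
    spark.sql("ALTER TABLE demo_ds ADD COLUMNS (extra STRING)")

    spark.table("demo_ds").printSchema()
    spark.stop()
  }
}
```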
Hmm, I see that this will break DS tables created with newHiveCompatibleMetastoreTable instead of newSparkSQLSpecificMetastoreTable.
For the former, the only thing I can see that could be used to identify the case is the presence of serde properties in the table metadata. That could replace the DDLUtils.isDatasourceTable(updatedTable) check to see whether the schema needs to be updated.
For the latter case, I see that newSparkSQLSpecificMetastoreTable stores the partition schema as the table's schema (which sort of explains the weird exception handling I saw). So this code is only correct if the partition schema cannot change. Where is the partition schema for a DS table defined? Is that under the user's control (or the data source implementation's)? Because if it can change, you can run into pretty much the same issue.
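Going back to the former (Hive-compatible) case, here is a small self-contained sketch of the serde-based check suggested above. `StorageFormat` and `TableSketch` are toy stand-ins for `CatalogStorageFormat`/`CatalogTable`, and the property key is assumed.

```scala
// Hedged sketch: only DS tables stored in Spark-SQL-specific format (no serde recorded)
// would keep the metastore schema unchanged; Hive-compatible DS tables would not.
object CompatibilityCheckSketch {
  case class StorageFormat(serde: Option[String], properties: Map[String, String])
  case class TableSketch(storage: StorageFormat, properties: Map[String, String])

  val DATASOURCE_PROVIDER = "spark.sql.sources.provider" // assumed marker property

  def isDataSourceTable(t: TableSketch): Boolean =
    t.properties.contains(DATASOURCE_PROVIDER)

  def shouldKeepMetastoreSchema(t: TableSketch): Boolean =
    isDataSourceTable(t) && t.storage.serde.isEmpty
}
```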
- Use the same code to translate between Spark and Hive tables when creating or altering the table.
- Fix the test so that it doesn't try to create a new SparkSession, which conflicts with TestHiveSingleton.
- Use 2.1's EnvironmentContext to disable auto-updating of stats for DS tables (see the sketch below).
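On the last point, a minimal sketch of what passing such an EnvironmentContext could look like; the constant name and the way Spark's Hive shim wires this into alterTable are assumptions, not code from this PR.

```scala
import org.apache.hadoop.hive.metastore.api.EnvironmentContext

// Hedged sketch: builds an EnvironmentContext asking Hive 2.1+ not to auto-update
// statistics during an alter-table call.
object StatsContextSketch {
  def doNotUpdateStatsContext(): EnvironmentContext = {
    val context = new EnvironmentContext()
    // "DO_NOT_UPDATE_STATS" mirrors Hive's StatsSetupConst.DO_NOT_UPDATE_STATS (assumed).
    context.putToProperties("DO_NOT_UPDATE_STATS", "true")
    context
  }
}
```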
I reworked the patch to try to merge the "create table" and "alter table" paths, so they both do the translation the same way. There are still some test failures but I wanted to get this up here for you guys to take a look while I fix those.
Test build #80217 has finished for PR 18824 at commit
FYI, I'm rebuilding the environment where I found the bug, to see why the code was failing even with the exception handler. I'll update the bug if necessary.
I updated the bug; let me close this for now while I figure out why that exception is happening.