-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-17470][SQL] unify path for data source table and locationUri for hive serde table #15024
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I like this change! For Hive Managed table, |
|
Test build #65147 has finished for PR 15024 at commit
|
|
Test build #65148 has finished for PR 15024 at commit
|
|
@gatorsmile yea that's a good point. I have updated my PR to always set location for file-based data source table |
|
Test build #65258 has finished for PR 15024 at commit
|
|
Test build #65260 has finished for PR 15024 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For managed tables, do we need to set table.properties.get(DATASOURCE_LOCATION) to locationUri? Previously, Hive does it for us. Now, we explicitly remove it at multiple places. For example, alterTable removes it for both external tables and managed tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean? HiveExternalCatlaog.alterTable always add this location property.
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we unify the if and else branch? Should we also check whether defaultTableLocation directory path is created if tableDefinition.storage.locationUri is set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require(defaultTableLocation.toUri.toString == givenTableLocation) doesn't give a clear user-facing message. Should we replace it with an explicit Exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not user facing, if we hit this, we have a bug
|
Test build #65479 has finished for PR 15024 at commit
|
82c67cf to
b09d8bb
Compare
|
Test build #65485 has finished for PR 15024 at commit
|
|
Test build #67573 has finished for PR 15024 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI, we do not have any test case to cover this. ALTER TABLE SET LOCATION for data source tables is only tested in DDLSuite (when Hive support is not enabled). That means, these code changes are not tested. Actually, this is not the only one. : (
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we make HiveDDLSuite extends DDLSuite? then all these tests will be run with hive support automatically.
|
I have updated this PR according to these rules:
|
|
Test build #67708 has finished for PR 15024 at commit
|
| val STATISTICS_COL_STATS_PREFIX = STATISTICS_PREFIX + "colStats." | ||
|
|
||
| val TABLE_PARTITION_PROVIDER = SPARK_SQL_PREFIX + "partitionProvider" | ||
| val TABLE_LOCATION = SPARK_SQL_PREFIX + "tableLocation" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use DATASOURCE_PREFIX here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong preference.
| // may be not Hive-compatible and can not set the `locationUri` field. We should respect the | ||
| // old `locationUri` even it's None. | ||
| val oldLocation = getLocationFromRawTable(oldTableDef) | ||
| val locationUri = if (oldLocation == tableDefinition.storage.locationUri) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we using in-memory catalog, the location is changed for renameTable on managed data source tables.
It sounds like we are not having the same behavior when using hive external catalog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The locationUri of raw table is not the final locationUri returned by HiveExternalCatalog, as we have restoreTableMetadata, so the behaviour is consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test("alter table - rename") {
val tabName = "tab1"
val newTabName = "tab2"
withTable(tabName, newTabName) {
spark.range(10).write.saveAsTable(tabName)
val catalog = spark.sessionState.catalog
sql(s"ALTER TABLE $tabName RENAME TO $newTabName")
sql(s"DESC FORMATTED $newTabName").show(100, false)
assert(!catalog.tableExists(TableIdentifier(tabName)))
assert(catalog.tableExists(TableIdentifier(newTabName)))
}
}You can try to run the above test case in DDLSuite.scala and HiveDDLSuite.scala. The locations are different. One is using the new table name; another is using the old one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! I have fixed it in HiveExternalCatalog.renameTable
How many places where we need to do this? |
|
@yhuai 3 places. It's easy to track it via code diff, as if we miss it somewhere, the test will fail because no path is specified.(data source tables never pass the BTW, I also added a new test suite to guarantee the semantic of the path option. |
|
Test build #67880 has finished for PR 15024 at commit
|
|
Test build #67881 has finished for PR 15024 at commit
|
| // 1. Put table provider, schema, partition column names, bucket specification and partition | ||
| // provider in table properties. | ||
| // 1. Put table provider, location URI, schema, partition column names, bucket specification | ||
| // and partition provider in table properties. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we create a separate function for this step? The related codes become longer and longer.
Could we change the description order to match the order of codes?
Put table provider, partition provider, location URI, schema, partition column names, bucket specification in table properties.
| // to create the table directory and write out data before we create this table, to avoid | ||
| // exposing a partial written table. | ||
| val needDefaultTableLocation = | ||
| tableDefinition.tableType == CatalogTableType.MANAGED && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation
| } | ||
|
|
||
| val optionsWithPath = if (table.tableType == CatalogTableType.MANAGED) { | ||
| table.storage.properties + ("path" -> sessionState.catalog.defaultTablePath(table.identifier)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where do we assign the default location?
| tableType = CatalogTableType.MANAGED, | ||
| storage = newStorage, | ||
| // We are creating a new managed table, which should not have custom table location. | ||
| storage = sourceTableDesc.storage.copy(locationUri = None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will we set the location? Is it set by hive metastore?
| "org.apache.spark.Logging") | ||
|
|
||
| /** Given a provider name, look up the data source class definition. */ | ||
| def lookupDataSource(provider0: String): Class[_] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just provider?
| override def createRelation( | ||
| sqlContext: SQLContext, | ||
| parameters: Map[String, String]): BaseRelation = { | ||
| new TestOptionsRelation(parameters)(sqlContext.sparkSession) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add comment to explain which tests exercise this method?
| tableProperties.put(TABLE_PARTITION_PROVIDER, "hive") | ||
| } | ||
| tableDefinition.storage.locationUri.foreach { location => | ||
| tableProperties.put(TABLE_LOCATION, location) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this? Why not just use path in serde properties?
| require(tableDefinition.storage.locationUri.isDefined, | ||
| "External file-based data source table must have a `path` entry in storage properties.") | ||
| Some(new Path(map("path")).toUri.toString) | ||
| Some(new Path(tableDefinition.storage.locationUri.get).toUri.toString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part looks weird since we already have a location uri.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we rename locationUri to something else? this can be any string given by users in path option, and may not be a uri
| val newTable = rawTable.copy( | ||
| identifier = TableIdentifier(newName, Some(db)), | ||
| properties = tableProps) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if I am following at here. So, after rename, we will not have a table property representing the location?
| private def getLocationFromRawTable(rawTable: CatalogTable): Option[String] = { | ||
| rawTable.properties.get(TABLE_LOCATION).orElse { | ||
| // In older version of spark, we store the table location in storage properties with key | ||
| // `path`, instead of table properties with key `spark.sql.tableLocation`. We should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need spark.sql.tableLocation instead of just relying on hive's location field and path in serde properties?
|
|
||
| withTable("src", "src2") { | ||
| sql(s"CREATE TABLE src(i int) USING ${classOf[TestOptionsSource].getCanonicalName}") | ||
| sql("ALTER TABLE src RENAME TO src2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case is still calling the InMemoryCatalog.renameTable. Thus, we still need a test case to verify the behavior of HiveExternalCatalog.renameTable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nothing about ExternalCatalog, as ExternalCatalog doesn't need to know about the path option. We generate the path option outside of ExternalCatalog, and we only need ExternalCatalog to put table location in the locationUri field correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. But we still need a test case to verify the code path and the behavior you mentioned above. So far, it sounds like we do not have any end-to-end test case for RENAME TABLE using HiveExternalCatalog. I manually verified it and it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, we should create a JIRA for improving the DDL command test coverage.
|
Test build #67989 has finished for PR 15024 at commit
|
| } | ||
| } | ||
| val needDefaultTableLocation = tableDefinition.tableType == MANAGED && | ||
| tableDefinition.storage.locationUri.isEmpty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain when the table is managed but the location uri is not empty.
| val storageWithNewLocation = if (oldLocation == newLocation) { | ||
| oldTableDef.storage | ||
| } else { | ||
| updateLocationInStorageProps(oldTableDef, newLocation).copy(locationUri = newLocation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also explain this copy.
| // path option in storage properties, to avoid exposing this concept externally. | ||
| val storageWithLocation = { | ||
| val tableLocation = getLocationFromStorageProps(table) | ||
| updateLocationInStorageProps(table, None).copy(locationUri = tableLocation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain the reason that we are passing None.
| userSpecifiedLocation match { | ||
| case Some(location) => | ||
| assert(r.options("path") === location) | ||
| assert(table.storage.locationUri.get === location) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems we can revert this changes.
|
LGTM. Merging to master and branch 2.1. |
…or hive serde table ## What changes were proposed in this pull request? Due to a limitation of hive metastore(table location must be directory path, not file path), we always store `path` for data source table in storage properties, instead of the `locationUri` field. However, we should not expose this difference to `CatalogTable` level, but just treat it as a hack in `HiveExternalCatalog`, like we store table schema of data source table in table properties. This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`, both data source table and hive serde table should use the `locationUri` field. This PR also unifies the way we handle default table location for managed table. Previously, the default table location of hive serde managed table is set by external catalog, but the one of data source table is set by command. After this PR, we follow the hive way and the default table location is always set by external catalog. For managed non-file-based tables, we will assign a default table location and create an empty directory for it, the table location will be removed when the table is dropped. This is reasonable as metastore doesn't care about whether a table is file-based or not, and an empty table directory has no harm. For external non-file-based tables, ideally we can omit the table location, but due to a hive metastore issue, we will assign a random location to it, and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`. To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using key `path`, before passing storage properties to `DataSource` as data source options. ## How was this patch tested? existing tests. Author: Wenchen Fan <[email protected]> Closes #15024 from cloud-fan/path. (cherry picked from commit 3a1bc6f) Signed-off-by: Yin Huai <[email protected]>
|
OK. Let's get #14750 updated to fix SPARK-17183. |
…or hive serde table ## What changes were proposed in this pull request? Due to a limitation of hive metastore(table location must be directory path, not file path), we always store `path` for data source table in storage properties, instead of the `locationUri` field. However, we should not expose this difference to `CatalogTable` level, but just treat it as a hack in `HiveExternalCatalog`, like we store table schema of data source table in table properties. This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`, both data source table and hive serde table should use the `locationUri` field. This PR also unifies the way we handle default table location for managed table. Previously, the default table location of hive serde managed table is set by external catalog, but the one of data source table is set by command. After this PR, we follow the hive way and the default table location is always set by external catalog. For managed non-file-based tables, we will assign a default table location and create an empty directory for it, the table location will be removed when the table is dropped. This is reasonable as metastore doesn't care about whether a table is file-based or not, and an empty table directory has no harm. For external non-file-based tables, ideally we can omit the table location, but due to a hive metastore issue, we will assign a random location to it, and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`. To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using key `path`, before passing storage properties to `DataSource` as data source options. ## How was this patch tested? existing tests. Author: Wenchen Fan <[email protected]> Closes apache#15024 from cloud-fan/path.
What changes were proposed in this pull request?
Due to a limitation of hive metastore(table location must be directory path, not file path), we always store
pathfor data source table in storage properties, instead of thelocationUrifield. However, we should not expose this difference toCatalogTablelevel, but just treat it as a hack inHiveExternalCatalog, like we store table schema of data source table in table properties.This PR unifies
pathandlocationUrioutside ofHiveExternalCatalog, both data source table and hive serde table should use thelocationUrifield.This PR also unifies the way we handle default table location for managed table. Previously, the default table location of hive serde managed table is set by external catalog, but the one of data source table is set by command. After this PR, we follow the hive way and the default table location is always set by external catalog.
For managed non-file-based tables, we will assign a default table location and create an empty directory for it, the table location will be removed when the table is dropped. This is reasonable as metastore doesn't care about whether a table is file-based or not, and an empty table directory has no harm.
For external non-file-based tables, ideally we can omit the table location, but due to a hive metastore issue, we will assign a random location to it, and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in
HiveExternalCatalog.To keep the existing behaviour of the
pathoption, in this PR we always add thelocationUrito storage properties using keypath, before passing storage properties toDataSourceas data source options.How was this patch tested?
existing tests.