Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Sep 9, 2016

What changes were proposed in this pull request?

Due to a limitation of hive metastore(table location must be directory path, not file path), we always store path for data source table in storage properties, instead of the locationUri field. However, we should not expose this difference to CatalogTable level, but just treat it as a hack in HiveExternalCatalog, like we store table schema of data source table in table properties.

This PR unifies path and locationUri outside of HiveExternalCatalog, both data source table and hive serde table should use the locationUri field.

This PR also unifies the way we handle default table location for managed table. Previously, the default table location of hive serde managed table is set by external catalog, but the one of data source table is set by command. After this PR, we follow the hive way and the default table location is always set by external catalog.

For managed non-file-based tables, we will assign a default table location and create an empty directory for it, the table location will be removed when the table is dropped. This is reasonable as metastore doesn't care about whether a table is file-based or not, and an empty table directory has no harm.
For external non-file-based tables, ideally we can omit the table location, but due to a hive metastore issue, we will assign a random location to it, and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in HiveExternalCatalog.

To keep the existing behaviour of the path option, in this PR we always add the locationUri to storage properties using key path, before passing storage properties to DataSource as data source options.

How was this patch tested?

existing tests.

@cloud-fan
Copy link
Contributor Author

cc @yhuai @gatorsmile

@gatorsmile
Copy link
Member

I like this change! For Hive Managed table, locationUri should be filled; otherwise, we might be in trouble if Hive changes the default path to the table location when users do not provide the location.

@SparkQA
Copy link

SparkQA commented Sep 9, 2016

Test build #65147 has finished for PR 15024 at commit 978c94c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 9, 2016

Test build #65148 has finished for PR 15024 at commit 1b474e4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

@gatorsmile yea that's a good point. I have updated my PR to always set location for file-based data source table

@SparkQA
Copy link

SparkQA commented Sep 12, 2016

Test build #65258 has finished for PR 15024 at commit d24d6ed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 12, 2016

Test build #65260 has finished for PR 15024 at commit 35def5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For managed tables, do we need to set table.properties.get(DATASOURCE_LOCATION) to locationUri? Previously, Hive does it for us. Now, we explicitly remove it at multiple places. For example, alterTable removes it for both external tables and managed tables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean? HiveExternalCatlaog.alterTable always add this location property.

@gatorsmile
Copy link
Member

showDataSourceTableOptions in SHOW CREATE TABLE also needs an update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we unify the if and else branch? Should we also check whether defaultTableLocation directory path is created if tableDefinition.storage.locationUri is set?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

require(defaultTableLocation.toUri.toString == givenTableLocation) doesn't give a clear user-facing message. Should we replace it with an explicit Exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not user facing, if we hit this, we have a bug

@SparkQA
Copy link

SparkQA commented Sep 16, 2016

Test build #65479 has finished for PR 15024 at commit 8efc52f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan force-pushed the path branch 2 times, most recently from 82c67cf to b09d8bb Compare September 16, 2016 15:15
@SparkQA
Copy link

SparkQA commented Sep 16, 2016

Test build #65485 has finished for PR 15024 at commit bccd283.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 26, 2016

Test build #67573 has finished for PR 15024 at commit 9a09b18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, we do not have any test case to cover this. ALTER TABLE SET LOCATION for data source tables is only tested in DDLSuite (when Hive support is not enabled). That means, these code changes are not tested. Actually, this is not the only one. : (

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we make HiveDDLSuite extends DDLSuite? then all these tests will be run with hive support automatically.

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Oct 28, 2016

I have updated this PR according to these rules:

  1. outside of ExternalCatalog, data source and hive tables should both use locationUri to store table location.
  2. inside HiveExternalCatalog:
    • do nothing for hive tables, same as before.
    • for data source tables, always store locationUri in table properties(like provider, schema, etc.), and set locationUri to None if this table is not hive-compatible.
  3. before passing storage properties to DataSource as data source options, add locationUri as path option, to keep the previous behaviour, i.e. the path option always exists(even users didn't specify it when creating table), and always equal to table location.

cc @rxin @yhuai @ericl @gatorsmile

@SparkQA
Copy link

SparkQA commented Oct 28, 2016

Test build #67708 has finished for PR 15024 at commit 508648d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val STATISTICS_COL_STATS_PREFIX = STATISTICS_PREFIX + "colStats."

val TABLE_PARTITION_PROVIDER = SPARK_SQL_PREFIX + "partitionProvider"
val TABLE_LOCATION = SPARK_SQL_PREFIX + "tableLocation"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use DATASOURCE_PREFIX here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference.

// may be not Hive-compatible and can not set the `locationUri` field. We should respect the
// old `locationUri` even it's None.
val oldLocation = getLocationFromRawTable(oldTableDef)
val locationUri = if (oldLocation == tableDefinition.storage.locationUri) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we using in-memory catalog, the location is changed for renameTable on managed data source tables.

It sounds like we are not having the same behavior when using hive external catalog?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The locationUri of raw table is not the final locationUri returned by HiveExternalCatalog, as we have restoreTableMetadata, so the behaviour is consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  test("alter table - rename") {
    val tabName = "tab1"
    val newTabName = "tab2"
    withTable(tabName, newTabName) {
      spark.range(10).write.saveAsTable(tabName)
      val catalog = spark.sessionState.catalog
      sql(s"ALTER TABLE $tabName RENAME TO $newTabName")
      sql(s"DESC FORMATTED $newTabName").show(100, false)
      assert(!catalog.tableExists(TableIdentifier(tabName)))
      assert(catalog.tableExists(TableIdentifier(newTabName)))
    }
  }

You can try to run the above test case in DDLSuite.scala and HiveDDLSuite.scala. The locations are different. One is using the new table name; another is using the old one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! I have fixed it in HiveExternalCatalog.renameTable

@yhuai
Copy link
Contributor

yhuai commented Oct 31, 2016

  1. before passing storage properties to DataSource as data source options, add locationUri as path option, to keep the previous behaviour, i.e. the path option always exists(even users didn't specify it when creating table), and always equal to table location.

How many places where we need to do this?

@cloud-fan
Copy link
Contributor Author

@yhuai 3 places. It's easy to track it via code diff, as if we miss it somewhere, the test will fail because no path is specified.(data source tables never pass the paths parameter to DataSource).

BTW, I also added a new test suite to guarantee the semantic of the path option.

@SparkQA
Copy link

SparkQA commented Nov 1, 2016

Test build #67880 has finished for PR 15024 at commit af53eee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 1, 2016

Test build #67881 has finished for PR 15024 at commit 97f04cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// 1. Put table provider, schema, partition column names, bucket specification and partition
// provider in table properties.
// 1. Put table provider, location URI, schema, partition column names, bucket specification
// and partition provider in table properties.
Copy link
Member

@gatorsmile gatorsmile Nov 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we create a separate function for this step? The related codes become longer and longer.

Could we change the description order to match the order of codes?

Put table provider, partition provider, location URI, schema, partition column names, bucket specification in table properties.

// to create the table directory and write out data before we create this table, to avoid
// exposing a partial written table.
val needDefaultTableLocation =
tableDefinition.tableType == CatalogTableType.MANAGED &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

}

val optionsWithPath = if (table.tableType == CatalogTableType.MANAGED) {
table.storage.properties + ("path" -> sessionState.catalog.defaultTablePath(table.identifier))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where do we assign the default location?

tableType = CatalogTableType.MANAGED,
storage = newStorage,
// We are creating a new managed table, which should not have custom table location.
storage = sourceTableDesc.storage.copy(locationUri = None),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will we set the location? Is it set by hive metastore?

"org.apache.spark.Logging")

/** Given a provider name, look up the data source class definition. */
def lookupDataSource(provider0: String): Class[_] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just provider?

override def createRelation(
sqlContext: SQLContext,
parameters: Map[String, String]): BaseRelation = {
new TestOptionsRelation(parameters)(sqlContext.sparkSession)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add comment to explain which tests exercise this method?

tableProperties.put(TABLE_PARTITION_PROVIDER, "hive")
}
tableDefinition.storage.locationUri.foreach { location =>
tableProperties.put(TABLE_LOCATION, location)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this? Why not just use path in serde properties?

require(tableDefinition.storage.locationUri.isDefined,
"External file-based data source table must have a `path` entry in storage properties.")
Some(new Path(map("path")).toUri.toString)
Some(new Path(tableDefinition.storage.locationUri.get).toUri.toString)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part looks weird since we already have a location uri.

Copy link
Contributor Author

@cloud-fan cloud-fan Nov 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we rename locationUri to something else? this can be any string given by users in path option, and may not be a uri

val newTable = rawTable.copy(
identifier = TableIdentifier(newName, Some(db)),
properties = tableProps)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if I am following at here. So, after rename, we will not have a table property representing the location?

private def getLocationFromRawTable(rawTable: CatalogTable): Option[String] = {
rawTable.properties.get(TABLE_LOCATION).orElse {
// In older version of spark, we store the table location in storage properties with key
// `path`, instead of table properties with key `spark.sql.tableLocation`. We should
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need spark.sql.tableLocation instead of just relying on hive's location field and path in serde properties?


withTable("src", "src2") {
sql(s"CREATE TABLE src(i int) USING ${classOf[TestOptionsSource].getCanonicalName}")
sql("ALTER TABLE src RENAME TO src2")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case is still calling the InMemoryCatalog.renameTable. Thus, we still need a test case to verify the behavior of HiveExternalCatalog.renameTable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nothing about ExternalCatalog, as ExternalCatalog doesn't need to know about the path option. We generate the path option outside of ExternalCatalog, and we only need ExternalCatalog to put table location in the locationUri field correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. But we still need a test case to verify the code path and the behavior you mentioned above. So far, it sounds like we do not have any end-to-end test case for RENAME TABLE using HiveExternalCatalog. I manually verified it and it works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, we should create a JIRA for improving the DDL command test coverage.

@SparkQA
Copy link

SparkQA commented Nov 2, 2016

Test build #67989 has finished for PR 15024 at commit d1bcb49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}
val needDefaultTableLocation = tableDefinition.tableType == MANAGED &&
tableDefinition.storage.locationUri.isEmpty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain when the table is managed but the location uri is not empty.

val storageWithNewLocation = if (oldLocation == newLocation) {
oldTableDef.storage
} else {
updateLocationInStorageProps(oldTableDef, newLocation).copy(locationUri = newLocation)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also explain this copy.

// path option in storage properties, to avoid exposing this concept externally.
val storageWithLocation = {
val tableLocation = getLocationFromStorageProps(table)
updateLocationInStorageProps(table, None).copy(locationUri = tableLocation)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain the reason that we are passing None.

userSpecifiedLocation match {
case Some(location) =>
assert(r.options("path") === location)
assert(table.storage.locationUri.get === location)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we can revert this changes.

@yhuai
Copy link
Contributor

yhuai commented Nov 3, 2016

LGTM. Merging to master and branch 2.1.

asfgit pushed a commit that referenced this pull request Nov 3, 2016
…or hive serde table

## What changes were proposed in this pull request?

Due to a limitation of hive metastore(table location must be directory path, not file path), we always store `path` for data source table in storage properties, instead of the `locationUri` field. However, we should not expose this difference to `CatalogTable` level, but just treat it as a hack in `HiveExternalCatalog`, like we store table schema of data source table in table properties.

This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`, both data source table and hive serde table should use the `locationUri` field.

This PR also unifies the way we handle default table location for managed table. Previously, the default table location of hive serde managed table is set by external catalog, but the one of data source table is set by command. After this PR, we follow the hive way and the default table location is always set by external catalog.

For managed non-file-based tables, we will assign a default table location and create an empty directory for it, the table location will be removed when the table is dropped. This is reasonable as metastore doesn't care about whether a table is file-based or not, and an empty table directory has no harm.
For external non-file-based tables, ideally we can omit the table location, but due to a hive metastore issue, we will assign a random location to it, and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`.

To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using key `path`, before passing storage properties to `DataSource` as data source options.
## How was this patch tested?

existing tests.

Author: Wenchen Fan <[email protected]>

Closes #15024 from cloud-fan/path.

(cherry picked from commit 3a1bc6f)
Signed-off-by: Yin Huai <[email protected]>
@yhuai
Copy link
Contributor

yhuai commented Nov 3, 2016

OK. Let's get #14750 updated to fix SPARK-17183.

@asfgit asfgit closed this in 3a1bc6f Nov 3, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…or hive serde table

## What changes were proposed in this pull request?

Due to a limitation of hive metastore(table location must be directory path, not file path), we always store `path` for data source table in storage properties, instead of the `locationUri` field. However, we should not expose this difference to `CatalogTable` level, but just treat it as a hack in `HiveExternalCatalog`, like we store table schema of data source table in table properties.

This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`, both data source table and hive serde table should use the `locationUri` field.

This PR also unifies the way we handle default table location for managed table. Previously, the default table location of hive serde managed table is set by external catalog, but the one of data source table is set by command. After this PR, we follow the hive way and the default table location is always set by external catalog.

For managed non-file-based tables, we will assign a default table location and create an empty directory for it, the table location will be removed when the table is dropped. This is reasonable as metastore doesn't care about whether a table is file-based or not, and an empty table directory has no harm.
For external non-file-based tables, ideally we can omit the table location, but due to a hive metastore issue, we will assign a random location to it, and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`.

To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using key `path`, before passing storage properties to `DataSource` as data source options.
## How was this patch tested?

existing tests.

Author: Wenchen Fan <[email protected]>

Closes apache#15024 from cloud-fan/path.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants