
Conversation

@liancheng
Contributor

This PR is an alternative to #13120 authored by @xwu0226.

What changes were proposed in this pull request?

When creating an external Spark SQL data source table and persisting its metadata to the Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations, while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under the database location, but doesn't remove it when the table is dropped (because the table is external).

This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table.

Please refer to this JIRA comment for more details about why we chose this approach as a workaround.
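
For readers following along, below is a minimal sketch of the idea in Scala, assuming simplified names (the helper and the createTableInHive callback are illustrative, not the actual Spark internals):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hand Hive a dummy directory as Table.dataLocation, then delete that
// directory once the external table has been created.
def createExternalTableWithDummyLocation(
    warehouseDir: String,
    tableName: String,
    hadoopConf: Configuration)(createTableInHive: Path => Unit): Unit = {
  // Dummy location, so Hive doesn't create its own empty directory under the
  // database location.
  val dummyDir = new Path(warehouseDir, s"$tableName-__PLACEHOLDER__")
  createTableInHive(dummyDir)
  // The table is external, so Hive never cleans this directory up; remove it ourselves.
  val fs = dummyDir.getFileSystem(hadoopConf)
  fs.delete(dummyDir, true)
}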

How was this patch tested?

  1. A new test case is added in `HiveQuerySuite` for this case.
  2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue in the first place.)

@liancheng
Contributor Author

liancheng commented May 24, 2016

@xwu0226 Would you please help review this one? It's based on our discussion in your PR (#13120). The benefit of this version is that it avoids the bad case mentioned in this comment.

cc @yhuai

@xwu0226
Contributor

xwu0226 commented May 24, 2016

@liancheng sure. Thanks!

@SparkQA

SparkQA commented May 24, 2016

Test build #59169 has finished for PR 13270 at commit f376332.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

I am wondering if we should worry about the case where the `defaultTablePath` happens to be the same as the user-specified path used for creating an external table that contains real data. It may delete the external table data. For example:

create table t10 (c1 int) using parquet options(path '/Users/xinwu/spark/spark-warehouse/t10');
insert into t10 values (1);
drop table t10;
create table t10 (c1 int) using parquet options(path '/Users/xinwu/spark/spark-warehouse/t10');

In the above case, my metastore warehouse dir is `/Users/xinwu/spark/spark-warehouse`, so `defaultTablePath` will return `/Users/xinwu/spark/spark-warehouse/t10`. Now, upon the creation of the 2nd table, the data path will be deleted, right? Maybe this is a corner case that we don't need to worry about, but I thought I should bring it up.

Another observation is that in a Hive-compatible case (such as the one above), `createDataSourceTables` sets the `locationURI` to the user-specified path, but it will be overridden by the above code. Then users will not be able to query anything back from the Hive shell, unless users don't expect to see the same results from the Hive shell for Hive-compatible tables. I am not sure about the semantics of Hive-compatible data source tables. Will this be a problem? Thanks!

Contributor Author

Yeah, thanks! Will add a check for the first case. The second case should be the reason why Jenkins tests failed.

Contributor Author

Added a `__PLACEHOLDER__` suffix to the dummy table location path to fix the first case, and made sure that we handle Hive-compatible tables properly.

Contributor

So the Hive metastore will create the dummy location `<metastore warehouse dir>/__PLACEHOLDER__` when the data source table is created, and it will be removed right away. And in the Hive-compatible case, the `locationURI` value set by `createDataSourceTables.newHiveCompatibleMetastoreTable` will not be overridden with `__PLACEHOLDER__`, correct?

Contributor Author

Correct. Because these tables are in standard Hive format.

Contributor Author

@liancheng liancheng May 25, 2016

Actually it's `<warehouse-dir>/<table-name>-__PLACEHOLDER__`.

@SparkQA

SparkQA commented May 24, 2016

Test build #59224 has finished for PR 13270 at commit 7c77dc1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 25, 2016

Test build #59246 has finished for PR 13270 at commit 193e005.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

Preserve the original exception so that we can see the Hive-internal stack trace.

@liancheng
Contributor Author

@xwu0226 Thanks. Handled Hive-compatible tables here. `CatalogTable` should be implementation agnostic, so it's infeasible to add a Hive-specific flag to `CatalogTable`.

@xwu0226
Contributor

xwu0226 commented May 25, 2016

@liancheng Thanks! I see what you mean in the code where you handle Hive-compatible tables. That covers table lookup time. But when creating a table, we may still wrongly set the `locationURI` to `<warehouse.dir>/tableName/__PLACEHOLDER__` for a Hive-compatible table, such that querying from the Hive shell will not return any results, right?
This is one reason why I only wanted to touch the code path for creating non-Hive-compatible data source tables in createDataSourceTables.scala -> `CreateDataSourceTableUtils.createDataSourceTable` in my PR.

Please see my last comment in my PR, which tries to address your concern about having multiple data files in one directory.

@SparkQA

SparkQA commented May 25, 2016

Test build #59295 has finished for PR 13270 at commit 64a0cfd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

@xwu0226 No, we shouldn't have a problem with Hive-compatible tables now, since we only add the placeholder location URI when `table.storage.locationUri` is empty (see here), while Hive-compatible tables always set this field.

(BTW, the placeholder path is `<warehouse-dir>/<tableName>-__PLACEHOLDER__`. The string `__PLACEHOLDER__` is part of the directory name rather than the name of a sub-directory.)
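
A minimal sketch of that guard, assuming simplified names (not the exact HiveExternalCatalog code): only tables without an explicit location get the dummy path, so Hive-compatible tables keep the locationUri they were given.

def hiveDataLocation(
    warehouseDir: String,
    tableName: String,
    locationUri: Option[String]): String =
  // Fall back to the placeholder only when no location was provided.
  locationUri.getOrElse(s"$warehouseDir/$tableName-__PLACEHOLDER__")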

@liancheng
Contributor Author

@xwu0226 I should probably add a comment to explain the `locationUri.isEmpty` thing since it's not quite intuitive.

@xwu0226
Contributor

xwu0226 commented May 25, 2016

@liancheng I see your point now. `table.storage.locationUri` is not set for non-compatible data source tables, and you only set the placeholder in that case. Thank you!!

@liancheng
Contributor Author

@xwu0226 For your last comment in your PR, I also realized while working on this PR that we are not really using `CatalogTable.storage.locationUri` for data source tables.

This means we can set that field to an arbitrary location URI, as long as the location:

  1. is an existing directory, or
  2. is a non-existing location that Hive can definitely create as a new directory.

This PR takes the 2nd approach, namely making a temporary directory and removing it later. The reason is that I think it can be dangerous to set an existing directory as the location URI: I was afraid that Hive might try to delete that directory because of bugs on either the Hive side or the Spark side. I actually tried setting `/` (which definitely exists) as the location URI while working on this PR, and Hive tried to delete my root directory during unit test execution... (But that was probably caused by my own bugs.)

@xwu0226
Contributor

xwu0226 commented May 25, 2016

@liancheng Thanks for the explanation!! It is safer this way. :)

Contributor Author

@liancheng liancheng May 26, 2016

I first noticed this bug while writing tests in this test suite: test cases always fail if I use the same table name in multiple test cases. That's why I made the changes to this file as additional tests.

@SparkQA

SparkQA commented May 26, 2016

Test build #59311 has finished for PR 13270 at commit db06f13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 26, 2016

Test build #59313 has finished for PR 13270 at commit 1545f04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

liancheng commented May 26, 2016

cc @yhuai @cloud-fan

Contributor

I think Hive-specific details/hacks should not be exposed in `SessionCatalog`. Let's move this into `HiveExternalCatalog`.

Contributor Author

I added these changes here mostly because `HiveExternalCatalog` didn't have access to the Hadoop configuration, which is needed to instantiate the `FileSystem` instance. Added an extra constructor argument to `HiveExternalCatalog` and moved this change there.

@SparkQA

SparkQA commented May 31, 2016

Test build #59632 has finished for PR 13270 at commit 3830dbb.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng liancheng force-pushed the spark-15269-unpleasant-fix branch from 3830dbb to 336fb55 on May 31, 2016 05:00
@SparkQA

SparkQA commented May 31, 2016

Test build #59635 has finished for PR 13270 at commit 336fb55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


  private def requireDbMatches(db: String, table: CatalogTable): Unit = {
-   if (table.identifier.database != Some(db)) {
+   if (!table.identifier.database.contains(db)) {
Contributor

Let's not change this because `contains` does not exist in Scala 2.10.
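
For reference, a small illustration of the two spellings (Option.contains was only added in Scala 2.11):

val database: Option[String] = Some("default")

// Compiles on Scala 2.10 and later:
val matchesOld = database == Some("default")

// Only compiles on Scala 2.11+:
val matchesNew = database.contains("default")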

Contributor Author

Oh... I should disable this check in IDEA then.

@SparkQA

SparkQA commented May 31, 2016

Test build #59687 has finished for PR 13270 at commit 7d0122f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jun 1, 2016

LGTM. @liancheng Can you merge this?

@liancheng
Contributor Author

Merging to master and branch-2.0.

@xwu0226 @yhuai Thanks for the review!

asfgit pushed a commit that referenced this pull request Jun 1, 2016
… while creating external Spark SQL data source tables.

This PR is an alternative to #13120 authored by xwu0226.

## What changes were proposed in this pull request?

When creating an external Spark SQL data source table and persisting its metadata to Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under database location, but doesn't remove it while dropping the table (because the table is external).

This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table.

Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround.

[1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408

## How was this patch tested?

1. A new test case is added in `HiveQuerySuite` for this case
2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue in the first place.)

Author: Cheng Lian <[email protected]>

Closes #13270 from liancheng/spark-15269-unpleasant-fix.

(cherry picked from commit 7bb64aa)
Signed-off-by: Cheng Lian <[email protected]>
@asfgit asfgit closed this in 7bb64aa Jun 1, 2016
