[SPARK-15269][SQL] Removes unexpected empty table directories created while creating external Spark SQL data source tables. #13270
Conversation
@xwu0226 Would you please help review this one? It's based on our discussion in your PR (#13120). The benefit of this version is that it avoids the bad case mentioned in this comment. cc @yhuai
@liancheng sure. Thanks!
Test build #59169 has finished for PR 13270 at commit
I am wondering if we should worry about the case where defaultTablePath happens to be the same as the user-specified path of an external table that contains real data. It may delete the external table data. For example:
create table t10 (c1 int) using parquet options(path '/Users/xinwu/spark/spark-warehouse/t10');
insert into t10 values (1);
drop table t10;
create table t10 (c1 int) using parquet options(path '/Users/xinwu/spark/spark-warehouse/t10');
In the above case, my metastore warehouse dir is `/Users/xinwu/spark/spark-warehouse`, so `defaultTablePath` will return `/Users/xinwu/spark/spark-warehouse/t10`. Now, upon the creation of the 2nd table, the data path will be deleted, right? Maybe this is a corner case that we don't need to worry about, but I thought I should bring it up.
Another observation is that in the Hive-compatible case (the above case), createDataSourceTables sets the locationURI to the user-specified path, but it will be overridden by the above code. Then users will not be able to query anything back from the Hive shell, unless users don't expect to see the same results from the Hive shell for Hive-compatible tables. I am not sure about the semantics of Hive-compatible data source tables. Will this be a problem? Thanks!
Yeah, thanks! Will add a check for the first case. The second case should be the reason why Jenkins tests failed.
Added a __PLACEHOLDER__ suffix to the dummy table location path to fix the first case, and made sure that we handle Hive-compatible tables properly.
So the Hive metastore will create the dummy location <metastore warehouse dir>/__PLACEHOLDER__ when the data source table is created, and that directory will be removed right away. And for the Hive-compatible case, the locationURI value set by createDataSourceTables.newHiveCompatibleMetastoreTable will not be overridden with __PLACEHOLDER__, correct?
Correct, because these tables are in standard Hive format.
Actually it's <warehouse-dir>/<table-name>-__PLACEHOLDER__.
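For illustration, here is a minimal sketch (not the actual Spark code; `warehouseDir` and `tableName` are hypothetical parameters) of how such a dummy placeholder location could be derived:

```scala
import org.apache.hadoop.fs.Path

// Sketch only: derive a dummy location of the form
// <warehouse-dir>/<table-name>-__PLACEHOLDER__ for an external table whose
// real data lives somewhere else.
def placeholderLocation(warehouseDir: String, tableName: String): Path =
  new Path(warehouseDir, s"$tableName-__PLACEHOLDER__")

// Example: placeholderLocation("/user/hive/warehouse", "t10")
// yields /user/hive/warehouse/t10-__PLACEHOLDER__
```

The distinct suffix keeps the dummy path from colliding with a user-specified data path such as the t10 example discussed above.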
Test build #59224 has finished for PR 13270 at commit
Test build #59246 has finished for PR 13270 at commit
Preserve the original exception so that we can see the Hive internal stack trace.
@liancheng Thanks! I see what you mean in the code where you handle the Hive-compatible tables. That covers the table lookup path. But when creating a table, we may still wrongly set the … Please see my last comment in my PR, which tries to address your concern about having multiple data files in one directory.
Test build #59295 has finished for PR 13270 at commit
@xwu0226 No, we shouldn't have a problem with Hive-compatible tables now, since we only add the placeholder location URI when … (BTW, the placeholder path is …)
@xwu0226 I should probably add a comment to explain the …
@liancheng I see your point now.
@xwu0226 For your last comment in your PR, I also realized that we are not really using … This means we can set an arbitrary location URI to that field as long as this location:
This PR takes the 2nd approach, namely making a temporary directory and removing it later. The reason is that I think it can be dangerous to set an existing directory as the location URI: I was afraid that Hive might try to delete that directory because of bugs on either the Hive side or the Spark side. Actually I tried to set …
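A rough sketch of that temporary-directory approach, with hypothetical names (`registerInMetastore` stands in for the real metastore call and is not part of Spark's API):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch only: point the metastore at a dummy location, let Hive create that
// directory as a side effect of registering the external table, then remove
// the empty placeholder directory right away.
def createExternalTableWithPlaceholder(
    dummyLocation: Path,
    hadoopConf: Configuration)(registerInMetastore: => Unit): Unit = {
  registerInMetastore // Hive creates dummyLocation as a side effect
  val fs = dummyLocation.getFileSystem(hadoopConf)
  // The placeholder carries a distinct -__PLACEHOLDER__ suffix, so removing it
  // should not collide with a user-specified data path.
  fs.delete(dummyLocation, true)
}
```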
@liancheng Thanks for the explanation!! It is safer this way. :)
Force-pushed from db06f13 to 1545f04.
I first noticed this bug while writing tests in this test suite: test cases always fail if I reuse the same table name across multiple test cases. That's why I made the changes to this file as additional tests.
Test build #59311 has finished for PR 13270 at commit
Test build #59313 has finished for PR 13270 at commit
cc @yhuai @cloud-fan
I think Hive-specific details/hacks should not be exposed in SessionCatalog. Let's move it into HiveExternalCatalog.
Added these changes here mostly because HiveExternalCatalog doesn't have access to the Hadoop configuration, which is used to instantiate the FileSystem instance. Added an extra constructor argument to HiveExternalCatalog and moved this change there.
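A simplified sketch of the shape of that change (the real HiveExternalCatalog constructor differs; this only illustrates injecting the Hadoop configuration so the catalog can obtain FileSystem instances itself):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: thread the Hadoop Configuration through the catalog's
// constructor instead of exposing Hive-specific cleanup in SessionCatalog.
class HiveExternalCatalogSketch(hadoopConf: Configuration) {
  // Remove the empty placeholder directory left behind by the metastore.
  def removePlaceholder(location: Path): Unit = {
    val fs: FileSystem = location.getFileSystem(hadoopConf)
    if (fs.exists(location)) {
      fs.delete(location, true)
    }
  }
}
```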
Test build #59632 has finished for PR 13270 at commit
Force-pushed from 3830dbb to 336fb55.
Test build #59635 has finished for PR 13270 at commit
  private def requireDbMatches(db: String, table: CatalogTable): Unit = {
-   if (table.identifier.database != Some(db)) {
+   if (!table.identifier.database.contains(db)) {
Let's not change this because contains does not exist in Scala 2.10.
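For context, a small illustration of the two equivalent checks (Option.contains only became available in Scala 2.11, which is why the older comparison has to stay):

```scala
// The two conditions below are equivalent for an Option[String] database field;
// only the first one compiles on Scala 2.10.
val db = "default"
val database: Option[String] = Some("default")

val mismatchOld = database != Some(db)       // works on Scala 2.10 and later
// val mismatchNew = !database.contains(db)  // Option.contains is Scala 2.11+
```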
Oh... I should disable this check in IDEA then.
Test build #59687 has finished for PR 13270 at commit
LGTM. @liancheng Can you merge this?
[SPARK-15269][SQL] Removes unexpected empty table directories created while creating external Spark SQL data source tables.

This PR is an alternative to #13120 authored by xwu0226.

## What changes were proposed in this pull request?

When creating an external Spark SQL data source table and persisting its metadata to the Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under the database location, but doesn't remove it when the table is dropped (because the table is external).

This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table.

Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround.

[1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408

## How was this patch tested?

1. A new test case is added in `HiveQuerySuite` for this case
2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue in the first place.)

Author: Cheng Lian <[email protected]>

Closes #13270 from liancheng/spark-15269-unpleasant-fix.

(cherry picked from commit 7bb64aa)
Signed-off-by: Cheng Lian <[email protected]>
This PR is an alternative to #13120 authored by @xwu0226.
What changes were proposed in this pull request?
When creating an external Spark SQL data source table and persisting its metadata to the Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under the database location, but doesn't remove it when the table is dropped (because the table is external).
This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table.
Please refer to this JIRA comment for more details about why we chose this approach as a workaround.
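As a concrete illustration, a minimal reproduction sketch of the symptom (table name and paths are illustrative; assumes a SparkSession built with Hive support):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: before this fix, creating and dropping an external data source
// table could leave an empty <warehouse-dir>/<table-name> directory behind,
// even though the table's data lives under the user-specified path.
val spark = SparkSession.builder()
  .appName("spark-15269-sketch")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE t1 (c1 INT) USING parquet OPTIONS (path '/tmp/external/t1')")
spark.sql("DROP TABLE t1")
// Check <warehouse-dir>/t1: it should no longer exist after this patch.
```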
How was this patch tested?
1. A new test case is added in `HiveQuerySuite` for this case.
2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue in the first place.)