
Conversation

@liancheng
Contributor

This PR is an alternative to #13120 authored by @xwu0226.

What changes were proposed in this pull request?

When creating an external Spark SQL data source table and persisting its metadata to the Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations, while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under the database location, but doesn't remove it when the table is dropped (because the table is external).

This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table.

Please refer to this JIRA comment for more details about why we chose this approach as a workaround.
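
For readers following along, below is a minimal sketch of the idea in Scala, assuming simplified names (the helper and the createTableInHive callback are illustrative, not the actual Spark internals):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hand Hive a dummy directory as Table.dataLocation, then delete that
// directory once the external table has been created.
def createExternalTableWithDummyLocation(
    warehouseDir: String,
    tableName: String,
    hadoopConf: Configuration)(createTableInHive: Path => Unit): Unit = {
  // Dummy location, so Hive doesn't create its own empty directory under the
  // database location.
  val dummyDir = new Path(warehouseDir, s"$tableName-__PLACEHOLDER__")
  createTableInHive(dummyDir)
  // The table is external, so Hive never cleans this directory up; remove it ourselves.
  val fs = dummyDir.getFileSystem(hadoopConf)
  fs.delete(dummyDir, true)
}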

How was this patch tested?

  1. A new test case is added in `HiveQuerySuite` for this case.
  2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue in the first place.)

@liancheng
Contributor Author

liancheng commented May 24, 2016

@xwu0226 Would you please help review this one? It's based on our discussion in your PR (#13120). The benefit of this version is that it avoids the bad case mentioned in this comment.

cc @yhuai

@xwu0226
Contributor

xwu0226 commented May 24, 2016

@liancheng sure. Thanks!

@SparkQA

SparkQA commented May 24, 2016

Test build #59169 has finished for PR 13270 at commit f376332.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

I am wondering if we should worry about the case where the `defaultTablePath` happens to be the same as the user-specified path used for creating an external table that contains real data. It may delete the external table data. For example:

create table t10 (c1 int) using parquet options(path '/Users/xinwu/spark/spark-warehouse/t10');
insert into t10 values (1);
drop table t10;
create table t10 (c1 int) using parquet options(path '/Users/xinwu/spark/spark-warehouse/t10');

In the above case, my metastore warehouse dir is `/Users/xinwu/spark/spark-warehouse`, so `defaultTablePath` will return `/Users/xinwu/spark/spark-warehouse/t10`. Now, upon the creation of the 2nd table, the data path will be deleted, right? Maybe this is a corner case that we don't need to worry about, but I thought I should bring it up.

Another observation is that in a Hive-compatible case (such as the one above), `createDataSourceTables` sets the `locationURI` to the user-specified path, but it will be overridden by the above code. Then users will not be able to query anything back from the Hive shell, unless users don't expect to see the same results from the Hive shell for Hive-compatible tables. I am not sure about the semantics of Hive-compatible data source tables. Will this be a problem? Thanks!

Contributor Author

Yeah, thanks! Will add a check for the first case. The second case should be the reason why Jenkins tests failed.

Contributor Author

Added a `__PLACEHOLDER__` suffix to the dummy table location path to fix the first case, and made sure that we handle Hive-compatible tables properly.

Contributor

So the Hive metastore will create the dummy location `<metastore warehouse dir>/__PLACEHOLDER__` when the data source table is created, and it will be removed right away. And in the Hive-compatible case, the `locationURI` value set by `createDataSourceTables.newHiveCompatibleMetastoreTable` will not be overridden with `__PLACEHOLDER__`, correct?

Contributor Author

Correct. Because these tables are in standard Hive format.

Contributor Author

@liancheng liancheng May 25, 2016

Actually it's `<warehouse-dir>/<table-name>-__PLACEHOLDER__`.

@SparkQA

SparkQA commented May 24, 2016

Test build #59224 has finished for PR 13270 at commit 7c77dc1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 25, 2016

Test build #59246 has finished for PR 13270 at commit 193e005.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

Preserve the original exception so that we can see the Hive-internal stack trace.

@liancheng
Contributor Author

@xwu0226 Thanks. Handled Hive-compatible tables here. `CatalogTable` should be implementation agnostic, so it's infeasible to add a Hive-specific flag to `CatalogTable`.

@xwu0226
Contributor

xwu0226 commented May 25, 2016

@liancheng Thanks! I see what you mean in the code where you handle Hive-compatible tables. That covers table lookup time. But when creating a table, we may still wrongly set the `locationURI` to `<warehouse.dir>/tableName/__PLACEHOLDER__` for a Hive-compatible table, such that querying from the Hive shell will not return any results, right?
This is one reason why I only wanted to touch the code path for creating non-Hive-compatible data source tables in createDataSourceTables.scala -> `CreateDataSourceTableUtils.createDataSourceTable` in my PR.

Please see my last comment in my PR, which tries to address your concern about having multiple data files in one directory.

@SparkQA

SparkQA commented May 25, 2016

Test build #59295 has finished for PR 13270 at commit 64a0cfd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

@xwu0226 No, we shouldn't have a problem with Hive-compatible tables now, since we only add the placeholder location URI when `table.storage.locationUri` is empty (see here), while Hive-compatible tables always set this field.

(BTW, the placeholder path is `<warehouse-dir>/<tableName>-__PLACEHOLDER__`. The string `__PLACEHOLDER__` is part of the directory name rather than the name of a sub-directory.)
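
A minimal sketch of that guard, assuming simplified names (not the exact HiveExternalCatalog code): only tables without an explicit location get the dummy path, so Hive-compatible tables keep the locationUri they were given.

def hiveDataLocation(
    warehouseDir: String,
    tableName: String,
    locationUri: Option[String]): String =
  // Fall back to the placeholder only when no location was provided.
  locationUri.getOrElse(s"$warehouseDir/$tableName-__PLACEHOLDER__")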

@liancheng
Contributor Author

@xwu0226 I should probably add a comment to explain the `locationUri.isEmpty` thing since it's not quite intuitive.

@xwu0226
Contributor

xwu0226 commented May 25, 2016

@liancheng I see your point now. `table.storage.locationUri` is not set for non-compatible data source tables, and you only set the placeholder in that case. Thank you!!

@liancheng
Contributor Author

@xwu0226 For your last comment in your PR, I also realized while working on this PR that we are not really using `CatalogTable.storage.locationUri` for data source tables.

This means we can set that field to an arbitrary location URI, as long as the location:

  1. is an existing directory, or
  2. is a non-existing location that Hive can definitely create as a new directory.

This PR takes the 2nd approach, namely making a temporary directory and removing it later. The reason is that I think it can be dangerous to set an existing directory as the location URI: I was afraid that Hive might try to delete that directory because of bugs on either the Hive side or the Spark side. I actually tried setting `/` (which definitely exists) as the location URI while working on this PR, and Hive tried to delete my root directory during unit test execution... (But that was probably caused by my own bugs.)

@xwu0226
Contributor

xwu0226 commented May 25, 2016

@liancheng Thanks for the explanation!! It is safer this way. :)

Contributor Author

@liancheng liancheng May 26, 2016

I first noticed this bug while writing tests in this test suite: test cases always fail if I use the same table name in multiple test cases. That's why I made the changes to this file as additional tests.

@SparkQA

SparkQA commented May 26, 2016

Test build #59311 has finished for PR 13270 at commit db06f13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 26, 2016

Test build #59313 has finished for PR 13270 at commit 1545f04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

liancheng commented May 26, 2016

cc @yhuai @cloud-fan

Contributor

I think Hive-specific details/hacks should not be exposed in `SessionCatalog`. Let's move this into `HiveExternalCatalog`.

Contributor Author

I added these changes here mostly because `HiveExternalCatalog` didn't have access to the Hadoop configuration, which is needed to instantiate the `FileSystem` instance. Added an extra constructor argument to `HiveExternalCatalog` and moved this change there.

@SparkQA

SparkQA commented May 31, 2016

Test build #59632 has finished for PR 13270 at commit 3830dbb.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng liancheng force-pushed the spark-15269-unpleasant-fix branch from 3830dbb to 336fb55 on May 31, 2016 05:00
@SparkQA

SparkQA commented May 31, 2016

Test build #59635 has finished for PR 13270 at commit 336fb55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


  private def requireDbMatches(db: String, table: CatalogTable): Unit = {
-   if (table.identifier.database != Some(db)) {
+   if (!table.identifier.database.contains(db)) {
Contributor

Let's not change this because `contains` does not exist in Scala 2.10.
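
For reference, a small illustration of the two spellings (Option.contains was only added in Scala 2.11):

val database: Option[String] = Some("default")

// Compiles on Scala 2.10 and later:
val matchesOld = database == Some("default")

// Only compiles on Scala 2.11+:
val matchesNew = database.contains("default")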

Contributor Author

Oh... I should disable this check in IDEA then.

@SparkQA

SparkQA commented May 31, 2016

Test build #59687 has finished for PR 13270 at commit 7d0122f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jun 1, 2016

LGTM. @liancheng Can you merge this?

@liancheng
Contributor Author

Merging to master and branch-2.0.

@xwu0226 @yhuai Thanks for the review!

asfgit pushed a commit that referenced this pull request Jun 1, 2016
… while creating external Spark SQL data source tables.

This PR is an alternative to #13120 authored by xwu0226.

## What changes were proposed in this pull request?

When creating an external Spark SQL data source table and persisting its metadata to Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under database location, but doesn't remove it while dropping the table (because the table is external).

This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table.

Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround.

[1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408

## How was this patch tested?

1. A new test case is added in `HiveQuerySuite` for this case
2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue in the first place.)

Author: Cheng Lian <[email protected]>

Closes #13270 from liancheng/spark-15269-unpleasant-fix.

(cherry picked from commit 7bb64aa)
Signed-off-by: Cheng Lian <[email protected]>
@asfgit asfgit closed this in 7bb64aa Jun 1, 2016
