-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-17541][SQL] fix some DDL bugs about table management when same-name temp view exists #15099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #65379 has finished for PR 15099 at commit
|
|
Test build #65383 has finished for PR 15099 at commit
|
|
Test build #65420 has finished for PR 15099 at commit
|
| assert(table.provider.isDefined) | ||
| assert(table.schema.isEmpty) | ||
|
|
||
| val tableName = table.identifier.unquotedString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a suggestion. Since we already added the database into the identifier, we can use fully qualified names in the error messages. It will be much easier for users to know which table it is. That means, we can change this line to
val tableName = tableIdentWithDB.identifier.unquotedStringBelow is an example of DB2. The error message shows the fully qualified name.
db2 => drop table ta
DB21034E The command was processed as an SQL statement because it was not a
valid Command Line Processor command. During SQL processing it returned:
SQL0204N "DB2INST1.TA" is an undefined name. SQLSTATE=42704
| sessionCatalog.dropTable(TableIdentifier(viewName), ignoreIfNotExists = true, purge = false) | ||
| val maybeTempView = sparkSession.sessionState.catalog.getTempView(viewName) | ||
| if (maybeTempView.isDefined) { | ||
| val view = SubqueryAlias(viewName, maybeTempView.get, Some(TableIdentifier(viewName))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val view = maybeTempView.get
If we simplify it to the above, the effect is the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no it's different. When we cache it, we use the plan returned by loopupRelation, which has the subquery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me show an example in CachedTableSuite.scala:
test("Drops cached temporary table") {
testData.select('key).createOrReplaceTempView("t1")
testData.select('key).createOrReplaceTempView("t2")
spark.catalog.cacheTable("t1")
assert(spark.catalog.isCached("t1"))
assert(spark.catalog.isCached("t2"))
spark.catalog.dropTempView("t1")
intercept[AnalysisException](spark.table("t1"))
assert(!spark.catalog.isCached("t2"))
assert(spark.sharedState.cacheManager.isEmpty)
}It is not affected by SubqueryAlias.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean you pulled my PR and remove the subquery and this test still passed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I tried it. : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right, the CacheManager search the key by sameResult, so the subquery is useless
| tempTables.get(formatTableName(name)) | ||
| } | ||
|
|
||
| def dropTempView(name: String): Unit = synchronized { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the description for this function too?
| val sessionState = sparkSession.sessionState | ||
| if (sessionState.catalog.tableExists(table.identifier)) { | ||
| val db = table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase) | ||
| val tableIdentWithDB = table.identifier.copy(database = Some(db)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make these three lines a util in TableIdentifier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CreateDataSourceTableCommand is planned in physical plan stage. Maybe we should generate correct table identifier in logical plan stage?
| } | ||
| case SaveMode.Overwrite => | ||
| sparkSession.sql(s"DROP TABLE IF EXISTS $tableName") | ||
| sessionState.catalog.dropTable(tableIdentWithDB, ignoreIfNotExists = false, purge = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If ignoreIfNotExists = false, it throws an exception if the table doesn't exist.
// File SessionCatalog line 392
} else if (!ignoreIfNotExists) {
throw new NoSuchTableException(db = db, table = table)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
| /** | ||
| * Return a temporary view exactly as it was stored. | ||
| */ | ||
| def getTempView(name: String): Option[LogicalPlan] = synchronized { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the same file, there are also interfaces like isTemporaryTable, tableExists, refreshTable, clearTempTables, which can accept a temporary view, but using the name "table" instead of "view"
I am not sure introducing getTempView and dropTempView will make the class SessionCatalog's interface simpler or more complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have createTempView, I don't think the new APIs make SessionCatalog more complicated.
| val sessionState = sparkSession.sessionState | ||
| if (sessionState.catalog.tableExists(table.identifier)) { | ||
| val db = table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase) | ||
| val tableIdentWithDB = table.identifier.copy(database = Some(db)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CreateDataSourceTableCommand is planned in physical plan stage. Maybe we should generate correct table identifier in logical plan stage?
| if (sparkSession.sessionState.catalog.tableExists(table.identifier)) { | ||
| // Pass a table identifier with database part, so that `tableExists` won't check temp views | ||
| // unexpectedly. | ||
| if (sparkSession.sessionState.catalog.tableExists(tableIdentWithDB)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For
// Pass a table identifier with database part, so that `tableExists` won't check temp views
This is kind of hack to me. Do you have ways to avoid hacking table identifier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have, in my previous refactor PR: #14962
This PR aims to just fix these bugs surgically, so that it's easy to backport to 2.0
|
My comments: 1. There are too many hack about TableIdentifier.Problem:Currently, our current use of TableIdentifier is kind of ambiguous. For example, if database is not specified, sometimes we interpret the table identifier as a table in default database, sometimes, we interpret it as a temporary view. It is not good! I think if we want to call it as an Id, then it need to point to a exact table/view with no ambiguity. Suggestion:Suggest to resolve TableIdentifier to a full qualified table identifier in Analyzer stage. For a TableIdentifier from SQL parser which doesn't have a database, we should:
After analysis, there should be no TableIdentifier which doesn't have database specified. The interface change in SessionCatalog is not consistent with other methods name conversion.ProblemIn class SessionCatalog, this PR adds two method dropTempView, getTempView, but at the same time all other methods in SessionCatalog use name conversion "table" instead of "view". For example, the following methods can operate on temporary view: I think the change will make class SessionCatalog harder to understand. SuggestionWe still use the name "table", but use the TableIdentifier to differentiate whether it is a temporary view or table. |
|
@clockfly please note that this PR just want to fix these bugs surgically, to minimize the code changes and make it easier to backport to 2.0. I have a pending refactor PR to clean them up, we can discuss your proposal there. For |
|
Test build #65483 has finished for PR 15099 at commit
|
|
Test build #65491 has finished for PR 15099 at commit
|
|
LGTM, except one minor comment |
|
LGTM! |
|
@andrewor14 I like your new signature! :) |
|
Test build #65560 has finished for PR 15099 at commit
|
|
retest this please |
|
Test build #65556 has finished for PR 15099 at commit
|
|
@cloud-fan, since this bug fix is important that other patches depends on this, I am +1 on current approach. Maybe in future, we should consider making TableIdentifier a truly unique identifier in physical plan stage for all views, temp views, global views, and all tables. |
|
Test build #65565 has finished for PR 15099 at commit
|
…-name temp view exists In `SessionCatalog`, we have several operations(`tableExists`, `dropTable`, `loopupRelation`, etc) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp view only or metastore table/view only. These bugs are: 1. `CREATE TABLE USING` will fail if a same-name temp view exists 2. `Catalog.dropTempView`will un-cache and drop metastore table if a same-name table exists 3. `saveAsTable` will fail or have unexpected behaviour if a same-name temp view exists. These bug fixes are pulled out from #14962 and targets both master and 2.0 branch new regression tests Author: Wenchen Fan <[email protected]> Closes #15099 from cloud-fan/fix-view. (cherry picked from commit 3fe630d) Signed-off-by: Wenchen Fan <[email protected]>
|
@andrewor14 welcome back! thanks for your reviews! merging to master and 2.0! |
|
@cloud-fan I think this might be causing a failure on the branch 2.0 builds? |
|
Let me do a quick fix. |
…-name temp view exists ## What changes were proposed in this pull request? In `SessionCatalog`, we have several operations(`tableExists`, `dropTable`, `loopupRelation`, etc) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp view only or metastore table/view only. These bugs are: 1. `CREATE TABLE USING` will fail if a same-name temp view exists 2. `Catalog.dropTempView`will un-cache and drop metastore table if a same-name table exists 3. `saveAsTable` will fail or have unexpected behaviour if a same-name temp view exists. These bug fixes are pulled out from apache#14962 and targets both master and 2.0 branch ## How was this patch tested? new regression tests Author: Wenchen Fan <[email protected]> Closes apache#15099 from cloud-fan/fix-view.
What changes were proposed in this pull request?
In
SessionCatalog, we have several operations(tableExists,dropTable,loopupRelation, etc) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp view only or metastore table/view only. These bugs are:CREATE TABLE USINGwill fail if a same-name temp view existsCatalog.dropTempViewwill un-cache and drop metastore table if a same-name table existssaveAsTablewill fail or have unexpected behaviour if a same-name temp view exists.These bug fixes are pulled out from #14962 and targets both master and 2.0 branch
How was this patch tested?
new regression tests