Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

In SessionCatalog, we have several operations(tableExists, dropTable, loopupRelation, etc) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp view only or metastore table/view only. These bugs are:

  1. CREATE TABLE USING will fail if a same-name temp view exists
  2. Catalog.dropTempViewwill un-cache and drop metastore table if a same-name table exists
  3. saveAsTable will fail or have unexpected behaviour if a same-name temp view exists.

These bug fixes are pulled out from #14962 and targets both master and 2.0 branch

How was this patch tested?

new regression tests

@cloud-fan
Copy link
Contributor Author

cc @yhuai @gatorsmile

@SparkQA
Copy link

SparkQA commented Sep 14, 2016

Test build #65379 has finished for PR 15099 at commit 7fcee45.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 14, 2016

Test build #65383 has finished for PR 15099 at commit a582ed8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 15, 2016

Test build #65420 has finished for PR 15099 at commit 4ab777e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(table.provider.isDefined)
assert(table.schema.isEmpty)

val tableName = table.identifier.unquotedString
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a suggestion. Since we already added the database into the identifier, we can use fully qualified names in the error messages. It will be much easier for users to know which table it is. That means, we can change this line to

val tableName = tableIdentWithDB.identifier.unquotedString

Below is an example of DB2. The error message shows the fully qualified name.

db2 => drop table ta
DB21034E  The command was processed as an SQL statement because it was not a 
valid Command Line Processor command.  During SQL processing it returned:
SQL0204N  "DB2INST1.TA" is an undefined name.  SQLSTATE=42704

sessionCatalog.dropTable(TableIdentifier(viewName), ignoreIfNotExists = true, purge = false)
val maybeTempView = sparkSession.sessionState.catalog.getTempView(viewName)
if (maybeTempView.isDefined) {
val view = SubqueryAlias(viewName, maybeTempView.get, Some(TableIdentifier(viewName)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

val view = maybeTempView.get

If we simplify it to the above, the effect is the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no it's different. When we cache it, we use the plan returned by loopupRelation, which has the subquery.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me show an example in CachedTableSuite.scala:

  test("Drops cached temporary table") {
    testData.select('key).createOrReplaceTempView("t1")
    testData.select('key).createOrReplaceTempView("t2")
    spark.catalog.cacheTable("t1")

    assert(spark.catalog.isCached("t1"))
    assert(spark.catalog.isCached("t2"))

    spark.catalog.dropTempView("t1")
    intercept[AnalysisException](spark.table("t1"))
    assert(!spark.catalog.isCached("t2"))
    assert(spark.sharedState.cacheManager.isEmpty)
  }

It is not affected by SubqueryAlias.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean you pulled my PR and remove the subquery and this test still passed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I tried it. : )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, the CacheManager search the key by sameResult, so the subquery is useless

tempTables.get(formatTableName(name))
}

def dropTempView(name: String): Unit = synchronized {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the description for this function too?

val sessionState = sparkSession.sessionState
if (sessionState.catalog.tableExists(table.identifier)) {
val db = table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase)
val tableIdentWithDB = table.identifier.copy(database = Some(db))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make these three lines a util in TableIdentifier?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CreateDataSourceTableCommand is planned in physical plan stage. Maybe we should generate correct table identifier in logical plan stage?

}
case SaveMode.Overwrite =>
sparkSession.sql(s"DROP TABLE IF EXISTS $tableName")
sessionState.catalog.dropTable(tableIdentWithDB, ignoreIfNotExists = false, purge = false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ignoreIfNotExists = false, it throws an exception if the table doesn't exist.

// File SessionCatalog line 392
      } else if (!ignoreIfNotExists) {
        throw new NoSuchTableException(db = db, table = table)
      }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

/**
* Return a temporary view exactly as it was stored.
*/
def getTempView(name: String): Option[LogicalPlan] = synchronized {
Copy link
Contributor

@clockfly clockfly Sep 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the same file, there are also interfaces like isTemporaryTable, tableExists, refreshTable, clearTempTables, which can accept a temporary view, but using the name "table" instead of "view"

I am not sure introducing getTempView and dropTempView will make the class SessionCatalog's interface simpler or more complicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have createTempView, I don't think the new APIs make SessionCatalog more complicated.

val sessionState = sparkSession.sessionState
if (sessionState.catalog.tableExists(table.identifier)) {
val db = table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase)
val tableIdentWithDB = table.identifier.copy(database = Some(db))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CreateDataSourceTableCommand is planned in physical plan stage. Maybe we should generate correct table identifier in logical plan stage?

if (sparkSession.sessionState.catalog.tableExists(table.identifier)) {
// Pass a table identifier with database part, so that `tableExists` won't check temp views
// unexpectedly.
if (sparkSession.sessionState.catalog.tableExists(tableIdentWithDB)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For

// Pass a table identifier with database part, so that `tableExists` won't check temp views

This is kind of hack to me. Do you have ways to avoid hacking table identifier?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have, in my previous refactor PR: #14962

This PR aims to just fix these bugs surgically, so that it's easy to backport to 2.0

@clockfly
Copy link
Contributor

My comments:

1. There are too many hack about TableIdentifier.

Problem:

Currently, our current use of TableIdentifier is kind of ambiguous.

case class TableIdentifier(table: String, database: Option[String])

For example, if database is not specified, sometimes we interpret the table identifier as a table in default database, sometimes, we interpret it as a temporary view. It is not good! I think if we want to call it as an Id, then it need to point to a exact table/view with no ambiguity.

Suggestion:

Suggest to resolve TableIdentifier to a full qualified table identifier in Analyzer stage. For a TableIdentifier from SQL parser which doesn't have a database, we should:

  1. If it is a temporary view, then we should assign it a built-in database representing temporary view.
  2. If it is a table, then we should add the database prefix.

After analysis, there should be no TableIdentifier which doesn't have database specified.

The interface change in SessionCatalog is not consistent with other methods name conversion.

Problem

In class SessionCatalog, this PR adds two method dropTempView, getTempView, but at the same time all other methods in SessionCatalog use name conversion "table" instead of "view". For example, the following methods can operate on temporary view:

isTemporaryTable, 
tableExists, 
refreshTable, 
clearTempTables

I think the change will make class SessionCatalog harder to understand.

Suggestion

We still use the name "table", but use the TableIdentifier to differentiate whether it is a temporary view or table.

@cloud-fan
Copy link
Contributor Author

@clockfly please note that this PR just want to fix these bugs surgically, to minimize the code changes and make it easier to backport to 2.0. I have a pending refactor PR to clean them up, we can discuss your proposal there.

For SessionCatalog, we already have createTempView, I'd say it's a historic problem, when we renamed temporary table to temporary view, we forgot to update the APIs in SessionCatalog. In this PR, I don't wanna fix them, but for newly added APIs in this PR, I'd like to use the corrected names.

@SparkQA
Copy link

SparkQA commented Sep 16, 2016

Test build #65483 has finished for PR 15099 at commit 1c57262.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 16, 2016

Test build #65491 has finished for PR 15099 at commit c5513d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM, except one minor comment

@andrewor14
Copy link
Contributor

LGTM!

@gatorsmile
Copy link
Member

@andrewor14 I like your new signature! :) Apache Spark PMC member, doctor in five years

@SparkQA
Copy link

SparkQA commented Sep 18, 2016

Test build #65560 has finished for PR 15099 at commit 0223fd3.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public static final class UnsignedPrefixComparatorNullsLast extends RadixSortSupport
    • public static final class UnsignedPrefixComparatorDescNullsFirst extends RadixSortSupport
    • public static final class SignedPrefixComparatorNullsLast extends RadixSortSupport
    • public static final class SignedPrefixComparatorDescNullsFirst extends RadixSortSupport
    • abstract sealed class NullOrdering
    • case class SortOrder(
    • case class Length(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • class SetAccumulator[T] extends AccumulatorV2[T, java.util.Set[T]]

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Sep 18, 2016

Test build #65556 has finished for PR 15099 at commit b48e4ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@clockfly
Copy link
Contributor

clockfly commented Sep 18, 2016

@cloud-fan, since this bug fix is important that other patches depends on this, I am +1 on current approach. Maybe in future, we should consider making TableIdentifier a truly unique identifier in physical plan stage for all views, temp views, global views, and all tables.

@SparkQA
Copy link

SparkQA commented Sep 18, 2016

Test build #65565 has finished for PR 15099 at commit 0223fd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public static final class UnsignedPrefixComparatorNullsLast extends RadixSortSupport
    • public static final class UnsignedPrefixComparatorDescNullsFirst extends RadixSortSupport
    • public static final class SignedPrefixComparatorNullsLast extends RadixSortSupport
    • public static final class SignedPrefixComparatorDescNullsFirst extends RadixSortSupport
    • abstract sealed class NullOrdering
    • case class SortOrder(
    • case class Length(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • class SetAccumulator[T] extends AccumulatorV2[T, java.util.Set[T]]

@asfgit asfgit closed this in 3fe630d Sep 18, 2016
asfgit pushed a commit that referenced this pull request Sep 18, 2016
…-name temp view exists

In `SessionCatalog`, we have several operations(`tableExists`, `dropTable`, `loopupRelation`, etc) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp view only or metastore table/view only. These bugs are:

1. `CREATE TABLE USING` will fail if a same-name temp view exists
2. `Catalog.dropTempView`will un-cache and drop metastore table if a same-name table exists
3. `saveAsTable` will fail or have unexpected behaviour if a same-name temp view exists.

These bug fixes are pulled out from #14962 and targets both master and 2.0 branch

new regression tests

Author: Wenchen Fan <[email protected]>

Closes #15099 from cloud-fan/fix-view.

(cherry picked from commit 3fe630d)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Copy link
Contributor Author

@andrewor14 welcome back!

thanks for your reviews! merging to master and 2.0!

@srowen
Copy link
Member

srowen commented Sep 18, 2016

@gatorsmile
Copy link
Member

Let me do a quick fix.

wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…-name temp view exists

## What changes were proposed in this pull request?

In `SessionCatalog`, we have several operations(`tableExists`, `dropTable`, `loopupRelation`, etc) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp view only or metastore table/view only. These bugs are:

1. `CREATE TABLE USING` will fail if a same-name temp view exists
2. `Catalog.dropTempView`will un-cache and drop metastore table if a same-name table exists
3. `saveAsTable` will fail or have unexpected behaviour if a same-name temp view exists.

These bug fixes are pulled out from apache#14962 and targets both master and 2.0 branch

## How was this patch tested?

new regression tests

Author: Wenchen Fan <[email protected]>

Closes apache#15099 from cloud-fan/fix-view.
@cloud-fan cloud-fan deleted the fix-view branch December 14, 2016 12:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants