Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Sep 5, 2016

What changes were proposed in this pull request?

In SessionCatalog, we have several operations(getTableMetadata, tableExists, renameTable, dropTable, loopupRelation) that handle both temp views and metastore tables/views. They can save some lines of code for some commands that need to deal with both temp views and metastore tables/views, but also introduce bugs for other commands, because the operation names say nothing about temp views and are very likely to be misused:

  • DataFrameWriter.saveAsTable/CREATE TABLE USING will fail if a same-name temp view exits
  • DataFrameWriter.saveAsTable with overwrite mode will mistakenly drop a same-name temp view.
  • Catalog.dropTempView may drop metastore table mistakenly
  • ALTER TABLE RECOVER PARTITIONS/LOAD DATA/TRUNCATE TABLE/SHOW CREATE TABLE should report "table not found" instead of "temp view is not supported", if a same-name temp view exists, because these commands don't need to deal with temp views.

In some commands we support temp views mistakenly without mentioning it in document: ShowTablePropertiesCommand, ShowColumnsCommand, Catalog.listColumns.
This PR remove the temp view support in ShowTablePropertiesCommand, as temp view doesn't have properties. And explicitly document that ShowColumnsCommand and Catalog.listColumns support temp view.

Mixing the handling of temp views and metastore tables/views also makes it harder to implement thread-safe operations. e.g. AlterViewAsCommand checks isTemporaryTable first then createTempView, which is not atomic. Most temp view related operations in SessionCatalog hold a lock on the SessionCatalog object, which is unnecessary.

This PR separates the management of temp views and metastore tables/views in SessionCatalog, any commands that need to deal with temp views should explicitly call temp view related operations in SessionCatalog, to fix existing bugs and prevent future bug like this.

How was this patch tested?

existing tests and 3 new tests.

@cloud-fan
Copy link
Contributor Author

cc @yhuai @gatorsmile @liancheng

@cloud-fan
Copy link
Contributor Author

also cc @srinathshankar , it will be very easy to implement global temp view after this

@SparkQA
Copy link

SparkQA commented Sep 5, 2016

Test build #64943 has finished for PR 14962 at commit ebed732.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TempViewAlreadyExistsException(name: String)
    • class TempViewManager

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid adding TempViewManager?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason? Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's easier to implement and reason about the thread-safe semantic for temp views if we put temp view management into one place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just name it tempViewManager?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the goal of this PR is to add some view related API. So I think refactoring using TempViewManager is not the major goal?

Copy link
Member

@gatorsmile gatorsmile Sep 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change it back to HashMap, we need to add synchronized back. Is my understanding right?

Same to all the other related functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, now we let TempViewManager to implement the thread-safe semantic

@SparkQA
Copy link

SparkQA commented Sep 6, 2016

Test build #64973 has finished for PR 14962 at commit 8eb6c42.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TempViewAlreadyExistsException(name: String)
    • class TempViewManager

@cloud-fan cloud-fan force-pushed the temp-view branch 2 times, most recently from 7314a33 to 0075f9d Compare September 6, 2016 09:04
@SparkQA
Copy link

SparkQA commented Sep 6, 2016

Test build #64984 has finished for PR 14962 at commit 7314a33.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TempViewAlreadyExistsException(name: String)
    • class TempViewManager

@SparkQA
Copy link

SparkQA commented Sep 6, 2016

Test build #64987 has finished for PR 14962 at commit 0075f9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TempViewAlreadyExistsException(name: String)
    • class TempViewManager

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop Table is unable to drop a temp view, right?

    spark.range(10).createTempView("tempView")
    sql("DESC tempView").show()
    sql("DROP TABLE tempView")
    sql("DESC tempView").show()

Copy link
Contributor Author

@cloud-fan cloud-fan Sep 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I noticed this and fixed it before, but it breaks a lot of tests, because we call "temp view" as "temp table" before. I'd like to keep this behaviour as it was, we can discuss how to fix it in follow-ups.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks!

@gatorsmile
Copy link
Member

gatorsmile commented Sep 7, 2016

Found a common bug in the following ALTER TABLE commands:

    | ALTER TABLE tableIdentifier (partitionSpec)?
        SET SERDE STRING (WITH SERDEPROPERTIES tablePropertyList)?     #setTableSerDe
    | ALTER TABLE tableIdentifier (partitionSpec)?
        SET SERDEPROPERTIES tablePropertyList                          #setTableSerDe
    | ALTER TABLE tableIdentifier ADD (IF NOT EXISTS)?
        partitionSpecLocation+                                         #addTablePartition
    | ALTER VIEW tableIdentifier ADD (IF NOT EXISTS)?
        partitionSpec+                                                 #addTablePartition
    | ALTER TABLE tableIdentifier
        from=partitionSpec RENAME TO to=partitionSpec                  #renameTablePartition
    | ALTER TABLE tableIdentifier ADD (IF NOT EXISTS)?
        partitionSpecLocation+                                         #addTablePartition
    | ALTER TABLE tableIdentifier partitionSpec? SET locationSpec      #setTableLocation
    | ALTER TABLE tableIdentifier RECOVER PARTITIONS                   #recoverPartitions

We need to issue an exception when the tableType is VIEW. This is not introduced by this PR. Should we fix it here? or create a separate PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the description of TempViewManager, could we mention the name of temp view is always case sensitive? The caller is responsible for handling case-related issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea good idea

@gatorsmile
Copy link
Member

gatorsmile commented Sep 7, 2016

The biggest external change is in the table resolution order. Now, three common SessionCatalog functions tableExists, getTableMetadata and lookupRelation are not using the same order. tableExists and getTableMetadata do not check the existence of temp table, but lookupRelation always first checks the temp table. This inconsistency could be risky. That means, we allow users to create a table with the same name like a temp table. When we query the table, we will get the temp view at first.

In addition, we need to carefully review all the places that are using these three functions. For example, the analyzer rule PreWriteCheck is using both lookupRelation and tableExists. We will have a bug after this PR.

@cloud-fan
Copy link
Contributor Author

Found a common bug in the following ALTER TABLE commands:

let's fix them in separated PRs.

In addition, we need to carefully review all the places that are using these three functions. For example, the analyzer rule PreWriteCheck is using both lookupRelation and tableExists. We will have a bug after this PR.

yea, we should review them carefully(I have walked through all of them). PreWriteCheck does have a bug, but not caused by this PR. I'm thinking about if we should also separate lookUpRelation to a metastore version and a temp view version.

@gatorsmile
Copy link
Member

Sorry, I did not explain it clearly. In the analyzer rule PreWriteCheck, the target metastore table exists and thus tableExists returns true, but the following catalog.lookupRelation actually returns a temporary table. Thus, it could report a wrong error or miss some errors.

@cloud-fan
Copy link
Contributor Author

the target metastore table exists and thus tableExists returns true, but the following catalog.lookupRelation actually returns a temporary table.

isn't it true before this PR?

@cloud-fan cloud-fan force-pushed the temp-view branch 2 times, most recently from cec6b3e to 01659dc Compare September 8, 2016 11:04
@SparkQA
Copy link

SparkQA commented Sep 8, 2016

Test build #65100 has finished for PR 14962 at commit cec6b3e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 8, 2016

Test build #65101 has finished for PR 14962 at commit 01659dc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 8, 2016

Test build #65102 has finished for PR 14962 at commit a8ce27f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Found another issue we need to fix.

    val tableName = "tab1"
    withTable(tableName) {
      Seq((1, 2), (3, 4)).toDF("i", "j").createOrReplaceTempView(tableName)
      sql(s"select * from $tableName where i < 2")
        .write.mode(SaveMode.Overwrite).saveAsTable(tableName)
      sql(s"select * from $tableName").show()
    }

Before this PR, the results are correct

+---+---+
|  i|  j|
+---+---+
|  1|  2|
+---+---+

After this PR, the results is incorrect.

+---+---+
|  i|  j|
+---+---+
|  1|  2|
|  3|  4|
+---+---+

@gatorsmile
Copy link
Member

the target metastore table exists and thus tableExists returns true, but the following catalog.lookupRelation actually returns a temporary table.

Before this PR, yeah, it behaves the same. After this PR, we expect catalog.lookupRelation returns a metastore table. However, it returns a temporary table. Thus, we might misse the detection.

@cloud-fan
Copy link
Contributor Author

    val tableName = "tab1"
    withTable(tableName) {
      Seq((1, 2), (3, 4)).toDF("i", "j").createOrReplaceTempView(tableName)
      sql(s"select * from $tableName where i < 2")
        .write.mode(SaveMode.Overwrite).saveAsTable(tableName)
      sql(s"select * from $tableName").show()
    }

@gatorsmile , why it's an issue? In this case, we create a temp view called tab1 with 2 rows first. Then we create a table also called tab1 with 1 row. Finally in select we use the name tab1, which refers to the temp view, it has 2 rows.

@cloud-fan
Copy link
Contributor Author

@gatorsmile It's actually another bug in previous code. When we use DataFrameWriter.saveAsTable with overwrite mode, we may mistakenly drop a same-name temp view. I have updated the PR description to include it.

@SparkQA
Copy link

SparkQA commented Sep 9, 2016

Test build #65153 has finished for PR 14962 at commit e0d2427.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

uh... yeah, that is another existing bug. : ) Let me write a test suite for all the DDL statements when the temp view with the same name exists. We might hit more bugs...

@gatorsmile
Copy link
Member

gatorsmile commented Sep 9, 2016

wow... we are allowed to insert data into temporary views...

Updates: when writing the test cases for insertable temp views, I found a bug in my previous PR. When data sources do not extend SchemaRelationProvider, we are unable to set schema to DataSource. Will try to fix it ASAP. Thanks!

@SparkQA
Copy link

SparkQA commented Sep 10, 2016

Test build #65186 has finished for PR 14962 at commit f421355.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = {
try {
catalog.lookupRelation(u.tableIdentifier, u.alias)
catalog.lookupTempViewOrRelation(u.tableIdentifier, u.alias)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also for view, right? Should we just keep the old name?

@yhuai
Copy link
Contributor

yhuai commented Sep 14, 2016

Is it possible to first have a PR to fix the bugs?

assert(spark.table("default.same_name").collect().isEmpty)
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a regression test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add comments to explain what this test is for in case we accidentally delete it in future.

asfgit pushed a commit that referenced this pull request Sep 18, 2016
…-name temp view exists

## What changes were proposed in this pull request?

In `SessionCatalog`, we have several operations(`tableExists`, `dropTable`, `loopupRelation`, etc) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp view only or metastore table/view only. These bugs are:

1. `CREATE TABLE USING` will fail if a same-name temp view exists
2. `Catalog.dropTempView`will un-cache and drop metastore table if a same-name table exists
3. `saveAsTable` will fail or have unexpected behaviour if a same-name temp view exists.

These bug fixes are pulled out from #14962 and targets both master and 2.0 branch

## How was this patch tested?

new regression tests

Author: Wenchen Fan <[email protected]>

Closes #15099 from cloud-fan/fix-view.
asfgit pushed a commit that referenced this pull request Sep 18, 2016
…-name temp view exists

In `SessionCatalog`, we have several operations(`tableExists`, `dropTable`, `loopupRelation`, etc) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp view only or metastore table/view only. These bugs are:

1. `CREATE TABLE USING` will fail if a same-name temp view exists
2. `Catalog.dropTempView`will un-cache and drop metastore table if a same-name table exists
3. `saveAsTable` will fail or have unexpected behaviour if a same-name temp view exists.

These bug fixes are pulled out from #14962 and targets both master and 2.0 branch

new regression tests

Author: Wenchen Fan <[email protected]>

Closes #15099 from cloud-fan/fix-view.

(cherry picked from commit 3fe630d)
Signed-off-by: Wenchen Fan <[email protected]>
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…-name temp view exists

## What changes were proposed in this pull request?

In `SessionCatalog`, we have several operations(`tableExists`, `dropTable`, `loopupRelation`, etc) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp view only or metastore table/view only. These bugs are:

1. `CREATE TABLE USING` will fail if a same-name temp view exists
2. `Catalog.dropTempView`will un-cache and drop metastore table if a same-name table exists
3. `saveAsTable` will fail or have unexpected behaviour if a same-name temp view exists.

These bug fixes are pulled out from apache#14962 and targets both master and 2.0 branch

## How was this patch tested?

new regression tests

Author: Wenchen Fan <[email protected]>

Closes apache#15099 from cloud-fan/fix-view.
@gatorsmile
Copy link
Member

Maybe update this PR now? The only issue left is the atomicity issues. It sounds like we do not need to backport the fix, since the temporary view is session-specific. Right? We have to be very careful when implementing the global temporary views.

@cloud-fan
Copy link
Contributor Author

I'll update this PR after the global temp view

@cloud-fan
Copy link
Contributor Author

closing it as most of them were fixed in other PRs

@cloud-fan cloud-fan closed this Oct 18, 2016
@cloud-fan cloud-fan deleted the temp-view branch December 14, 2016 12:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants