Contributor

@cloud-fan cloud-fan commented Feb 12, 2020

What changes were proposed in this pull request?

No v2 command supports temp views, and the ResolveCatalogs/ResolveSessionCatalog framework is designed with this assumption.

However, ResolveSessionCatalog needs to fall back to v1 commands, which do support temp views (e.g. CACHE TABLE). To work around this, we added a hack in CatalogAndIdentifier that does not expand the given identifier with the current namespace if the catalog is the session catalog.

This works fine in most cases, as temp views should take precedence over tables during lookup. So if CatalogAndIdentifier returns a single name "t", the v1 commands can still resolve it to a temp view correctly, or resolve it to the table "default.t" if the temp view doesn't exist.

However, if users write spark_catalog.t, it shouldn't be resolved to temp views as temp views don't belong to any catalog. CatalogAndIdentifier can't distinguish between spark_catalog.t and t, so the caller side may mistakenly resolve spark_catalog.t to a temp view.

This PR proposes to fix this issue by:

  1. removing the hack in CatalogAndIdentifier, and clearly documenting that it shouldn't be used to resolve temp views.
  2. updating ResolveSessionCatalog to explicitly look up temp views first before calling CatalogAndIdentifier, for v1 commands that support temp views.
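
The lookup order this establishes can be sketched as follows. This is a simplified, illustrative model, not actual Spark code: `ResolutionSketch`, its hard-coded `default` database, and the string results are assumptions for the sketch; the real logic lives in `ResolveSessionCatalog` and `CatalogAndIdentifier`.

```scala
// Simplified sketch of the resolution order: temp views are tried first
// with the user-given name parts taken literally; a name qualified with
// the session catalog can never be a temp view.
object ResolutionSketch {
  // Hypothetical registered temp view named "t".
  val tempViews: Set[Seq[String]] = Set(Seq("t"))
  val sessionCatalog = "spark_catalog"

  // Step 1: literal temp view lookup, no name expansion.
  def isTempView(nameParts: Seq[String]): Boolean = tempViews.contains(nameParts)

  // Step 2: only if that fails, resolve catalog + identifier, filling in
  // the default database for bare table names.
  def resolve(nameParts: Seq[String]): String = {
    if (isTempView(nameParts)) {
      s"temp view ${nameParts.mkString(".")}"
    } else {
      val parts = if (nameParts.head == sessionCatalog) nameParts.tail else nameParts
      val qualified = if (parts.length == 1) "default" +: parts else parts
      s"table $sessionCatalog.${qualified.mkString(".")}"
    }
  }
}
```

In this model, `resolve(Seq("t"))` hits the temp view, while `resolve(Seq("spark_catalog", "t"))` skips the temp view lookup entirely and resolves to the table, which is the behavior the PR wants.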

Why are the changes needed?

To avoid releasing a behavior that we should not support.

Removing the hack also fixes the problem we hit in https://github.com/apache/spark/pull/27532/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R937

Does this PR introduce any user-facing change?

Yes. It is no longer allowed to refer to a temp view with the spark_catalog prefix.

How was this patch tested?

New tests.

Contributor Author

@cloud-fan cloud-fan Feb 12, 2020

This is dead code: `val v1TableName = parseV1Table(tbl, sql).asTableIdentifier` fails earlier if the length is > 2.

Contributor Author

This check should be done earlier; otherwise it makes no sense to check whether the database names are specified consistently.

Contributor Author

If you look at the test, `t` refers to `spark_catalog.default.t` while `testcat.t` refers to `testcat.t`. The namespaces are different, so the error messages are also slightly different.

Member

Hmm, this adds the current namespace to the returned identifier? So we cannot tell whether a returned identifier is for a temp view?

Contributor

I think ResolveTempViews needs to be updated to look up twice (once with the namespace and once without).

Contributor Author

If you look at SessionCatalog.setCurrentDatabase, we forbid setting global_temp as the current database.

This is to make the resolution of temp view simple: we look up temp view via the name parts given by users literally. No name extension is allowed.

The general rule of relation resolution is: we try to look up temp views first, then tables/permanent views. When we call CatalogAndIdentifier, the temp view lookup should already have been tried, or the caller side doesn't want to resolve temp views.
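
The "literal lookup, no name extension" rule can be illustrated with a small sketch. `TempViewRegistry` and its members are made-up names for this example, not Spark's actual SessionCatalog API:

```scala
// Illustrative sketch: temp view lookup uses the user-given name parts
// literally, while table lookup expands bare names with the current database.
case class TempViewRegistry(views: Set[Seq[String]], currentDb: String) {
  // No expansion with currentDb: "v" and "default.v" are different names.
  def lookupTempView(nameParts: Seq[String]): Boolean = views.contains(nameParts)

  // Table names, by contrast, get the current database filled in.
  def qualifyTable(nameParts: Seq[String]): Seq[String] =
    if (nameParts.length == 1) currentDb +: nameParts else nameParts
}
```

With `TempViewRegistry(Set(Seq("v")), "default")`, `lookupTempView(Seq("v"))` succeeds but `lookupTempView(Seq("default", "v"))` does not, whereas `qualifyTable(Seq("t"))` becomes `Seq("default", "t")`.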

Contributor

When we call CatalogAndIdentifier, the temp view lookup should already have been tried, or the caller side doesn't want to resolve temp views.

So, ResolveCatalogs should be applied after ResolveTables and ResolveRelations?

Contributor Author

These are two different frameworks, so I don't see a way to guarantee the order. I think what we should do is migrate to the new framework.

Comment on lines 427 to 433
Member

Should we mention that this happens for the SHOW COLUMNS command?


SparkQA commented Feb 12, 2020

Test build #118318 has finished for PR 27550 at commit 9d59db4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan force-pushed the ns branch 2 times, most recently from ff0388e to d4df542 on February 13, 2020 09:28
@cloud-fan
Contributor Author

I've updated the PR description as I found a real bug.

@apache apache deleted a comment from SparkQA Feb 13, 2020
@apache apache deleted a comment from SparkQA Feb 13, 2020
@apache apache deleted a comment from SparkQA Feb 13, 2020
@apache apache deleted a comment from SparkQA Feb 13, 2020
@apache apache deleted a comment from SparkQA Feb 13, 2020
Contributor Author

This is the consequence of removing the hack. Now ResolvedTable can get a qualified table name even from the session catalog.


SparkQA commented Feb 13, 2020

Test build #118360 has finished for PR 27550 at commit a9aa831.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 13, 2020

Test build #118361 has finished for PR 27550 at commit 53db6f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan changed the title [SPARK-30799][SQL] CatalogAndIdentifier shouldn't return wrong namespace [SPARK-30799][SQL] "spark_catalog.t" should not be resolved to temp view Feb 14, 2020
if (nameParts.length == 1) {
// If there is only one name part, it means the current catalog is the session catalog.
// Here we return the original name part, to keep the error message unchanged for
// v1 commands.
Contributor Author

I can remove this, but then I need to update many tests slightly to update the error message. We can leave it to 3.1.

DescribeColumnCommand(tbl.asTableIdentifier, colNameParts, isExtended)
}.getOrElse {
if (isView(tbl)) {
if (isTempView(tbl)) {
Contributor

Since tbl comes from SessionCatalogAndTable, tbl may include the current namespace? And if the current namespace is not empty, this will always return false?

Contributor Author

You have a good point here. I think this is an existing bug. Here we look up the table first (calling loadTable) and then look up the temp view. We should look up the temp view first to retain the behavior of Spark 2.4. @imback82 can you help fix it later?

Contributor

Yes, will do after this PR.

case _ if isTempView(nameParts) => Some(nameParts)
case SessionCatalogAndTable(_, tbl) =>
if (nameParts.head == CatalogManager.SESSION_CATALOG_NAME && tbl.length == 1) {
// For name parts like `spark_catalog.t`, we need to fill in the default database so
Contributor

Unrelated question: what was the reason for allowing spark_catalog.t to be resolved as spark_catalog.<cur_db>.t? (This is different behavior from v2 catalogs.)

Contributor Author

I noticed this as well and think it is not intentional. But that is a different topic, and we have many tests using spark_catalog.t.

We can open another PR to forbid it if we think we shouldn't support this feature.
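
For illustration, the fill-in being discussed behaves roughly like this. This is a hedged sketch; `fillDefaultDatabase` and the hard-coded `default` database are assumptions for the example, not Spark's API:

```scala
// Sketch of the default-database fill-in for session-catalog names:
// a name with a single remaining part after the catalog gets the
// default database inserted; longer names are left alone.
def fillDefaultDatabase(nameParts: Seq[String],
                        defaultDb: String = "default"): Seq[String] =
  nameParts match {
    case Seq("spark_catalog", table) => Seq("spark_catalog", defaultDb, table)
    case other => other
  }
```

So `spark_catalog.t` becomes `spark_catalog.default.t`, while `spark_catalog.db.t` is unchanged.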

Contributor

Got it, thanks for the explanation. It makes more sense to forbid it to keep the behavior consistent.

Contributor

@imback82 imback82 left a comment

+1, LGTM

if (isTemp) {
// temp func doesn't belong to any catalog and we shouldn't resolve catalog in the name.
val database = if (nameParts.length > 2) {
throw new AnalysisException(s"${nameParts.quoted} is not a valid function name.")
Member

"... is not a valid function name" sounds confusing. Maybe we should add "temporary"?

* precedence over catalog name.
*
* Note that, this pattern is used to look up catalog objects like table, function, permanent
* view, etc. If you need to look up temp views, please do it separately before calling this
Member

Temp functions are also not supposed to be looked up here, right?

}

nameParts match {
case SessionCatalogAndTable(_, funcName) => funcName match {
Member

nit: we use SessionCatalogAndTable to extract a function name, which sounds a bit confusing.

Member

@viirya viirya left a comment

Looks good with a few minor comments.

if (isTemp) {
// temp func doesn't belong to any catalog and we shouldn't resolve catalog in the name.
val database = if (nameParts.length > 2) {
throw new AnalysisException(s"Unsupported function name '${nameParts.quoted}'")


SparkQA commented Feb 14, 2020

Test build #118422 has finished for PR 27550 at commit f4a3d08.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imback82
Contributor

retest this please


SparkQA commented Feb 15, 2020

Test build #118454 has finished for PR 27550 at commit f4a3d08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks for the review, merging to master/3.0!

@cloud-fan cloud-fan closed this in ab07c63 Feb 17, 2020
cloud-fan added a commit that referenced this pull request Feb 17, 2020

Closes #27550 from cloud-fan/ns.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ab07c63)
Signed-off-by: Wenchen Fan <[email protected]>
*
* Note that, this pattern is used to look up permanent catalog objects like table, view,
* function, etc. If you need to look up temp objects like temp view, please do it separately
* before calling this pattern, as temp objects don't belong to any catalog.
Member

The term Catalog is being used in many places. For example, our internal SessionCatalog also manages temp objects.

Contributor Author

our internal SessionCatalog also manages temp objects

We can move them out and put them in a temp view manager like the GlobalTempViewManager. I'm talking more about the theory here.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020

Closes apache#27550 from cloud-fan/ns.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>