[SPARK-30799][SQL] "spark_catalog.t" should not be resolved to temp view #27550
Conversation
This is dead code: `val v1TableName = parseV1Table(tbl, sql).asTableIdentifier` fails earlier if the length is greater than 2.
This check should be done earlier; otherwise it makes no sense to check whether the database names are specified consistently.
If you look at the test, `t` refers to `spark_catalog.default.t` while `testcat.t` refers to `testcat.t`. The namespaces are different, so the error messages also differ slightly.
Force-pushed c358e5a to 9d59db4.
Hmm, this adds the current namespace to the returned identifier? So we cannot tell whether a returned identifier is for a temp view?
I think `ResolveTempViews` needs to be updated to look up twice (once with the namespace and once without).
If you look at `SessionCatalog.setCurrentDatabase`, we forbid setting `global_temp` as the current database.
This keeps temp view resolution simple: we look up a temp view via the name parts given by the user, literally. No name extension is allowed.
The general rule of relation resolution is: we try to look up a temp view first, then tables/permanent views. When we call `CatalogAndIdentifier`, the temp view lookup should already have been tried, or the caller side doesn't want to resolve temp views.
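The lookup order described here can be sketched as a toy model. This is illustrative Python, not Spark's actual code; the names `temp_views`, `tables`, and `resolve` are hypothetical, and the registered names are made up:

```python
# Toy model of the resolution rule: temp views are matched literally by the
# user-given name parts (no namespace expansion); only if that fails do we
# expand a bare name with the current catalog/namespace and look up tables.
temp_views = {("v",)}  # a temp view registered under the single name "v"
tables = {("spark_catalog", "default", "t"), ("testcat", "ns", "t2")}

def resolve(name_parts, current_catalog="spark_catalog", current_db="default"):
    # 1. Temp view lookup: literal match only, no name extension.
    if tuple(name_parts) in temp_views:
        return ("temp_view", tuple(name_parts))
    # 2. Catalog lookup: a bare table name is expanded with the current
    #    catalog and namespace before matching.
    if len(name_parts) == 1:
        name_parts = [current_catalog, current_db] + list(name_parts)
    if tuple(name_parts) in tables:
        return ("table", tuple(name_parts))
    return None

print(resolve(["v"]))                   # temp view wins, matched literally
print(resolve(["t"]))                   # expanded to spark_catalog.default.t
print(resolve(["spark_catalog", "v"]))  # no literal temp-view match, no table
```

The last call illustrates the point of this PR: a `spark_catalog.`-qualified name never matches a temp view, because temp views are matched by the literal name parts only.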
> When we call `CatalogAndIdentifier`, the temp view lookup should already have been tried, or the caller side doesn't want to resolve temp views.

So, should `ResolveCatalogs` be applied after `ResolveTables` and `ResolveRelations`?
These two are different frameworks, so I don't see a way to guarantee the order. I think what we should do is migrate to the new framework.
Should we mention that this happens for the SHOW COLUMNS command?
Test build #118318 has finished for PR 27550.
Force-pushed ff0388e to d4df542.
I've updated the PR description as I found a real bug.
This is the consequence of removing the hack. Now `ResolvedTable` can get a qualified table name even from the session catalog.
Test build #118360 has finished for PR 27550.

Test build #118361 has finished for PR 27550.
```scala
if (nameParts.length == 1) {
  // If there is only one name part, it means the current catalog is the session catalog.
  // Here we return the original name part, to keep the error message unchanged for
  // v1 commands.
```
I can remove this, but then I'd need to slightly update many tests for the new error message. We can leave it to 3.1.
```diff
    DescribeColumnCommand(tbl.asTableIdentifier, colNameParts, isExtended)
  }.getOrElse {
-   if (isView(tbl)) {
+   if (isTempView(tbl)) {
```
Since `tbl` comes from `SessionCatalogAndTable`, `tbl` may contain the current namespace? And if the current namespace is not empty, this will always return false?
You have a good point here. I think this is an existing bug: here we look up the table first (call `loadTable`) and then look up the temp view. We should look up the temp view first to retain the behavior of Spark 2.4. @imback82 can you help fix it later?
Yes, will do after this PR.
```scala
case _ if isTempView(nameParts) => Some(nameParts)
case SessionCatalogAndTable(_, tbl) =>
  if (nameParts.head == CatalogManager.SESSION_CATALOG_NAME && tbl.length == 1) {
    // For name parts like `spark_catalog.t`, we need to fill in the default database so
```
Unrelated question: what was the reason to allow `spark_catalog.t` to mean `spark_catalog.<cur_db>.t`? (This is different behavior from v2 catalogs.)
I noticed this as well and think it is not intentional. But this is a different topic, and we have many tests using `spark_catalog.t`.
We can open another PR to forbid it if we think we shouldn't support this feature.
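The default-database fill-in being discussed can be sketched as follows. This is a hypothetical toy in Python, not Spark's code; `qualify` and `SESSION_CATALOG_NAME` merely mirror the idea in the quoted snippet:

```python
# Toy model: a two-part name whose head is the session catalog gets the
# current database filled in, so `spark_catalog.t` behaves like
# `spark_catalog.<current_db>.t`. Names in v2 catalogs are left untouched.
SESSION_CATALOG_NAME = "spark_catalog"

def qualify(name_parts, current_db="default"):
    if name_parts[0] == SESSION_CATALOG_NAME and len(name_parts) == 2:
        return [SESSION_CATALOG_NAME, current_db, name_parts[1]]
    return list(name_parts)

print(qualify(["spark_catalog", "t"]))   # default database filled in
print(qualify(["testcat", "t"]))         # v2 catalog name left as-is
```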
Got it, thanks for the explanation. It makes more sense to forbid it to keep the behavior consistent.
imback82 left a comment:
+1, LGTM
```scala
if (isTemp) {
  // temp func doesn't belong to any catalog and we shouldn't resolve catalog in the name.
  val database = if (nameParts.length > 2) {
    throw new AnalysisException(s"${nameParts.quoted} is not a valid function name.")
```
`... is not a valid function name` sounds confusing? Maybe we add `temporary`?
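The name check in the quoted snippet can be sketched like this. It is a minimal Python toy, with a hypothetical `parse_temp_function` helper; the error wording follows the suggestion above (adding "temporary") rather than the code as it stands:

```python
# Toy model of the temp-function name check: a temporary function may carry
# at most a database qualifier, so three or more name parts are rejected.
def parse_temp_function(name_parts):
    if len(name_parts) > 2:
        raise ValueError(
            f"{'.'.join(name_parts)} is not a valid temporary function name.")
    # The database is the optional qualifier; None for a bare name.
    database = name_parts[0] if len(name_parts) == 2 else None
    return database, name_parts[-1]

print(parse_temp_function(["f"]))        # bare name, no database
print(parse_temp_function(["db", "f"]))  # database-qualified name
```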
```scala
 * precedence over catalog name.
 *
 * Note that, this pattern is used to look up catalog objects like table, function, permanent
 * view, etc. If you need to look up temp views, please do it separately before calling this
```
Temp functions are also not supposed to be looked up here, right?
```scala
nameParts match {
  case SessionCatalogAndTable(_, funcName) => funcName match {
```
nit: we use `SessionCatalogAndTable` to extract a function name, which sounds a bit confusing.
viirya left a comment:
Looks good with a few minor comments.
```scala
if (isTemp) {
  // temp func doesn't belong to any catalog and we shouldn't resolve catalog in the name.
  val database = if (nameParts.length > 2) {
    throw new AnalysisException(s"Unsupported function name '${nameParts.quoted}'")
```
Test build #118422 has finished for PR 27550.
retest this please
Test build #118454 has finished for PR 27550 at commit
|
|
Thanks for the review, merging to master/3.0!
### What changes were proposed in this pull request?

No v2 command supports temp views, and the `ResolveCatalogs`/`ResolveSessionCatalog` framework is designed with this assumption. However, `ResolveSessionCatalog` needs to fall back to v1 commands, which do support temp views (e.g. CACHE TABLE). To work around it, we added a hack in `CatalogAndIdentifier`, which does not expand the given identifier with the current namespace if the catalog is the session catalog.

This works fine in most cases, as temp views should take precedence over tables during lookup. So if `CatalogAndIdentifier` returns a single name "t", the v1 commands can still resolve it to a temp view correctly, or resolve it to the table "default.t" if the temp view doesn't exist.

However, if users write `spark_catalog.t`, it shouldn't be resolved to a temp view, as temp views don't belong to any catalog. `CatalogAndIdentifier` can't distinguish between `spark_catalog.t` and `t`, so the caller side may mistakenly resolve `spark_catalog.t` to a temp view.

This PR proposes to fix this issue by:
1. removing the hack in `CatalogAndIdentifier`, and clearly documenting that it shouldn't be used to resolve temp views;
2. updating `ResolveSessionCatalog` to explicitly look up temp views first before calling `CatalogAndIdentifier`, for v1 commands that support temp views.

### Why are the changes needed?

To avoid releasing a behavior that we should not support. Removing the hack also fixes the problem we hit in https://github.com/apache/spark/pull/27532/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R937

### Does this PR introduce any user-facing change?

Yes, it is no longer allowed to refer to a temp view with the `spark_catalog` prefix.

### How was this patch tested?

New tests.

Closes #27550 from cloud-fan/ns.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ab07c63)
Signed-off-by: Wenchen Fan <[email protected]>
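The ambiguity described above can be made concrete with a toy model. This is not Spark's implementation; `catalog_and_identifier_with_hack` is a hypothetical stand-in for the old behavior of `CatalogAndIdentifier`:

```python
# Toy model of the removed hack: a name qualified with the session catalog
# has its prefix stripped, and a bare name is kept as-is. Crucially, neither
# branch expands the identifier with the current namespace, so the two
# spellings collapse to the same result.
SESSION_CATALOG = "spark_catalog"

def catalog_and_identifier_with_hack(name_parts):
    if name_parts[0] == SESSION_CATALOG and len(name_parts) > 1:
        return SESSION_CATALOG, tuple(name_parts[1:])
    return SESSION_CATALOG, tuple(name_parts)

# Both calls return the same (catalog, identifier) pair, so a caller that
# treats a single-part identifier as "may be a temp view" will wrongly
# accept `spark_catalog.t` as a temp view reference.
print(catalog_and_identifier_with_hack(["spark_catalog", "t"]))
print(catalog_and_identifier_with_hack(["t"]))
```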
```scala
 *
 * Note that, this pattern is used to look up permanent catalog objects like table, view,
 * function, etc. If you need to look up temp objects like temp view, please do it separately
 * before calling this pattern, as temp objects don't belong to any catalog.
```
The term `Catalog` is used in many places. For example, our internal `SessionCatalog` also manages temp objects.
> our internal SessionCatalog also manages temp objects

We can move them out and put them in a temp view manager like the `GlobalTempViewManager`. I'm talking more about the theory.