-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-18217] [SQL] Disallow creating permanent views based on temporary views or UDFs #15764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18217] [SQL] Disallow creating permanent views based on temporary views or UDFs #15764
Conversation
| * Returns whether it is a temporary function. | ||
| */ | ||
| def isTempFunction(name: FunctionIdentifier): Boolean = { | ||
| // copied from HiveSessionCatalog |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd update HiveSessionCatalog to say don't forget to update this place. Otherwise it will be inconsistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. Thanks!
| child: LogicalPlan, | ||
| view: Option[TableIdentifier]) | ||
| view: Option[TableIdentifier])( | ||
| val isGeneratedByTempTable: java.lang.Boolean = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to do it without introducing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just updated the PR description. This might be the cleanest way. The reason is explained below.
When finding an unresolved temporary view, Analyzer replaces it by a
SubqueryAliaswith the corresponding logical plan, which is stored in an in-memory HashMap. After replacement, it is impossible to detect whether the SubqueryAlias is added/generated from a temporary view. Thus, to detect the usage of a temporary view in view definition, we added an extra flagisGeneratedByTempTableinto SubqueryAlias. The flag is added into the curried arguments. Via this extra flag, we can easily detect the usage of temporary view from a logical plan traversal.
Also cc @cloud-fan and @liancheng Do you have any better solution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can do it from the unanalyzed plan, like what this PR did for temp functions.
| case s: UnresolvedRelation | ||
| if sparkSession.sessionState.catalog.isTemporaryTable(s.tableIdentifier) => | ||
| throw new AnalysisException(s"Not allowed to create a permanent view $name by " + | ||
| s"referencing a temp view ${s.tableIdentifier}. " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
temp -> temporary
| case e: UnresolvedFunction | ||
| if sparkSession.sessionState.catalog.isTemporaryFunction(e.name) => | ||
| throw new AnalysisException(s"Not allowed to create a permanent view $name by " + | ||
| s"referencing a temp function `${e.name}`. " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
temp -> temporary
| if sparkSession.sessionState.catalog.isTemporaryTable(s.tableIdentifier) => | ||
| throw new AnalysisException(s"Not allowed to create a permanent view $name by " + | ||
| s"referencing a temp view ${s.tableIdentifier}. " + | ||
| originalText.map(sql => s"""SQL: "$sql".""").getOrElse("")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need to show the entire sql? it's basically what the user just typed in
| if sparkSession.sessionState.catalog.isTemporaryFunction(e.name) => | ||
| throw new AnalysisException(s"Not allowed to create a permanent view $name by " + | ||
| s"referencing a temp function `${e.name}`. " + | ||
| originalText.map(sql => s"""SQL: "$sql".""").getOrElse("")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here do we really need to show the entire sql? it's basically what the user just typed in
| } | ||
|
|
||
| // When creating a permanent view, not allowed to reference temporary objects. | ||
| if (!isTemporary) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also i'd move the check into a function, so it is more obvious what's going on with the main workflow.
| /** | ||
| * Returns whether it is a temporary function. | ||
| */ | ||
| def isTemporaryFunction(name: FunctionIdentifier): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a unit test for this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also what's the behavior if the function doesn't exist? make sure you test it in the unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will resolve this tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like isTemporaryTable, we return false when the function/table does not exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea please docuemnt it.
| // When creating a permanent view, not allowed to reference temporary objects. | ||
| if (!isTemporary) { | ||
| child.collect { | ||
| // Disallow creating permanent views based on temporary views. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd also copy paste what you put in the pr description in here, on why you are traversing the unresolved plan.
|
Test build #68107 has finished for PR 15764 at commit
|
|
Test build #68112 has finished for PR 15764 at commit
|
| } | ||
| } | ||
|
|
||
| test("create a permanent/temp view using a hive function") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while it's good to have separate test cases, i find these three separate cases a bit too verbose. why not just put hive, built-in, and permanent into a single query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do it.
|
Test build #68116 has finished for PR 15764 at commit
|
|
Test build #68143 has finished for PR 15764 at commit
|
|
Looks good but I didn't read super carefully. cc @hvanhovell and @cloud-fan |
|
Test build #68161 has finished for PR 15764 at commit
|
| // Returns false when the function is permanent | ||
| assert(externalCatalog.listFunctions("db2", "*").toSet == Set("func1")) | ||
| assert(!sessionCatalog.isTemporaryFunction(FunctionIdentifier("func1", Some("db2")))) | ||
| assert(!sessionCatalog.isTemporaryFunction(FunctionIdentifier("db2.func1"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not a permanent function right? it's a function called db2.func1 which doesn't exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The related codes in lookupFunction looks confusing.
Here, you are right, functionRegistry does not have such a function. Let me remove it. Thanks!
| // without a database name, and is neither a built-in function nor a Hive function | ||
| name.database.isEmpty && | ||
| functionRegistry.functionExists(name.funcName) && | ||
| !FunctionRegistry.builtin.functionExists(name.funcName) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toLowerCase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our built-in function registry is using SimpleFunctionRegistry, which is based on a case insensitive string key hash map.
Thus, no need to add toLowerCase. However, we need to add a test case to ensure we can capture the potential change in this part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * Permanent views are not allowed to reference temp objects, including temp function and views | ||
| */ | ||
| private def verifyTemporaryObjectsNotExists(sparkSession: SparkSession): Unit = { | ||
| if (!isTemporary) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible the child is resolved? e.g. by DataFrame API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm, we can only create permanent view by SQL API
| name.database.isEmpty && | ||
| functionRegistry.functionExists(name.funcName) && | ||
| !FunctionRegistry.builtin.functionExists(name.funcName) && | ||
| !hiveFunctions.contains(name.funcName.toLowerCase) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought HiveSessionCatalog is used in context of Hive and SessionCatalog has nothing to do with Hive. So if we remove this from here and override isTemporaryFunction in HiveSessionCatalog, this would look clean. Feels like I am missing something obvious here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is true - but we are working towards getting rid of HiveSessionCatalog though (including getting rid of the 3 fallback functions), so in practice this will make no difference soon.
|
Test build #68247 has finished for PR 15764 at commit
|
|
Thanks - merging in master/branch-2.1. |
…ry views or UDFs ### What changes were proposed in this pull request? Based on the discussion in [SPARK-18209](https://issues.apache.org/jira/browse/SPARK-18209). It doesn't really make sense to create permanent views based on temporary views or temporary UDFs. To disallow the supports and issue the exceptions, this PR needs to detect whether a temporary view/UDF is being used when defining a permanent view. Basically, this PR can be split to two sub-tasks: **Task 1:** detecting a temporary view from the query plan of view definition. When finding an unresolved temporary view, Analyzer replaces it by a `SubqueryAlias` with the corresponding logical plan, which is stored in an in-memory HashMap. After replacement, it is impossible to detect whether the `SubqueryAlias` is added/generated from a temporary view. Thus, to detect the usage of a temporary view in view definition, this PR traverses the unresolved logical plan and uses the name of an `UnresolvedRelation` to detect whether it is a (global) temporary view. **Task 2:** detecting a temporary UDF from the query plan of view definition. Detecting usage of a temporary UDF in view definition is not straightfoward. First, in the analyzed plan, we are having different forms to represent the functions. More importantly, some classes (e.g., `HiveGenericUDF`) are not accessible from `CreateViewCommand`, which is part of `sql/core`. Thus, we used the unanalyzed plan `child` of `CreateViewCommand` to detect the usage of a temporary UDF. Because the plan has already been successfully analyzed, we can assume the functions have been defined/registered. Second, in Spark, the functions have four forms: Spark built-in functions, built-in hash functions, permanent UDFs and temporary UDFs. We do not have any direct way to determine whether a function is temporary or not. Thus, we introduced a function `isTemporaryFunction` in `SessionCatalog`. This function contains the detailed logics to determine whether a function is temporary or not. ### How was this patch tested? Added test cases. Author: gatorsmile <[email protected]> Closes #15764 from gatorsmile/blockTempFromPermViewCreation. (cherry picked from commit 1da64e1) Signed-off-by: Reynold Xin <[email protected]>
…detection after implementing it natively ## What changes were proposed in this pull request? In apache#15764 we added a mechanism to detect if a function is temporary or not. Hive functions are treated as non-temporary. Of the three hive functions, now "percentile" has been implemented natively, and "hash" has been removed. So we should update the list. ## How was this patch tested? Unit tests. Author: Shuai Lin <[email protected]> Closes apache#16049 from lins05/update-temp-function-detect-hive-list.
…detection after implementing it natively ## What changes were proposed in this pull request? In #15764 we added a mechanism to detect if a function is temporary or not. Hive functions are treated as non-temporary. Of the three hive functions, now "percentile" has been implemented natively, and "hash" has been removed. So we should update the list. ## How was this patch tested? Unit tests. Author: Shuai Lin <[email protected]> Closes #16049 from lins05/update-temp-function-detect-hive-list. (cherry picked from commit e64a204) Signed-off-by: gatorsmile <[email protected]>
…detection after implementing it natively ## What changes were proposed in this pull request? In apache#15764 we added a mechanism to detect if a function is temporary or not. Hive functions are treated as non-temporary. Of the three hive functions, now "percentile" has been implemented natively, and "hash" has been removed. So we should update the list. ## How was this patch tested? Unit tests. Author: Shuai Lin <[email protected]> Closes apache#16049 from lins05/update-temp-function-detect-hive-list.
…ry views or UDFs ### What changes were proposed in this pull request? Based on the discussion in [SPARK-18209](https://issues.apache.org/jira/browse/SPARK-18209). It doesn't really make sense to create permanent views based on temporary views or temporary UDFs. To disallow the supports and issue the exceptions, this PR needs to detect whether a temporary view/UDF is being used when defining a permanent view. Basically, this PR can be split to two sub-tasks: **Task 1:** detecting a temporary view from the query plan of view definition. When finding an unresolved temporary view, Analyzer replaces it by a `SubqueryAlias` with the corresponding logical plan, which is stored in an in-memory HashMap. After replacement, it is impossible to detect whether the `SubqueryAlias` is added/generated from a temporary view. Thus, to detect the usage of a temporary view in view definition, this PR traverses the unresolved logical plan and uses the name of an `UnresolvedRelation` to detect whether it is a (global) temporary view. **Task 2:** detecting a temporary UDF from the query plan of view definition. Detecting usage of a temporary UDF in view definition is not straightfoward. First, in the analyzed plan, we are having different forms to represent the functions. More importantly, some classes (e.g., `HiveGenericUDF`) are not accessible from `CreateViewCommand`, which is part of `sql/core`. Thus, we used the unanalyzed plan `child` of `CreateViewCommand` to detect the usage of a temporary UDF. Because the plan has already been successfully analyzed, we can assume the functions have been defined/registered. Second, in Spark, the functions have four forms: Spark built-in functions, built-in hash functions, permanent UDFs and temporary UDFs. We do not have any direct way to determine whether a function is temporary or not. Thus, we introduced a function `isTemporaryFunction` in `SessionCatalog`. This function contains the detailed logics to determine whether a function is temporary or not. ### How was this patch tested? Added test cases. Author: gatorsmile <[email protected]> Closes apache#15764 from gatorsmile/blockTempFromPermViewCreation.
…detection after implementing it natively ## What changes were proposed in this pull request? In apache#15764 we added a mechanism to detect if a function is temporary or not. Hive functions are treated as non-temporary. Of the three hive functions, now "percentile" has been implemented natively, and "hash" has been removed. So we should update the list. ## How was this patch tested? Unit tests. Author: Shuai Lin <[email protected]> Closes apache#16049 from lins05/update-temp-function-detect-hive-list.
…ry views or UDFs ### What changes were proposed in this pull request? PR #15764 disabled creating permanent views based on temporary views or UDFs. But AlterViewCommand didn't block temporary objects. ### Why are the changes needed? More robust view canonicalization. ### Does this PR introduce _any_ user-facing change? Yes, now if you alter a permanent view based on temporary views or UDFs, the operation will fail. ### How was this patch tested? Add new unit tests. Closes #33204 from jerqi/alter_view. Authored-by: RoryQi <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…ry views or UDFs ### What changes were proposed in this pull request? PR #15764 disabled creating permanent views based on temporary views or UDFs. But AlterViewCommand didn't block temporary objects. ### Why are the changes needed? More robust view canonicalization. ### Does this PR introduce _any_ user-facing change? Yes, now if you alter a permanent view based on temporary views or UDFs, the operation will fail. ### How was this patch tested? Add new unit tests. Closes #33204 from jerqi/alter_view. Authored-by: RoryQi <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit e0c6b2e) Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Based on the discussion in SPARK-18209. It doesn't really make sense to create permanent views based on temporary views or temporary UDFs.
To disallow the supports and issue the exceptions, this PR needs to detect whether a temporary view/UDF is being used when defining a permanent view. Basically, this PR can be split to two sub-tasks:
Task 1: detecting a temporary view from the query plan of view definition.
When finding an unresolved temporary view, Analyzer replaces it by a
SubqueryAliaswith the corresponding logical plan, which is stored in an in-memory HashMap. After replacement, it is impossible to detect whether theSubqueryAliasis added/generated from a temporary view. Thus, to detect the usage of a temporary view in view definition, this PR traverses the unresolved logical plan and uses the name of anUnresolvedRelationto detect whether it is a (global) temporary view.Task 2: detecting a temporary UDF from the query plan of view definition.
Detecting usage of a temporary UDF in view definition is not straightfoward.
First, in the analyzed plan, we are having different forms to represent the functions. More importantly, some classes (e.g.,
HiveGenericUDF) are not accessible fromCreateViewCommand, which is part ofsql/core. Thus, we used the unanalyzed planchildofCreateViewCommandto detect the usage of a temporary UDF. Because the plan has already been successfully analyzed, we can assume the functions have been defined/registered.Second, in Spark, the functions have four forms: Spark built-in functions, built-in hash functions, permanent UDFs and temporary UDFs. We do not have any direct way to determine whether a function is temporary or not. Thus, we introduced a function
isTemporaryFunctioninSessionCatalog. This function contains the detailed logics to determine whether a function is temporary or not.How was this patch tested?
Added test cases.