-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-20918] [SQL] Use FunctionIdentifier as function identifiers in FunctionRegistry #18142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| override def listFunction(): Seq[String] = synchronized { | ||
| functionBuilders.iterator.map(_._1).toList.sorted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sorted is useless. Thus, I removed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think sorted output can make users easy to search for a function, shall we still keep it?
| class SimpleFunctionRegistry extends FunctionRegistry { | ||
|
|
||
| protected val functionBuilders = | ||
| StringKeyHashMap[(ExpressionInfo, FunctionBuilder)](caseSensitive = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this PR, the codes has a bug. The database name could be case sensitive.
|
Test build #77519 has started for PR 18142 at commit |
| // TODO: just make function registry take in FunctionIdentifier instead of duplicating this | ||
| val database = name.database.orElse(Some(currentDb)).map(formatDatabaseName) | ||
| val qualifiedName = name.copy(database = database) | ||
| functionRegistry.lookupFunction(name.funcName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also sounds a bug. This line before this PR ignores the database name.
| val loadedFunctions = | ||
| StringUtils.filterPattern(functionRegistry.listFunction(), pattern).map { f => | ||
| val loadedFunctions = StringUtils | ||
| .filterPattern(functionRegistry.listFunction().map(_.unquotedString), pattern).map { f => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR keeps the current behavior. However, I think it is also a bug. The user-specified pattern should not consider the database name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can fix it as a follow-up
| builder: FunctionBuilder): Unit | ||
|
|
||
| /* Create or replace a temporary function. */ | ||
| final def createOrReplaceTempFunction(name: String, builder: FunctionBuilder): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we already expose FunctionRegistry to the stable class UDFRegistration, I added this extra API for a helper function.
Ideally, this function should only exist in SessionCatalog.
|
Test build #77520 has started for PR 18142 at commit |
| // in the metastore). We need to first put the function in the FunctionRegistry. | ||
| // TODO: why not just check whether the function exists first? | ||
| val catalogFunction = try { | ||
| externalCatalog.getFunction(currentDb, name.funcName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another bug that blocks users to use the function qualified for the other database. For example,
USE default;
SELECT db1.test_avg(1)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Submitted a separate PR #18146 for easily backporting to the previous branch.
|
Test build #77538 has finished for PR 18142 at commit
|
|
Test build #77602 has finished for PR 18142 at commit
|
| private val functionBuilders = | ||
| new mutable.HashMap[FunctionIdentifier, (ExpressionInfo, FunctionBuilder)] | ||
|
|
||
| // Resolution of the function name is always case insensitive, but the database name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks weird, database name is always case sensitive and function name is always case insenstive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the resolution rule we are using now. : (
| """.stripMargin) | ||
|
|
||
| functionRegistry.registerFunction(name, udf.builder) | ||
| functionRegistry.createOrReplaceTempFunction(name, udf.builder) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it same as before? what if the name contains database part?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our current register function APIs in UDFRegistration does not allow users to specify the database names as part of name, since the registered functions are temporary.
|
retest this please |
|
Test build #77815 has finished for PR 18142 at commit
|
7358cce to
5635c27
Compare
|
Test build #77825 has finished for PR 18142 at commit
|
|
LGTM, merging to master! |
|
Guys - please in the future separate bug fixes with refactoring. Don't mix a bunch of cosmetic changes with actual bug fixes together. |
|
I am leaving a note (at least to myself) since it looked anyhow caused behaviour change. With this statements in Hive side: CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address;
CREATE DATABASE d100;
CREATE FUNCTION d100.udf100 AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper';Hive Spark Before: After: MySQL This change causes a inconsistency with Hive although looks consistent compating to MySQL. |
|
@HyukjinKwon Thanks for the note! I think this behavior is better, I'm adding a |
|
I agree with this change too for clarification. |
|
BTW, this is changed at Spark 2.3.0. How did we handle this before? |
|
hmm, then it's too late. Maybe we can add it in Spark 2.3.2 release notes, cc @jerryshao |
|
I see. Thanks for the note. |
|
@cloud-fan, should we update migration guide as well? |
|
After a second thought, isn't it a bug? This clearly violates the SQL semantic: the string inside backticks should be treated as a string literal. I think we should update the JIRA ticket to explain this bug, but we don't need to put it in a release notes or migration guide. I'll update the ticket. |
|
Yea that was my impression as well. Let me bring this back when we're clear if this is a bug or not. |
BTW, I believe there's no particular standard for backticks themselves since different DBMS uses different backtick implementations. |
|
One explicit problem here is, we claim Hive compatibility in Spark. The difference should be explained when we are clear on this. |
You are right, but SQL standard does define how to quote identifiers, by using So for this case, I think it's obvious that users want to quote the function name, and we have a bug.
Do we? Sometimes we follow hive behaviors, but we never guarantee that IIRC. |
|
We do not need to follow Hive if Hive does not follow SQL compliance. Our main goal is to follow the mainstream DBMS vendors. BTW, we can enhance our parser to recognize the other symbols (e.g., double quotes) as the quotes instead of forcing users to choose backtick. cc @maropu |
|
I mean https://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features and https://spark.apache.org/docs/latest/sql-programming-guide.html#unsupported-hive-functionality The issue is also related with VIEW support as well. Should be good to note. I don't mean we should necessarily follow Hive's behaviour for clarification .. Also, I agree with this change .. |
This is different from Anyway I'm happy to see a PR to note it. It's good to be verbose in the doc. But I won't treat it a must-have. |
|
Yea, I didn't mean it super seriously @cloud-fan - I just left a comment in case for a better documentation since I see many users go from Hive to Spark. |
|
@gatorsmile btw, we've already file a jira to track these kinds of all the SQL-compiliant issues? not yet? |
What changes were proposed in this pull request?
Currently, the unquoted string of a function identifier is being used as the function identifier in the function registry. This could cause the incorrect the behavior when users use
.in the function names. This PR is to take theFunctionIdentifieras the identifier in the function registry.createOrReplaceTempFunctiontoFunctionRegistryHow was this patch tested?
Add extra test cases to verify the inclusive bug fixes.