@gatorsmile gatorsmile commented May 30, 2017

What changes were proposed in this pull request?

Currently, the unquoted string of a function identifier is used as the key in the function registry. This can cause incorrect behavior when users use `.` in function names. This PR changes the function registry to take the FunctionIdentifier as the identifier instead.

  • Add one new function createOrReplaceTempFunction to FunctionRegistry
final def createOrReplaceTempFunction(name: String, builder: FunctionBuilder): Unit
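The ambiguity being fixed can be sketched like this (a hypothetical illustration in Python, not Spark code; a dict stands in for the registry and a tuple for a FunctionIdentifier):

```python
# Hypothetical sketch: a registry keyed on the raw unquoted string cannot
# tell a function literally named "db1.f" apart from function "f" in
# database "db1" -- the two registrations collide on the same key.
string_registry = {}
string_registry["db1.f"] = "builder for db1.f (qualified name)"
string_registry["db1.f"] = "builder for `db1.f` (name containing a dot)"

# Keying on a structured (database, name) identifier keeps them distinct,
# which is what switching the registry to FunctionIdentifier achieves.
ident_registry = {}
ident_registry[("db1", "f")] = "builder for db1.f (qualified name)"
ident_registry[(None, "db1.f")] = "builder for `db1.f` (name containing a dot)"

assert len(string_registry) == 1  # collided: one entry silently overwrote the other
assert len(ident_registry) == 2   # distinct entries survive
```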

How was this patch tested?

Added extra test cases to verify the included bug fixes.

}

override def listFunction(): Seq[String] = synchronized {
functionBuilders.iterator.map(_._1).toList.sorted
Member Author
@gatorsmile gatorsmile May 30, 2017

This `sorted` is useless, so I removed it.

Contributor

I think sorted output makes it easier for users to search for a function; shall we keep it?

class SimpleFunctionRegistry extends FunctionRegistry {

protected val functionBuilders =
StringKeyHashMap[(ExpressionInfo, FunctionBuilder)](caseSensitive = false)
Member Author
@gatorsmile gatorsmile May 30, 2017

Before this PR, the code had a bug: the database name could be case sensitive.

@SparkQA

SparkQA commented May 30, 2017

Test build #77519 has started for PR 18142 at commit 201787f.

// TODO: just make function registry take in FunctionIdentifier instead of duplicating this
val database = name.database.orElse(Some(currentDb)).map(formatDatabaseName)
val qualifiedName = name.copy(database = database)
functionRegistry.lookupFunction(name.funcName)
Member Author
@gatorsmile gatorsmile May 30, 2017

This also looks like a bug: before this PR, this line ignored the database name.

val loadedFunctions =
StringUtils.filterPattern(functionRegistry.listFunction(), pattern).map { f =>
val loadedFunctions = StringUtils
.filterPattern(functionRegistry.listFunction().map(_.unquotedString), pattern).map { f =>
Member Author

This PR keeps the current behavior. However, I think it is also a bug: the user-specified pattern should not consider the database name.

Contributor

we can fix it as a follow-up
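The concern about filtering over `unquotedString` can be sketched as follows (a hypothetical Python approximation; Spark's `StringUtils.filterPattern` is more elaborate, and `fnmatch` merely stands in for it):

```python
import fnmatch

# Approximation of SHOW FUNCTIONS LIKE 'f*' style filtering. When the
# listed names carry the database prefix (the unquotedString), a pattern
# the user writes for the bare function name no longer matches.
def filter_pattern(names, pattern):
    return [n for n in names if fnmatch.fnmatch(n.lower(), pattern.lower())]

assert filter_pattern(["func_a"], "f*") == ["func_a"]
assert filter_pattern(["db1.func_a"], "f*") == []  # prefix defeats the pattern
assert filter_pattern(["db1.func_a"], "db1.f*") == ["db1.func_a"]
```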

builder: FunctionBuilder): Unit

/* Create or replace a temporary function. */
final def createOrReplaceTempFunction(name: String, builder: FunctionBuilder): Unit = {
Member Author

Since we already expose FunctionRegistry to the stable class UDFRegistration, I added this extra API as a helper function.

Ideally, this function should only exist in SessionCatalog.
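The create-or-replace semantics of the new helper can be sketched like this (hypothetical Python, not the actual Spark implementation; temporary functions carry no database part):

```python
# Hypothetical sketch: registering under an existing name replaces the
# previous builder rather than failing, and lookup is case insensitive.
registry = {}

def create_or_replace_temp_function(name, builder):
    registry[name.lower()] = builder  # temp functions have no database part

create_or_replace_temp_function("strlen", len)
create_or_replace_temp_function("STRLEN", lambda s: len(s) + 1)  # replaces

assert len(registry) == 1
assert registry["strlen"]("ab") == 3  # the replacement builder is in effect
```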

@SparkQA

SparkQA commented May 30, 2017

Test build #77520 has started for PR 18142 at commit 3f253f3.

// in the metastore). We need to first put the function in the FunctionRegistry.
// TODO: why not just check whether the function exists first?
val catalogFunction = try {
externalCatalog.getFunction(currentDb, name.funcName)
Member Author
@gatorsmile gatorsmile May 30, 2017

This is another bug that prevents users from using a function qualified with another database name. For example,

USE default;
SELECT db1.test_avg(1)
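The shape of this bug can be sketched as follows (hypothetical Python; a dict stands in for the external catalog, and the names mirror the example above):

```python
# The external catalog holds a permanent function registered under "db1".
external_catalog = {("db1", "test_avg"): "<catalog function>"}

def get_function_buggy(current_db, identifier):
    # Pre-fix behavior: the database carried by the identifier is ignored
    # and the current database is used instead, so after `USE default`
    # the qualified call db1.test_avg(1) is not found.
    db, func_name = identifier
    return external_catalog.get((current_db, func_name))

def get_function_fixed(current_db, identifier):
    # Post-fix behavior: prefer the database in the identifier, falling
    # back to the current database only when none is given.
    db, func_name = identifier
    return external_catalog.get((db or current_db, func_name))

assert get_function_buggy("default", ("db1", "test_avg")) is None      # the bug
assert get_function_fixed("default", ("db1", "test_avg")) is not None  # the fix
```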

Member Author

Submitted a separate PR #18146 so it can easily be backported to earlier branches.

@SparkQA

SparkQA commented May 30, 2017

Test build #77538 has finished for PR 18142 at commit e8a534a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2017

Test build #77602 has finished for PR 18142 at commit 794de15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private val functionBuilders =
new mutable.HashMap[FunctionIdentifier, (ExpressionInfo, FunctionBuilder)]

// Resolution of the function name is always case insensitive, but the database name
Contributor

This looks weird: the database name is always case sensitive while the function name is always case insensitive?

Member Author

That is the resolution rule we are using now. : (
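The resolution rule under discussion can be sketched as a key-normalization step (hypothetical Python; `normalize` is made up for illustration, not a Spark API):

```python
# Normalize a (database, function_name) key the way the rule above works:
# the function name is lowercased (case insensitive), while the database
# name is left untouched (case sensitive).
def normalize(database, func_name):
    return (database, func_name.lower())

registry = {normalize("db1", "MyFunc"): "<builder>"}

assert normalize("db1", "MYFUNC") in registry      # function name: any case
assert normalize("DB1", "myfunc") not in registry  # database name: exact case
```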

""".stripMargin)

functionRegistry.registerFunction(name, udf.builder)
functionRegistry.createOrReplaceTempFunction(name, udf.builder)
Contributor

Is it the same as before? What if the name contains a database part?

Member Author

Our current function registration APIs in UDFRegistration do not allow users to specify a database name as part of the name, since the registered functions are temporary.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Jun 8, 2017

Test build #77815 has finished for PR 18142 at commit 794de15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 9, 2017

Test build #77825 has finished for PR 18142 at commit 5635c27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in 5716354 Jun 9, 2017
@rxin
Contributor

rxin commented Jun 9, 2017

Guys - please, in the future, separate bug fixes from refactoring. Don't mix a bunch of cosmetic changes in with actual bug fixes.

@HyukjinKwon
Member

I am leaving a note (at least to myself) since this looks like it caused a behaviour change anyhow.

With these statements on the Hive side:

CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address;
CREATE DATABASE d100;
CREATE FUNCTION d100.udf100 AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper';

Hive

hive> SELECT d100.udf100(`emp`.`name`) FROM `emp`;
USER
hive> SELECT `d100.udf100`(`emp`.`name`) FROM `emp`;
USER

Spark

Before:

scala> spark.sql("SELECT d100.udf100(`emp`.`name`) FROM `emp`").show
+-----------------+
|d100.udf100(name)|
+-----------------+
|             USER|
+-----------------+


scala> spark.sql("SELECT `d100.udf100`(`emp`.`name`) FROM `emp`").show
+-----------------+
|d100.udf100(name)|
+-----------------+
|             USER|
+-----------------+

After:

scala> spark.sql("SELECT d100.udf100(`emp`.`name`) FROM `emp`").show
+-----------------+
|d100.udf100(name)|
+-----------------+
|             USER|
+-----------------+


scala> spark.sql("SELECT `d100.udf100`(`emp`.`name`) FROM `emp`").show
org.apache.spark.sql.AnalysisException: Undefined function: 'd100.udf100'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7

MySQL

This change causes an inconsistency with Hive, although it looks consistent compared to MySQL.

mysql> SELECT `d100.udf100`(`emp`.`name`) FROM `emp`;
ERROR 1305 (42000): FUNCTION hkwon.d100.udf100 does not exist
mysql> SELECT d100.udf100(`emp`.`name`) FROM `emp`;
+---------------------------+
| d100.udf100(`emp`.`name`) |
+---------------------------+
| Hello, user!              |
+---------------------------+
1 row in set (0.01 sec)

@cloud-fan
Contributor

@HyukjinKwon Thanks for the note! I think this behavior is better. I'm adding a release_note tag to the JIRA ticket so that we don't forget to mention it in the release notes.

@HyukjinKwon
Member

To be clear, I agree with this change too.

@dongjoon-hyun
Member

BTW, this was changed in Spark 2.3.0. How did we handle this before?

@cloud-fan
Contributor

cloud-fan commented Sep 6, 2018

hmm, then it's too late. Maybe we can add it in Spark 2.3.2 release notes, cc @jerryshao

@jerryshao
Contributor

I see. Thanks for the note.

@HyukjinKwon
Member

@cloud-fan, should we update migration guide as well?

@cloud-fan
Copy link
Contributor

On second thought, isn't it a bug?

hive> SELECT `d100.udf100`(`emp`.`name`) FROM `emp`;
USER

This clearly violates the SQL semantic: the string inside backticks should be treated as a string literal. I think we should update the JIRA ticket to explain this bug, but we don't need to put it in the release notes or migration guide.

I'll update the ticket.

@HyukjinKwon
Member

Yea, that was my impression as well. Let me bring this back once we're clear on whether this is a bug or not.

@HyukjinKwon
Member

This clearly violates the SQL semantic: the string inside backticks should be treated as a string literal.

BTW, I believe there's no particular standard for backticks themselves, since different DBMSs use different backtick implementations.

@HyukjinKwon
Member

One explicit problem here is that we claim Hive compatibility in Spark. The difference should be explained once we are clear on this.

@cloud-fan
Contributor

BTW, I believe there's no particular standard for backticks themselves since different DBMS uses different backtick implementations.

You are right, but the SQL standard does define how to quote identifiers, by using "xyz". Databases usually support one more syntax to quote identifiers, e.g. backticks, square brackets, etc.

So for this case, I think it's obvious that users want to quote the function name, and we have a bug.

One explicit problem here is, we claim Hive compatibility in Spark.

Do we? Sometimes we follow hive behaviors, but we never guarantee that IIRC.

@gatorsmile
Member Author

We do not need to follow Hive if Hive does not follow SQL compliance. Our main goal is to follow the mainstream DBMS vendors.

BTW, we can enhance our parser to recognize other symbols (e.g., double quotes) as quotes instead of forcing users to use backticks. cc @maropu

@HyukjinKwon
Member

I mean https://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features and https://spark.apache.org/docs/latest/sql-programming-guide.html#unsupported-hive-functionality

The issue is related to VIEW support as well. It would be good to note.

To be clear, I don't mean we should necessarily follow Hive's behaviour. Also, I agree with this change.

@cloud-fan
Contributor

Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs.

This is different from saying Spark can run any Hive SQL. Spark can load and use Hive UDFs, with the right SQL syntax.

Anyway, I'm happy to see a PR to note it. It's good to be verbose in the docs, but I wouldn't treat it as a must-have.

@HyukjinKwon
Member

Yea, I didn't mean it super seriously @cloud-fan - I just left a comment in case, for the sake of better documentation, since I see many users move from Hive to Spark.

@maropu
Member

maropu commented Sep 10, 2018

@gatorsmile BTW, have we already filed a JIRA to track all these kinds of SQL-compliance issues? Or not yet?
