@gatorsmile gatorsmile commented May 30, 2017

What changes were proposed in this pull request?

Currently, the unquoted string of a function identifier is used as the key in the function registry. This can cause incorrect behavior when users use `.` in function names. This PR changes the function registry to take the FunctionIdentifier as the identifier instead.

  • Add one new function createOrReplaceTempFunction to FunctionRegistry
final def createOrReplaceTempFunction(name: String, builder: FunctionBuilder): Unit
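The ambiguity being fixed can be sketched like this (a hypothetical illustration in Python, not Spark code; a dict stands in for the registry and a tuple for a FunctionIdentifier):

```python
# Hypothetical sketch: a registry keyed on the raw unquoted string cannot
# tell a function literally named "db1.f" apart from function "f" in
# database "db1" -- the two registrations collide on the same key.
string_registry = {}
string_registry["db1.f"] = "builder for db1.f (qualified name)"
string_registry["db1.f"] = "builder for `db1.f` (name containing a dot)"

# Keying on a structured (database, name) identifier keeps them distinct,
# which is what switching the registry to FunctionIdentifier achieves.
ident_registry = {}
ident_registry[("db1", "f")] = "builder for db1.f (qualified name)"
ident_registry[(None, "db1.f")] = "builder for `db1.f` (name containing a dot)"

assert len(string_registry) == 1  # collided: one entry silently overwrote the other
assert len(ident_registry) == 2   # distinct entries survive
```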

How was this patch tested?

Added extra test cases to verify the included bug fixes.

}

override def listFunction(): Seq[String] = synchronized {
functionBuilders.iterator.map(_._1).toList.sorted
Member Author
@gatorsmile gatorsmile May 30, 2017

This `sorted` is useless, so I removed it.

Contributor

I think sorted output makes it easier for users to search for a function; shall we keep it?

class SimpleFunctionRegistry extends FunctionRegistry {

protected val functionBuilders =
StringKeyHashMap[(ExpressionInfo, FunctionBuilder)](caseSensitive = false)
Member Author
@gatorsmile gatorsmile May 30, 2017

Before this PR, the code had a bug: the database name could be case sensitive.

@SparkQA

SparkQA commented May 30, 2017

Test build #77519 has started for PR 18142 at commit 201787f.

// TODO: just make function registry take in FunctionIdentifier instead of duplicating this
val database = name.database.orElse(Some(currentDb)).map(formatDatabaseName)
val qualifiedName = name.copy(database = database)
functionRegistry.lookupFunction(name.funcName)
Member Author
@gatorsmile gatorsmile May 30, 2017

This also looks like a bug: before this PR, this line ignored the database name.

val loadedFunctions =
StringUtils.filterPattern(functionRegistry.listFunction(), pattern).map { f =>
val loadedFunctions = StringUtils
.filterPattern(functionRegistry.listFunction().map(_.unquotedString), pattern).map { f =>
Member Author

This PR keeps the current behavior. However, I think it is also a bug: the user-specified pattern should not consider the database name.

Contributor

we can fix it as a follow-up
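The concern about filtering over `unquotedString` can be sketched as follows (a hypothetical Python approximation; Spark's `StringUtils.filterPattern` is more elaborate, and `fnmatch` merely stands in for it):

```python
import fnmatch

# Approximation of SHOW FUNCTIONS LIKE 'f*' style filtering. When the
# listed names carry the database prefix (the unquotedString), a pattern
# the user writes for the bare function name no longer matches.
def filter_pattern(names, pattern):
    return [n for n in names if fnmatch.fnmatch(n.lower(), pattern.lower())]

assert filter_pattern(["func_a"], "f*") == ["func_a"]
assert filter_pattern(["db1.func_a"], "f*") == []  # prefix defeats the pattern
assert filter_pattern(["db1.func_a"], "db1.f*") == ["db1.func_a"]
```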

builder: FunctionBuilder): Unit

/* Create or replace a temporary function. */
final def createOrReplaceTempFunction(name: String, builder: FunctionBuilder): Unit = {
Member Author

Since we already expose FunctionRegistry to the stable class UDFRegistration, I added this extra API as a helper function.

Ideally, this function should only exist in SessionCatalog.
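The create-or-replace semantics of the new helper can be sketched like this (hypothetical Python, not the actual Spark implementation; temporary functions carry no database part):

```python
# Hypothetical sketch: registering under an existing name replaces the
# previous builder rather than failing, and lookup is case insensitive.
registry = {}

def create_or_replace_temp_function(name, builder):
    registry[name.lower()] = builder  # temp functions have no database part

create_or_replace_temp_function("strlen", len)
create_or_replace_temp_function("STRLEN", lambda s: len(s) + 1)  # replaces

assert len(registry) == 1
assert registry["strlen"]("ab") == 3  # the replacement builder is in effect
```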

@SparkQA

SparkQA commented May 30, 2017

Test build #77520 has started for PR 18142 at commit 3f253f3.

// in the metastore). We need to first put the function in the FunctionRegistry.
// TODO: why not just check whether the function exists first?
val catalogFunction = try {
externalCatalog.getFunction(currentDb, name.funcName)
Member Author
@gatorsmile gatorsmile May 30, 2017

This is another bug that prevents users from using a function qualified with another database name. For example,

USE default;
SELECT db1.test_avg(1)
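The shape of this bug can be sketched as follows (hypothetical Python; a dict stands in for the external catalog, and the names mirror the example above):

```python
# The external catalog holds a permanent function registered under "db1".
external_catalog = {("db1", "test_avg"): "<catalog function>"}

def get_function_buggy(current_db, identifier):
    # Pre-fix behavior: the database carried by the identifier is ignored
    # and the current database is used instead, so after `USE default`
    # the qualified call db1.test_avg(1) is not found.
    db, func_name = identifier
    return external_catalog.get((current_db, func_name))

def get_function_fixed(current_db, identifier):
    # Post-fix behavior: prefer the database in the identifier, falling
    # back to the current database only when none is given.
    db, func_name = identifier
    return external_catalog.get((db or current_db, func_name))

assert get_function_buggy("default", ("db1", "test_avg")) is None      # the bug
assert get_function_fixed("default", ("db1", "test_avg")) is not None  # the fix
```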

Member Author

Submitted a separate PR #18146 so it can easily be backported to earlier branches.

@SparkQA

SparkQA commented May 30, 2017

Test build #77538 has finished for PR 18142 at commit e8a534a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2017

Test build #77602 has finished for PR 18142 at commit 794de15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private val functionBuilders =
new mutable.HashMap[FunctionIdentifier, (ExpressionInfo, FunctionBuilder)]

// Resolution of the function name is always case insensitive, but the database name
Contributor

This looks weird: the database name is always case sensitive while the function name is always case insensitive?

Member Author

That is the resolution rule we are using now. : (
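The resolution rule under discussion can be sketched as a key-normalization step (hypothetical Python; `normalize` is made up for illustration, not a Spark API):

```python
# Normalize a (database, function_name) key the way the rule above works:
# the function name is lowercased (case insensitive), while the database
# name is left untouched (case sensitive).
def normalize(database, func_name):
    return (database, func_name.lower())

registry = {normalize("db1", "MyFunc"): "<builder>"}

assert normalize("db1", "MYFUNC") in registry      # function name: any case
assert normalize("DB1", "myfunc") not in registry  # database name: exact case
```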

""".stripMargin)

functionRegistry.registerFunction(name, udf.builder)
functionRegistry.createOrReplaceTempFunction(name, udf.builder)
Contributor

Is it the same as before? What if the name contains a database part?

Member Author

Our current function registration APIs in UDFRegistration do not allow users to specify a database name as part of the name, since the registered functions are temporary.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Jun 8, 2017

Test build #77815 has finished for PR 18142 at commit 794de15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 9, 2017

Test build #77825 has finished for PR 18142 at commit 5635c27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in 5716354 Jun 9, 2017
@rxin
Contributor

rxin commented Jun 9, 2017

Guys - please, in the future, separate bug fixes from refactoring. Don't mix a bunch of cosmetic changes in with actual bug fixes.

@HyukjinKwon
Member

I am leaving a note (at least to myself) since this looks like it caused a behaviour change anyhow.

With these statements on the Hive side:

CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address;
CREATE DATABASE d100;
CREATE FUNCTION d100.udf100 AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper';

Hive

hive> SELECT d100.udf100(`emp`.`name`) FROM `emp`;
USER
hive> SELECT `d100.udf100`(`emp`.`name`) FROM `emp`;
USER

Spark

Before:

scala> spark.sql("SELECT d100.udf100(`emp`.`name`) FROM `emp`").show
+-----------------+
|d100.udf100(name)|
+-----------------+
|             USER|
+-----------------+


scala> spark.sql("SELECT `d100.udf100`(`emp`.`name`) FROM `emp`").show
+-----------------+
|d100.udf100(name)|
+-----------------+
|             USER|
+-----------------+

After:

scala> spark.sql("SELECT d100.udf100(`emp`.`name`) FROM `emp`").show
+-----------------+
|d100.udf100(name)|
+-----------------+
|             USER|
+-----------------+


scala> spark.sql("SELECT `d100.udf100`(`emp`.`name`) FROM `emp`").show
org.apache.spark.sql.AnalysisException: Undefined function: 'd100.udf100'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7

MySQL

This change causes an inconsistency with Hive, although it looks consistent compared to MySQL.

mysql> SELECT `d100.udf100`(`emp`.`name`) FROM `emp`;
ERROR 1305 (42000): FUNCTION hkwon.d100.udf100 does not exist
mysql> SELECT d100.udf100(`emp`.`name`) FROM `emp`;
+---------------------------+
| d100.udf100(`emp`.`name`) |
+---------------------------+
| Hello, user!              |
+---------------------------+
1 row in set (0.01 sec)

@cloud-fan
Contributor

@HyukjinKwon Thanks for the note! I think this behavior is better. I'm adding a release_note tag to the JIRA ticket so that we don't forget to mention it in the release notes.

@HyukjinKwon
Member

To be clear, I agree with this change too.

@dongjoon-hyun
Member

BTW, this was changed in Spark 2.3.0. How did we handle this before?

@cloud-fan
Contributor

cloud-fan commented Sep 6, 2018

hmm, then it's too late. Maybe we can add it in Spark 2.3.2 release notes, cc @jerryshao

@jerryshao
Contributor

I see. Thanks for the note.

@HyukjinKwon
Member

@cloud-fan, should we update migration guide as well?

@cloud-fan
Copy link
Contributor

On second thought, isn't it a bug?

hive> SELECT `d100.udf100`(`emp`.`name`) FROM `emp`;
USER

This clearly violates the SQL semantic: the string inside backticks should be treated as a string literal. I think we should update the JIRA ticket to explain this bug, but we don't need to put it in the release notes or migration guide.

I'll update the ticket.

@HyukjinKwon
Member

Yea, that was my impression as well. Let me bring this back once we're clear on whether this is a bug or not.

@HyukjinKwon
Member

This clearly violates the SQL semantic: the string inside backticks should be treated as a string literal.

BTW, I believe there's no particular standard for backticks themselves, since different DBMSs use different backtick implementations.

@HyukjinKwon
Member

One explicit problem here is that we claim Hive compatibility in Spark. The difference should be explained once we are clear on this.

@cloud-fan
Contributor

BTW, I believe there's no particular standard for backticks themselves since different DBMS uses different backtick implementations.

You are right, but the SQL standard does define how to quote identifiers, by using "xyz". Databases usually support one more syntax to quote identifiers, e.g. backticks, square brackets, etc.

So for this case, I think it's obvious that users want to quote the function name, and we have a bug.

One explicit problem here is, we claim Hive compatibility in Spark.

Do we? Sometimes we follow hive behaviors, but we never guarantee that IIRC.

@gatorsmile
Member Author

We do not need to follow Hive if Hive does not follow SQL compliance. Our main goal is to follow the mainstream DBMS vendors.

BTW, we can enhance our parser to recognize other symbols (e.g., double quotes) as quotes instead of forcing users to use backticks. cc @maropu

@HyukjinKwon
Member

I mean https://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features and https://spark.apache.org/docs/latest/sql-programming-guide.html#unsupported-hive-functionality

The issue is related to VIEW support as well. It would be good to note.

To be clear, I don't mean we should necessarily follow Hive's behaviour. Also, I agree with this change.

@cloud-fan
Contributor

Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs.

This is different from saying Spark can run any Hive SQL. Spark can load and use Hive UDFs, with the right SQL syntax.

Anyway, I'm happy to see a PR to note it. It's good to be verbose in the docs, but I wouldn't treat it as a must-have.

@HyukjinKwon
Member

Yea, I didn't mean it super seriously @cloud-fan - I just left a comment in case, for the sake of better documentation, since I see many users move from Hive to Spark.

@maropu
Member

maropu commented Sep 10, 2018

@gatorsmile BTW, have we already filed a JIRA to track all these kinds of SQL-compliance issues? Or not yet?
