@vinodkc
Contributor

@vinodkc vinodkc commented Sep 1, 2018

What changes were proposed in this pull request?

When a Hive view uses a UDF from a non-default database, the Spark analyzer throws an AnalysisException.

Steps to reproduce this issue

Step 1: Run the following statements in Hive

CREATE TABLE emp AS SELECT 'user' AS name, 'address' AS address;
CREATE DATABASE d100;
CREATE FUNCTION d100.udf100 AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; -- Note: udf100 is created in d100
CREATE VIEW d100.v100 AS SELECT d100.udf100(name) FROM default.emp;
SELECT * FROM d100.v100; -- query on view d100.v100 gives the correct result

Step 2: Run the following statement in spark-shell

spark.sql("SELECT * FROM d100.v100").show

throws

org.apache.spark.sql.AnalysisException: Undefined function: 'd100.udf100'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'

This happens because, while parsing the SQL text of the view,
'select `d100.udf100`(`emp`.`name`) from `default`.`emp`', the Spark parser does not split the database name from the UDF name, so the Spark function registry tries to load the UDF 'd100.udf100' from the 'default' database.

To solve this issue, before creating the 'FunctionIdentifier', try to determine the actual database name, and then create the FunctionIdentifier from that database name and the function name.
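The idea above can be sketched roughly as follows. This is only an illustration, not the actual patch: `FunctionIdentifier` here is a local stand-in for Spark's class so the snippet runs without Spark, and `FunctionNameResolution`, `current` and `proposed` are hypothetical names. The inner fallback cases are an assumption; only the `fn.split('.')` step appears in the diff discussed in this PR.

```scala
// Local stand-in for Spark's FunctionIdentifier, so the sketch runs without Spark.
final case class FunctionIdentifier(funcName: String, database: Option[String] = None)

object FunctionNameResolution {
  // Behaviour described above: every parsed identifier part is taken as-is,
  // so a single part "d100.udf100" becomes a function name with no database.
  def current(parts: Seq[String]): FunctionIdentifier = parts match {
    case Seq(db, fn) => FunctionIdentifier(fn, Some(db))
    case Seq(fn)     => FunctionIdentifier(fn, None)
    case other =>
      throw new IllegalArgumentException(s"Unsupported function name: ${other.mkString(".")}")
  }

  // Proposed behaviour (assumed shape): a single part containing exactly one dot
  // is split into a database name and a function name.
  def proposed(parts: Seq[String]): FunctionIdentifier = parts match {
    case Seq(single) =>
      single.split('.').toSeq match {
        case Seq(db, fn) => FunctionIdentifier(fn, Some(db))
        case _           => FunctionIdentifier(single, None)
      }
    case other => current(other)
  }
}
```

With the single part `d100.udf100`, `current` yields a function named `d100.udf100` with no database, while `proposed` yields function `udf100` in database `d100`. Note that this split would also change the meaning of any function whose name legitimately contains a dot.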

How was this patch tested?

Added 1 unit test

val functionClass =
classOf[org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper].getCanonicalName
sql(s"CREATE FUNCTION $db.$functionNameUpper AS '$functionClass'")
val ds = sql(s"SELECT `$db.$functionNameUpper`(`$table`.`c1`) FROM `$db`.`$table`")
Member

I think this is the expected behaviour of backquotes; ANTLR parses inputs like this:

  • testdb.f1 => funcName: f1 dbName: testdb
  • `testdb.f1` => funcName: testdb.f1 dbName: default

@SparkQA

SparkQA commented Sep 1, 2018

Test build #95573 has finished for PR 22307 at commit 60cc1c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@vinodkc, do you have the JAR for /usr/udf/masking.jar? Want to reproduce and check.

@HyukjinKwon
Member

The problem here looks like an inconsistency between Hive and Spark. Since Spark claims Hive compatibility, it looks like we should either explain the difference or fix it.

@vinodkc
Contributor Author

vinodkc commented Sep 3, 2018

@HyukjinKwon, we can reproduce this issue even with just
create function d100.udf100 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper';
I've updated the PR description.

@HyukjinKwon
Member

HyukjinKwon commented Sep 4, 2018

The problem here seems to be:

org.apache.hadoop.hive.ql.metadata.Table.getViewExpandedText()

is used to build the view text, which is then run again by the Spark SQL parser. The Hive API above returns:

SELECT `d100.udf100`(`emp`.`name`) FROM `default`.`emp`

The root cause is that `d100.udf100` in the text above is recognised as a single identifier on the Spark side. So Spark looks for a function called d100.udf100 in the default database, whereas Hive looks for the function udf100 under d100.

If Hive's behaviour is correct, we need a strong justification to change Spark's behaviour, or we should document the differences. If not, Hive should fix this.
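To make the single-identifier point concrete, here is a minimal sketch of backquote-aware name splitting. It is not Spark's ANTLR grammar; `QuotedNameSplitter` and `split` are hypothetical names, and the logic only assumes that a dot inside backquotes is part of the identifier rather than a separator.

```scala
object QuotedNameSplitter {
  // Splits a qualified name on dots that fall outside backquotes
  // and drops the backquotes themselves.
  def split(name: String): Seq[String] = {
    val parts = scala.collection.mutable.ArrayBuffer.empty[String]
    val current = new StringBuilder
    var quoted = false
    name.foreach {
      case '`' =>
        quoted = !quoted            // toggle quoting; the backquote is not kept
      case '.' if !quoted =>
        parts += current.toString   // an unquoted dot ends the current part
        current.clear()
      case c =>
        current.append(c)           // any other character, including a quoted dot
    }
    parts += current.toString
    parts.toSeq
  }
}
```

Under this sketch, the backquoted `d100.udf100` from the expanded view text comes back as one part, while the unquoted form d100.udf100 and the fully quoted form `d100`.`udf100` both come back as two parts, which is the database-plus-function reading Hive appears to intend.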

ctx.identifier().asScala.map(_.getText) match {
case Seq(db, fn) => FunctionIdentifier(fn, Option(db))
case Seq(fn) => FunctionIdentifier(fn, None)
case Seq(fn) => fn.split('.').toSeq match {
Member

At the very least, this breaks users' apps ... even if this is the correct behaviour in Hive, we should target this for 3.0.0.

@HyukjinKwon
Member

@vinod, see the discussion in #18142. Shall we close this? cc @cloud-fan as well.

@vinodkc
Contributor Author

vinodkc commented Sep 6, 2018

@HyukjinKwon , I'll close this PR

@vinodkc vinodkc closed this Sep 6, 2018
@vinodkc vinodkc deleted the br_fix_view_with_udf_issue branch May 25, 2021 07:58