[SPARK-20586] [SQL] Add deterministic to ScalaUDF #17848
Conversation
Even our own test cases do not follow this assumption. We did not expect users to define non-deterministic UDFs before this PR.
It sounds like the Scala compiler is not smart enough...
This sounds like a source-code compatibility issue; can we look into it?
Test build #76428 has finished for PR 17848 at commit
Disabling optimizations aside, to what extent can we actually support non-deterministic functions? Right now a common user mistake is to run an RNG inside a UDF.
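To illustrate the recomputation hazard this comment raises, here is a plain-Scala sketch (no Spark; `RecomputeSketch` and `statefulUdf` are made-up names). Hidden state stands in for an RNG call inside a user function: if the engine re-evaluates the plan (cache eviction, stage retry, a second action on the same DataFrame), each pass observes different values.

```scala
// Plain-Scala sketch (no Spark): a "UDF" with hidden state, standing in for
// an RNG call inside a user function. Re-running the same "plan" over the
// same data produces different results, which is how users get tripped up.
object RecomputeSketch {
  private var calls = 0                        // hidden state, like an RNG advancing
  def statefulUdf(x: Int): Int = { calls += 1; x + calls }

  // One evaluation of the "plan" over a partition of data.
  def run(data: Seq[Int]): Seq[Int] = data.map(statefulUdf)
}
```

Calling `run` twice on identical input yields different outputs, mirroring what happens when Spark recomputes a partition containing an RNG-backed UDF that was not flagged as non-deterministic.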
Since this breaks binary compatibility (I know you are familiar with the point), is the target of this PR Spark 3.0.0?
Hive UDFs ignore distinctLike, and AFAIK there are no optimisation rules for distinctLike, so do we need this param now?
It is being used by Hive optimizer.
I basically think these parameters are useful for users, but do we always need to set deterministic and distinctLike when registering UDFs? ISTM this is a little annoying for users, so we'd better use default values for these parameters?
After discussing with others, we decided to use the approach you proposed. The two related PRs have been merged. Will make the change in this PR too. Thanks!
@zero323 Which caches? Could you give an example?
Sorry for the late update. Taking care of two kids alone is really a challenging task. Will update the PR now.
My concern is that people trying non-deterministic UDFs get tripped up by repeated computations at least as often as by internal optimizations. In particular, let's say we have this fan-out / fan-in workflow depending on a non-deterministic UDF, where dotted edges represent an arbitrary chain of transformations. Can we ensure that the state of each
Not sure whether it answers your concern?
}
Assert.assertEquals(55, sum);
Assert.assertTrue("EXPLAIN outputs are expected to contain the UDF name.",
  spark.sql("EXPLAIN SELECT inc(1) AS f").collectAsList().toString().contains("inc"));
This is to fix the name-loss issue for Java UDFs in the EXPLAIN command.
Test build #76915 has finished for PR 17848 at commit
Test build #76918 has finished for PR 17848 at commit
Test build #76919 has finished for PR 17848 at commit
Test build #76922 has finished for PR 17848 at commit
val inputTypes = Try($inputTypes).toOption
def builder(e: Seq[Expression]) = ScalaUDF(func, dataType, e, inputTypes.getOrElse(Nil), Some(name), nullable)
functionRegistry.registerFunction(name, builder)
UserDefinedFunction(func, dataType, inputTypes).withName(name).withNullability(nullable)
Can't we directly call register(name, func, deterministic = true, distinctLike = false) here?
It would break Java applications that call our Scala APIs relying on default arguments.
I'm probably missing your point, but I was suggesting the code below:
/**
 * Registers a Scala closure of 0 arguments as user-defined function (UDF).
 * @tparam RT return type of UDF.
 * @since 1.3.0
 */
def register[RT: TypeTag](name: String, func: Function0[RT]): UserDefinedFunction = {
  register(name, func, deterministic = true, distinctLike = false)
}

/**
 * Registers a Scala closure of 0 arguments as user-defined function (UDF).
 * @tparam RT return type of UDF.
 * @since 2.3.0
 */
def register[RT: TypeTag](name: String, func: Function0[RT], deterministic: Boolean, distinctLike: Boolean): UserDefinedFunction = {
  val ScalaReflection.Schema(dataType, nullable) = ScalaReflection.schemaFor[RT]
  val inputTypes = Try(Nil).toOption
  def builder(e: Seq[Expression]) = ScalaUDF(func, dataType, e, inputTypes.getOrElse(Nil), Some(name), nullable, deterministic, distinctLike)
  functionRegistry.registerFunction(name, builder)
  val udf = UserDefinedFunction(func, dataType, inputTypes).withName(name).withNullability(nullable)
  val withDeterminism = if (!deterministic) udf.nonDeterministic() else udf
  val withDistinctLike = if (distinctLike) withDeterminism.withDistinctLike() else withDeterminism
  withDistinctLike
}
this._nameOption = Option(name)
this
val udf = copyAll()
udf._nameOption = Option(name)
cc @maropu
I know your intention here (you probably mean we should not mutate values even in var fields), but is it okay that the code below allocates up to four objects in the worst case? I'm a bit worried about this point;
val udf = UserDefinedFunction(func, dataType, inputTypes).withName(name).withNullability(nullable)
val withDeterminism = if (!deterministic) udf.nonDeterministic() else udf
val withDistinctLike = if (distinctLike) withDeterminism.withDistinctLike() else withDeterminism
@maropu We should make a copy when calling withName, instead of returning this object.
yea, I know. I just meant we could add an interface newInstance(name, nullable, determinism) there.
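The copy-vs-mutate trade-off debated in this thread can be sketched in plain Scala. `SketchUDF` below is a made-up stand-in for UserDefinedFunction, not Spark's actual implementation:

```scala
// Illustrative sketch of the copy-on-modify pattern discussed above.
// SketchUDF is a hypothetical stand-in for UserDefinedFunction.
case class SketchUDF(
    f: AnyRef,
    name: Option[String] = None,
    nullable: Boolean = true,
    deterministic: Boolean = true) {

  def withName(n: String): SketchUDF = copy(name = Some(n)) // a copy, never mutates this
  def asNondeterministic(): SketchUDF = copy(deterministic = false)
  def asNonNullable(): SketchUDF = copy(nullable = false)

  // The single-copy alternative suggested above (a newInstance-style method):
  // one allocation instead of one per chained call.
  def newInstance(n: String, nullable: Boolean, deterministic: Boolean): SketchUDF =
    copy(name = Some(n), nullable = nullable, deterministic = deterministic)
}
```

Returning a copy keeps a shared UDF value from being renamed behind the caller's back; the worst-case extra allocations come from chaining `withName`, `withNullability`, and the determinism setters, which a `newInstance`-style method collapses into one copy.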
Test build #76986 has finished for PR 17848 at commit
Test build #79130 has finished for PR 17848 at commit
Test build #79228 has finished for PR 17848 at commit
Test build #79594 has finished for PR 17848 at commit
registerUDF(name, func, deterministic = false)
}

private def registerUDF[$typeTags](name: String, func: Function$x[$types], deterministic: Boolean): UserDefinedFunction = {
can we make this public?
My only concern is that we would have many public functions with different names doing similar things.
 */
def nullable: Boolean = _nullable

/**
Returns true iff the UDF is deterministic, i.e. the UDF produces the same output given the same input.
Test build #79776 has finished for PR 17848 at commit
Test build #79779 has finished for PR 17848 at commit
Test build #79780 has finished for PR 17848 at commit
 */
def withNullability(nullable: Boolean): UserDefinedFunction = {
def asNonNullable(): UserDefinedFunction = {
if (nullable == _nullable) {
nit: if (!nullable)
 * API (i.e. of type UserDefinedFunction).
 * Registers a user-defined function (UDF), for a UDF that's already defined using the Dataset
 * API (i.e. of type UserDefinedFunction). To change a UDF to nondeterministic, call the API
 * `UserDefinedFunction.asNondeterministic()`.
let's also mention how to turn the UDF to be non-nullable.
A good example would be:
val foo = udf(() => { "hello" })
spark.udf.register("stringConstant", foo.asNonNullable())
Although the return type of the UDF is String, which is nullable, we know that this UDF will never return null.
Sure. Will do
 * Registers a user-defined function with ${i} arguments.
 * @since 2.3.0
 */
def register(name: String, f: UDF$i[$extTypeArgs], returnType: DataType, deterministic: Boolean): Unit = {
Do we need this? I think for Java UDFs we can also build a UserDefinedFunction first and call asNondeterministic or asNonNullable.
BTW, although we don't have def udf(f: UDF1[_, _]): UserDefinedFunction APIs, we do have def udf(f: AnyRef, dataType: DataType): UserDefinedFunction, which can be used for Java UDFs.
So far, the impl of def udf(f: AnyRef, dataType: DataType) does not support Java UDFs.
Since we have to add new APIs anyway, why not add a bunch of def udf(f: UDF1[_, _]): UserDefinedFunction instead of a bunch of def register(name: String, f: UDF$i[$extTypeArgs], returnType: DataType, deterministic: Boolean)?
 * @param deterministic True if the UDF is deterministic. Deterministic UDF returns same result
 *                      each time it is invoked with a particular input.
 */
private[sql] def registerJava(
do we need a new method? it's private.
This is for PySpark.
uh... We can feel free to make the change on the interface. : )
Actually, our Java API can use it directly. To Java APIs, they are not private at all.
Then let's remove the private[sql] and add a since tag.
BTW, can we use default parameters? Will it break Java compatibility?
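On the default-parameter question: Scala defaults are compiled into synthetic `name$default$N` methods that javac knows nothing about, so Java source still has to pass every argument explicitly. A small sketch (class and method names below are made up for illustration) makes the mechanism visible via reflection:

```scala
// Sketch: why Scala default parameters don't carry over to Java callers.
// scalac emits a synthetic register$default$N method per defaulted parameter;
// Java sources cannot write register("inc") and must supply all arguments.
class RegistrySketch {
  def register(name: String,
               deterministic: Boolean = true,
               distinctLike: Boolean = false): String =
    s"$name(deterministic=$deterministic, distinctLike=$distinctLike)"
}

object RegistrySketch {
  // The synthetic default-getters scalac generated, visible to reflection.
  def syntheticDefaults(): Seq[String] =
    classOf[RegistrySketch].getMethods.toSeq
      .map(_.getName)
      .filter(_.contains("$default$"))
      .sorted
}
```

This is presumably why the thread leans toward explicit overloads (or chaining asNondeterministic() on the returned UserDefinedFunction) rather than default arguments in the register APIs.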
/**
 * Defines a user-defined function (UDF) using a Scala closure. For this variant, the caller must
 * specify the output data type, and there is no automatic input type coercion.
 * Defines a deterministic user-defined function (UDF) using a Scala closure. For this variant,
Not only Scala closures; I think Java UDF classes are also supported here.
Unfortunately nope, although we accept AnyRef.
java.lang.ClassCastException: java.lang.Class cannot be cast to scala.Function1
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.<init>(ScalaUDF.scala:92)
  at org.apache.spark.sql.expressions.UserDefinedFunction.apply(UserDefinedFunction.scala:70)
  at org.apache.spark.sql.UDFRegistration.org$apache$spark$sql$UDFRegistration$$builder$2(UDFRegistration.scala:99)
damn...
Test build #79810 has started for PR 17848 at commit
Let's leave the Java UDF API unchanged and think about whether we should add a Java UDF API in
LGTM BTW.
Thanks! @cloud-fan
Test build #79935 has finished for PR 17848 at commit
Test build #79936 has finished for PR 17848 at commit
Test build #79942 has finished for PR 17848 at commit
Thanks! Merging to master.

What changes were proposed in this pull request?
Like Hive UDFType, we should allow users to add extra flags for ScalaUDF and JavaUDF too. stateful/impliesOrder are not applicable to our Scala UDFs. Thus, we only add the following two flags.
When the deterministic flag is not correctly set, the results could be wrong.
For ScalaUDF in Dataset APIs, users can call the following extra APIs for UserDefinedFunction to make the corresponding changes.
nonDeterministic: Updates UserDefinedFunction to non-deterministic.
Also fixed the Java UDF name loss issue.
Will submit a separate PR for distinctLike for UDAF.
How was this patch tested?
Added test cases for both ScalaUDF