Skip to content

Conversation

@gatorsmile
Copy link
Member

@gatorsmile gatorsmile commented May 3, 2017

What changes were proposed in this pull request?

Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF and JavaUDF too. stateful/impliesOrder are not applicable to our Scala UDF. Thus, we only add the following two flags.

  • deterministic: Certain optimizations should not be applied if UDF is not deterministic. Deterministic UDF returns same result each time it is invoked with a particular input. This determinism just needs to hold within the context of a query.

When the deterministic flag is not correctly set, the results could be wrong.

For ScalaUDF in Dataset APIs, users can call the following extra APIs for UserDefinedFunction to make the corresponding changes.

  • nonDeterministic: Updates UserDefinedFunction to non-deterministic.

Also fixed the Java UDF name loss issue.

Will submit a separate PR for distinctLike for UDAF

How was this patch tested?

Added test cases for both ScalaUDF

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even our test case does not follow our assumption. We do not expect users to define non-deterministic UDF before this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like Scala compiler is not smart enough...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sounds like a source code compatibility issue, can we look into it?

@SparkQA
Copy link

SparkQA commented May 3, 2017

Test build #76428 has finished for PR 17848 at commit 88fde5f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Copy link
Member

zero323 commented May 4, 2017

Disabling optimizations aside, to what extent can we actually support nondeterministic functions? Right now a common user mistake is to run RNG inside an UDF. nonDeterministiccould suggest it is fine, but I don't think we can guarantee this without reliable cache, can we?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this breaks bin. compatibility (I know you are familiar with the point), the target of this pr is spark-v3.0.0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hive UDFs ignores distinctLike and AFAIK there is no optimisation rules for distinctLike though, do we need this param now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is being used by Hive optimizer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically think these parameters are useful for users though, do we always need to set deterministic and distinctLike when registering UDFs? ISTM this is a little annoying for users, so we'd better to use default parameters for these parameters?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion with others, we decide to use the way you proposed. The related two PRs have been merged. Will make a change in this PR too. Thanks!

@gatorsmile
Copy link
Member Author

@zero323 Which caches? Could you give an example?

@gatorsmile
Copy link
Member Author

Sorry for a late update. Taking care of two kids alone is really a challenging task. Will update the PR now.

@zero323
Copy link
Member

zero323 commented May 14, 2017

My concern is that people trying non-deterministic UDFs get tripped by repeated computations at least as often as by internal optimizations, and nonDeterministic flag might send a wrong message.

In particular let's say we have this fan-out - fan-in worfklow depending on a non-deterministic x:

image

where dotted edges represent an arbitrary chain of transformations. Can we ensure that the state of each foodescendant in sink will be consistent (x hasn't been recomputed, including cases of non-fatal failures)? I hope my point here is clear.

@gatorsmile
Copy link
Member Author

@zero323

  • When x is non-deterministic, all the expressions that are derived from x (i.e., y_i, z_i, v_i) will be non-deterministic.
  • When x is first materialized and computed, that means, the generated columns are deterministic. Thus, the results will be consistent.

Not sure whether it answers your concern?

}
Assert.assertEquals(55, sum);
Assert.assertTrue("EXPLAIN outputs are expected to contain the UDF name.",
spark.sql("EXPLAIN SELECT inc(1) AS f").collectAsList().toString().contains("inc"));
Copy link
Member Author

@gatorsmile gatorsmile May 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to fix the issue of name loss for JavaUDF in the explain command.

@SparkQA
Copy link

SparkQA commented May 15, 2017

Test build #76915 has finished for PR 17848 at commit 00b4dff.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • throw new IOException(s\"UDF class $className doesn't implement any UDF interface\")
  • throw new IOException(s\"It is invalid to implement multiple UDF interfaces, UDF class $className\")
  • case n => logError(s\"UDF class with $n type arguments is not supported \")
  • logError(s\"Can not instantiate class $className, please make sure it has public non argument constructor\")
  • case e: ClassNotFoundException => logError(s\"Can not load class $className, please make sure it is on the classpath\")

@SparkQA
Copy link

SparkQA commented May 15, 2017

Test build #76918 has finished for PR 17848 at commit c496b62.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 15, 2017

Test build #76919 has finished for PR 17848 at commit 387af4b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 15, 2017

Test build #76922 has finished for PR 17848 at commit d276b44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val inputTypes = Try($inputTypes).toOption
def builder(e: Seq[Expression]) = ScalaUDF(func, dataType, e, inputTypes.getOrElse(Nil), Some(name), nullable)
functionRegistry.registerFunction(name, builder)
UserDefinedFunction(func, dataType, inputTypes).withName(name).withNullability(nullable)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't directly call register(name, func, deterministic = true, distinctLike = false) here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will break the JAVA applications that call our Scala APIs with default arguments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably miss your point though, I suggested code below;

  /**
   * Registers a Scala closure of 0 arguments as user-defined function (UDF).
   * @tparam RT return type of UDF.
   * @since 1.3.0
   */
  def register[RT: TypeTag](name: String, func: Function0[RT]): UserDefinedFunction = {
    register(name, func, deterministic = true, distinctLike = false)
  }

  /**
   * Registers a Scala closure of 0 arguments as user-defined function (UDF).
   * @tparam RT return type of UDF.
   * @since 2.3.0
   */
  def register[RT: TypeTag](name: String, func: Function0[RT], deterministic: Boolean, distinctLike: Boolean): UserDefinedFunction = {
    val ScalaReflection.Schema(dataType, nullable) = ScalaReflection.schemaFor[RT]
    val inputTypes = Try(Nil).toOption
    def builder(e: Seq[Expression]) = ScalaUDF(func, dataType, e, inputTypes.getOrElse(Nil), Some(name), nullable, deterministic, distinctLike)
    functionRegistry.registerFunction(name, builder)
    val udf = UserDefinedFunction(func, dataType, inputTypes).withName(name).withNullability(nullable)
    val withDeterminism = if (!deterministic) udf.nonDeterministic() else udf
    val withDistinctLike = if (distinctLike) withDeterminism.withDistinctLike() else withDeterminism
    withDistinctLike
  }

this._nameOption = Option(name)
this
val udf = copyAll()
udf._nameOption = Option(name)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @maropu

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know your intention here (you probably mean we should not update values even in var variables) though, is it okay that the code below has four times object allocation in the worst case? I'm a bit worried about this point;

val udf = UserDefinedFunction(func, dataType, inputTypes).withName(name).withNullability(nullable)
val withDeterminism = if (!deterministic) udf.nonDeterministic() else udf
val withDistinctLike = if (distinctLike) withDeterminism.withDistinctLike() else withDeterminism

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu We should make a copy when calling withName, instead of returning this object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, I know. I just meant we added an interface newInstance(name, nullable, determinism) there.

@SparkQA
Copy link

SparkQA commented May 16, 2017

Test build #76986 has finished for PR 17848 at commit f738e9c.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 4, 2017

Test build #79130 has finished for PR 17848 at commit 0aa6475.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 5, 2017

Test build #79228 has finished for PR 17848 at commit 0c65322.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile gatorsmile changed the title [SPARK-20586] [SQL] Add deterministic and distinctLike to ScalaUDF and JavaUDF [WIP] [SPARK-20586] [SQL] Add deterministic and distinctLike to ScalaUDF and JavaUDF Jul 13, 2017
@SparkQA
Copy link

SparkQA commented Jul 14, 2017

Test build #79594 has finished for PR 17848 at commit eb9a7fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

cc @cloud-fan @sameeragarwal

registerUDF(name, func, deterministic = false)
}

private def registerUDF[$typeTags](name: String, func: Function$x[$types], deterministic: Boolean): UserDefinedFunction = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make this public?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is we have many public functions with different names that are doing the similar things.

*/
def nullable: Boolean = _nullable

/**
Copy link
Contributor

@rxin rxin Jul 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns true iff the UDF is deterministic, i.e. the UDF produces the same output given the same input.

@SparkQA
Copy link

SparkQA commented Jul 19, 2017

Test build #79776 has finished for PR 17848 at commit d0a9086.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 19, 2017

Test build #79779 has finished for PR 17848 at commit 43bb9a9.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 20, 2017

Test build #79780 has finished for PR 17848 at commit 0ea4691.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
def withNullability(nullable: Boolean): UserDefinedFunction = {
def asNonNullabe(): UserDefinedFunction = {
if (nullable == _nullable) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if (!nullable)

* API (i.e. of type UserDefinedFunction).
* Registers a user-defined function (UDF), for a UDF that's already defined using the Dataset
* API (i.e. of type UserDefinedFunction). To change a UDF to nondeterministic, call the API
* `UserDefinedFunction.asNondeterministic()`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also mention how to turn the UDF to be non-nullable.

Copy link
Contributor

@cloud-fan cloud-fan Jul 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a good example will be

val foo = udf(() => { "hello" })
spark.udf.register("stringConstant", foo.asNonNullable())

Although the return type of the UDF is String and is nullable, but we know that this UDF will never return null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will do

| * Registers a user-defined function with ${i} arguments.
| * @since 2.3.0
| */
|def register(name: String, f: UDF$i[$extTypeArgs], returnType: DataType, deterministic: Boolean): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this? I think for java UDF we can also build a UserDefiendFunction first and call asNondeterminstic or asNonNullable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW although we don't have def udf(f: UDF1[_, _]): UserDefinedFunction APIs, we do have def udf(f: AnyRef, dataType: DataType): UserDefinedFunction which can be used for java udf.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far, the impl of def udf(f: AnyRef, dataType: DataType) does not support Java UDF

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we have to add new APIs, why not we add a bunch of def udf(f: UDF1[_, _]): UserDefinedFunction instead of a bunch of def register(name: String, f: UDF$i[$extTypeArgs], returnType: DataType, deterministic: Boolean)?

* @param deterministic True if the UDF is deterministic. Deterministic UDF returns same result
* each time it is invoked with a particular input.
*/
private[sql] def registerJava(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a new method? it's private.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for PySpark.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uh... We can feel free to make the change on the interface. : )

Copy link
Member Author

@gatorsmile gatorsmile Jul 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, our JAVA API can directly use it. To JAVA APIs, they are not private at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then let's remove the private[sql] and add since tag

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW can we use default parameter? will it break java compatibility?

/**
* Defines a user-defined function (UDF) using a Scala closure. For this variant, the caller must
* specify the output data type, and there is no automatic input type coercion.
* Defines a deterministic user-defined function (UDF) using a Scala closure. For this variant,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not only scala closure, I think java UDF class is also supported here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately nope, although we accept AnyRef.

java.lang.ClassCastException: java.lang.Class cannot be cast to scala.Function1

	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.<init>(ScalaUDF.scala:92)
	at org.apache.spark.sql.expressions.UserDefinedFunction.apply(UserDefinedFunction.scala:70)
	at org.apache.spark.sql.UDFRegistration.org$apache$spark$sql$UDFRegistration$$builder$2(UDFRegistration.scala:99)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

damn...

@SparkQA
Copy link

SparkQA commented Jul 20, 2017

Test build #79810 has started for PR 17848 at commit 8422c42.

@cloud-fan
Copy link
Contributor

let's leave the java UDF API unchanged and think about whether we should add java UDF API in functions later. @gatorsmile can you update the PR title? Thanks!

@cloud-fan
Copy link
Contributor

LGTM BTW.

@gatorsmile gatorsmile changed the title [SPARK-20586] [SQL] Add deterministic to ScalaUDF and JavaUDF [SPARK-20586] [SQL] Add deterministic to ScalaUDF Jul 25, 2017
@gatorsmile
Copy link
Member Author

Thanks! @cloud-fan

@SparkQA
Copy link

SparkQA commented Jul 25, 2017

Test build #79935 has finished for PR 17848 at commit 1b3aa22.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 25, 2017

Test build #79936 has finished for PR 17848 at commit bf060d6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 25, 2017

Test build #79942 has finished for PR 17848 at commit a54010a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

Thanks! Merging to master.

@asfgit asfgit closed this in ebc24a9 Jul 26, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants