[WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().apply() with pandas udf #19505

ueshin · 2017-10-16T07:08:18Z

What changes were proposed in this pull request?

This is a follow-up of #18732.
This PR introduces a @pandas_grouped_udf decorator for grouped vectorized UDFs instead of reusing the @pandas_udf decorator.

How was this patch tested?

Existing tests.

cloud-fan · 2017-10-16T07:12:20Z

python/pyspark/sql/functions.py

    .. versionadded:: 1.3
    """
-    def __init__(self, func, returnType, name=None, vectorized=False):
+    def __init__(self, func, returnType, name=None, vectorized=False, grouped=False):


vectorized=False, grouped=True is an invalid combination. How about we introduce a pythonUdfType, where 0 means normal udf, 1 means pandas udf, and 2 means pandas grouped udf? We can create something like object PythonEvalType to keep this encoding in sync between Python and Java.

Sounds good. I'll modify it.
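
The encoding suggested above could be sketched as a small constants class. The values 0, 1, and 2 come from the comment, and the member names follow the PythonUdfType names used later in this review. The helper `is_vectorized` is a hypothetical addition for illustration and is not taken from the PR.

```python
class PythonUdfType(object):
    """Sketch of a UDF type encoding shared by the Python and JVM sides."""
    # Row-based UDF.
    NORMAL_UDF = 0
    # Scalar vectorized UDF.
    PANDAS_UDF = 1
    # Grouped vectorized UDF.
    PANDAS_GROUPED_UDF = 2


def is_vectorized(python_udf_type):
    """Hypothetical helper: True for both pandas-based UDF types."""
    return python_udf_type in (PythonUdfType.PANDAS_UDF,
                               PythonUdfType.PANDAS_GROUPED_UDF)
```

With a single type value, the invalid combination vectorized=False, grouped=True can no longer be expressed.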

cloud-fan · 2017-10-16T07:14:29Z

python/pyspark/sql/functions.py

+
+
+@since(2.3)
+def pandas_grouped_udf(f=None, returnType=StructType()):


How about returnTypes without a default value? pandas_grouped_udf always returns a DataFrame, and we should just ask users to give the data type of each column.

Inside this method we can create a StructType with returnTypes and pass it to _create_udf.

The fields of the return type are used as the output of the plan. I guess the field names are also useful for users.

Ah, I see. Makes sense.

SparkQA · 2017-10-16T07:21:54Z

Test build #82790 has finished for PR 19505 at commit 4d2bd95.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-16T11:44:39Z

Test build #82793 has finished for PR 19505 at commit f096870.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class PythonUdfType(object):

viirya · 2017-10-16T13:01:14Z

python/pyspark/sql/functions.py

            if len(argspec.args) == 0 and argspec.varargs is None:
                raise ValueError(
                    "0-arg pandas_udfs are not supported. "
                    "Instead, create a 1-arg pandas_udf and ignore the arg in your function."


Maybe also update this error message, like "0-arg pandas_udfs/pandas_grouped_udfs are not supported. ..."

Thanks! I'll update the message.

Hmm, for pandas_grouped_udfs, should the number of args be exactly 1?

I think so. If it doesn't become too complicated, maybe we can also check it for pandas_grouped_udf.

Thanks, let me try.
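
A minimal sketch of the argument-count check discussed here, assuming it is done with `inspect.getfullargspec`. The helper name `check_udf_arity`, its `grouped` flag, and the second error message are hypothetical; only the 0-arg message wording comes from this thread.

```python
import inspect


def check_udf_arity(f, grouped=False):
    """Hypothetical helper: validate the argument count of a pandas udf function."""
    argspec = inspect.getfullargspec(f)
    n_args = len(argspec.args)
    # Same condition as the snippet under review: no named args and no *args.
    if n_args == 0 and argspec.varargs is None:
        raise ValueError(
            "0-arg pandas_udfs/pandas_grouped_udfs are not supported. "
            "Instead, create a 1-arg pandas_udf and ignore the arg in your function.")
    # Assumed rule for grouped udfs: exactly one argument (the group's DataFrame).
    if grouped and (n_args != 1 or argspec.varargs is not None):
        raise ValueError("A pandas_grouped_udf must take exactly one argument.")
    return n_args
```

For example, `check_udf_arity(lambda pdf: pdf, grouped=True)` passes, while a two-argument function with `grouped=True` raises `ValueError`.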

cloud-fan · 2017-10-16T14:18:09Z

python/pyspark/sql/functions.py

-        if vectorized:
+    def _udf(f, returnType=StringType(), pythonUdfType=pythonUdfType):
+        if pythonUdfType == PythonUdfType.PANDAS_UDF \
+           or pythonUdfType == PythonUdfType.PANDAS_GROUPED_UDF:


Shall we add a check that PANDAS_GROUPED_UDF can only take one parameter?

Yes, I'll add it.

SparkQA · 2017-10-16T17:04:32Z

Test build #82803 has finished for PR 19505 at commit 10512a6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-10-16T17:24:34Z

python/pyspark/sql/functions.py

+        udf_obj = UserDefinedFunction(f, returnType, pythonUdfType=pythonUdfType)
        return udf_obj._wrapped()

    # decorator @udf, @udf(), @udf(dataType()), or similar with @pandas_udf


Nit: update this comment

Thanks! I'll update it.

SparkQA · 2017-10-16T17:31:56Z

Test build #82805 has finished for PR 19505 at commit 789e642.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-10-16T17:35:23Z

python/pyspark/sql/functions.py

                                  sc.pythonVer, broadcast_vars, sc._javaAccumulator)


+class PythonUdfType(object):


Could you also add the descriptions about these three UDF types?

NORMAL_UDF: row-based UDFs

PANDAS_UDF: scalar vectorized UDFs

PANDAS_GROUPED_UDF: grouped vectorized UDFs

Sure, I'll add the descriptions.

gatorsmile · 2017-10-16T17:42:59Z

python/pyspark/sql/functions.py

+
+    This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and
+    :meth:`pyspark.sql.DataFrame.select`.
+    The returnType should be a primitive data type, e.g., `DoubleType()`.


What happens if we do not pass a primitive data type? Do we have a test case for this?

It will fail at runtime. I'll add tests.

gatorsmile · 2017-10-16T17:44:40Z

python/pyspark/sql/functions.py

+    This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and
+    :meth:`pyspark.sql.DataFrame.select`.
+    The returnType should be a primitive data type, e.g., `DoubleType()`.
+    The length of the returned `pandas.Series` must be of the same as the input `pandas.Series`.


Is this just a fact, or an input requirement?

It's an output requirement.

Can users break this requirement? If so, what happens?

Yes, they can, and it will fail.

spark/python/pyspark/sql/tests.py

Lines 3316 to 3325 in 122a7bc

    def test_vectorized_udf_invalid_length(self):
        from pyspark.sql.functions import pandas_udf, col
        import pandas as pd
        df = self.spark.range(10)
        raise_exception = pandas_udf(lambda _: pd.Series(1), LongType())
        with QuietTest(self.sc):
            with self.assertRaisesRegexp(
                    Exception,
                    'Result vector from pandas_udf was not the required length'):
                df.select(raise_exception(col('id'))).collect()

I see. Thanks!
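
The output requirement can be illustrated with a small stand-alone check. This is only a sketch: plain lists stand in for `pandas.Series` so the snippet has no dependencies, the helper names are hypothetical, and where Spark performs the real check is not shown in this thread. Only the start of the error message is taken from the test above.

```python
def verify_result_length(result, expected_length):
    """Hypothetical check: a scalar pandas_udf must return one value per input row."""
    if len(result) != expected_length:
        raise RuntimeError(
            "Result vector from pandas_udf was not the required length: "
            "expected %d, got %d" % (expected_length, len(result)))
    return result


def apply_checked(func, batch):
    """Apply func to one input batch and verify the output length."""
    return verify_result_length(func(batch), len(batch))
```

A function that maps each element passes, while one that returns a single value for a 10-row batch, like `lambda _: pd.Series(1)` in the test, fails the check.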

icexelloss · 2017-10-16T18:45:25Z

python/pyspark/sql/functions.py

+
+
+@since(2.3)
+def pandas_grouped_udf(f=None, returnType=StructType()):


Per discussion here:
#18732 (comment)

Should we consider converting pandas_udf to pandas_grouped_udf implicitly in groupby().apply(), and not introduce pandas_grouped_udf as a user-facing API?

groupby().apply() implies the udf is a grouped udf, so there should be no ambiguity here.

I submitted another pr #19517 based on this as a comparison.
I guess it covers what you are thinking.

Thanks @ueshin , yes that's what I am thinking.

Here is a summary of the current proposals from some offline discussion:

I. Use only pandas_udf

The main issue with this approach, as a few people have pointed out, is that it is hard to know what the udf does without looking at the implementation.
For instance, for a udf:

    @pandas_udf(DoubleType())
    def foo(v):
        ...

It's hard to tell whether this function is a reduction that returns a scalar double, or a transform function that returns a pd.Series of doubles.

This is less than ideal because:

The user of the udf cannot tell which functions this udf can be used with, i.e., can it be used with groupby().apply(), withColumn, or groupby().agg()?

Catalyst cannot do validation at the planning phase, i.e., it cannot throw an exception if the user passes a transformation function rather than an aggregation function to groupby().agg().

II. Use different decorators, i.e., pandas_udf (or pandas_scalar_udf), pandas_grouped_udf, pandas_udaf

The idea of this approach is to use pandas_grouped_udf for all group udfs, and pandas_scalar_udf for scalar pandas udfs that are used with withColumn. This helps distinguish between scalar udfs and group udfs. However, it doesn't help distinguish among group udfs, for instance between a group transform and a group aggregation.

III. Use the pandas_udf decorator with a function type enum for "one-step" vectorized udfs, and pandas_udaf for multi-step aggregation functions

This approach uses a function type enum to describe what the udf does. Here are the proposed function types:

transform
A pd.Series(s) -> pd.Series transformation that is independent of the grouping. This is the existing scalar pandas udf.

group_transform
A pd.Series(s) -> pd.Series transformation that depends on the grouping, e.g.

    @pandas_udf(DoubleType(), GROUP_TRANSFORM)
    def foo(v):
        return (v - v.mean()) / v.std()

group_aggregate
A pd.Series(s) -> scalar function, e.g.

    @pandas_udf(DoubleType(), GROUP_AGGREGATE)
    def foo(v):
        return v.mean()

group_map (maybe there is a better name)
This defines a pd.DataFrame -> pd.DataFrame transformation. This is the current groupby().apply() udf.

These types also work with window functions, because window functions are either (1) group_transform (rank) or (2) group_aggregate (first, last).

I am in favor of (3). What do you guys think?

Post it in the other PR, #19517? This discussion thread will be collapsed when Takuya makes a code change.

I guess we should consider merging #19517 first because it's an improvement of the behavior by introducing PythonUdfType instead of the hack to detect the udf type by the return type at worker, without any user-facing API changes from #18732.
The proposal and discussion should be in this pr but out of any thread to avoid being collapsed.

Proposal 3 looks great! One minor question: what's the difference between transform and group_transform? It seems we don't need to care about it in either usage or implementation.

Sorry for the late reply.

@gatorsmile Sounds good. I will copy the discussion in this PR as @ueshin suggested.

@ueshin +1 to merge #19517. I think it's a good change and will make it easier for later changes.

@cloud-fan transform defines a transformation that doesn't rely on grouping semantics. For instance, this is a wrong udf definition:

    @pandas_udf(DoubleType(), TRANSFORM)
    def foo(v):
        return (v - v.mean()) / v.std()

because the transformation relies on some kind of "grouping semantics"; otherwise v.mean() and v.std() have no meaning for an arbitrary grouping.

SparkQA · 2017-10-16T19:49:28Z

Test build #82810 has finished for PR 19505 at commit 122a7bc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-16T20:03:21Z

Test build #82811 has finished for PR 19505 at commit fdafb35.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-16T21:46:56Z

Test build #82813 has finished for PR 19505 at commit 7332969.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-10-17T02:02:19Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

          udf.references.subsetOf(child.outputSet)
        }
        if (validUdfs.nonEmpty) {
+          if (validUdfs.find(_.pythonUdfType == PythonUdfType.PANDAS_GROUPED_UDF).isDefined) {


nit: maybe

validUdfs.exists(_.pythonUdfType == PythonUdfType.PANDAS_GROUPED_UDF)

Thanks! I'll update it.

SparkQA · 2017-10-17T07:05:01Z

Test build #82831 has finished for PR 19505 at commit 85f250d.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-10-17T07:09:56Z

retest this please

HyukjinKwon · 2017-10-17T07:46:36Z

The change itself LGTM if we are okay with separating this.

SparkQA · 2017-10-17T10:26:47Z

Test build #82835 has finished for PR 19505 at commit 85f250d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-17T18:30:35Z

Test build #82843 has finished for PR 19505 at commit 1ef25c3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-10-18T13:43:27Z

So, it looks like we are good to go?

ueshin · 2017-10-18T16:20:55Z

I'd mark this PR as [WIP] for now because we haven't reached consensus on the API changes. Thanks.

HyukjinKwon · 2017-10-19T00:13:17Z

I meant to ask if others agree with the current change as I could not see the ongoing discussion at that time.

icexelloss · 2017-10-19T18:39:58Z

Here is a summary of the current proposals from some offline discussion:

1. Use only `pandas_udf`

The main issue with this approach, as a few people have pointed out, is that it is hard to know what the udf does without looking at the implementation.
For instance, for a udf:

    @pandas_udf(DoubleType())
    def foo(v):
        ...

It's hard to tell whether this function is a reduction that returns a scalar double, or a transform function that returns a pd.Series of doubles.

This is less than ideal because:

The user of the udf cannot tell which functions this udf can be used with, i.e., can it be used with groupby().apply(), withColumn, or groupby().agg()?
Catalyst cannot do validation at the planning phase, i.e., it cannot throw an exception if the user passes a transformation function rather than an aggregation function to groupby().agg().

2. Use different decorators, i.e., `pandas_udf` (or `pandas_scalar_udf`), `pandas_grouped_udf`, `pandas_udaf`

The idea of this approach is to use pandas_grouped_udf for all group udfs, and pandas_scalar_udf for scalar pandas udfs that are used with withColumn. This helps distinguish between scalar udfs and group udfs. However, it doesn't help distinguish among group udfs, for instance between a group transform and a group aggregation.

3. Use the `pandas_udf` decorator with a function type enum for "one-step" vectorized udfs, and `pandas_udaf` for multi-step aggregation functions

This approach uses a function type enum to describe what the udf does. Here are the proposed function types:

scalar_transform

A pd.Series(s) -> pd.Series transformation that is independent of the grouping. This is the existing scalar pandas udf.

    @pandas_udf(DoubleType(), SCALAR_TRANSFORM)
    def plus_one(v):
        return v + 1

`scalar_transform` can be used with `withColumn`, `select`, etc.:

    df = df.withColumn("v", plus_one(df.v))

group_transform

A pd.Series(s) -> pd.Series transformation that depends on the grouping, e.g.

    @pandas_udf(DoubleType(), GROUP_TRANSFORM)
    def rank(v):
        return v.rank()

`group_transform` can be used with:

window

    w = Window.partitionBy('date')
    df = df.withColumn('rank', rank(df.v).over(w))

groupby

Maybe something like this in the future (not available with the current API):

    df = df.withColumn('rank', df.groupby('id').v.transform(rank))

For reference, in pandas you would write something like this:

    df = df.assign(rank=df.groupby('id').v.transform(lambda v: v.rank()))

Although it is also a Series -> Series transformation, `group_transform` will be rejected by `withColumn`, `select`, etc.:

    # This doesn't make sense and will throw an exception
    df.withColumn('rank', rank(df.v))

group_aggregate

A pd.Series(s) -> scalar function, e.g.

    @pandas_udf(DoubleType(), GROUP_AGGREGATE)
    def mean(v):
        return v.mean()

`group_aggregate` can be used with:

window

    w = Window.partitionBy('date')
    df = df.withColumn('mean', mean(df.v).over(w))

groupby

    df = df.groupby('id').agg(mean(df.v))

group_map (maybe there is a better name)

This defines a pd.DataFrame -> pd.DataFrame transformation. This is the current groupby().apply() udf.

    @pandas_udf(df.schema, GROUP_MAP)
    def foo(pdf):
        pdf = pdf.assign(v1=pdf.v1 - pdf.v1.mean())
        pdf = pdf.assign(v2=pdf.v2 / pdf.v2.std())
        return pdf

`group_map` can be used with groupby().apply():

    df.groupby('date').apply(foo)

I am in favor of (3). What do you guys think?
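
To make the pairing between function types and APIs in proposal 3 concrete, here is a minimal sketch of the planning-time validation it would allow. Everything in it is hypothetical: the `FunctionType` enum, the string labels for the APIs, and the `validate` helper are illustrations of the proposal as summarized above, not an existing PySpark API.

```python
from enum import Enum


class FunctionType(Enum):
    """Hypothetical function type enum from proposal 3."""
    SCALAR_TRANSFORM = "scalar_transform"
    GROUP_TRANSFORM = "group_transform"
    GROUP_AGGREGATE = "group_aggregate"
    GROUP_MAP = "group_map"


# Which function types each API would accept, per the summary above.
# The keys are illustrative labels, not real method names.
ALLOWED = {
    "withColumn": {FunctionType.SCALAR_TRANSFORM},
    "select": {FunctionType.SCALAR_TRANSFORM},
    "window": {FunctionType.GROUP_TRANSFORM, FunctionType.GROUP_AGGREGATE},
    "groupby.agg": {FunctionType.GROUP_AGGREGATE},
    "groupby.apply": {FunctionType.GROUP_MAP},
}


def validate(api, function_type):
    """Raise TypeError if the udf's declared type does not fit the API."""
    if function_type not in ALLOWED[api]:
        raise TypeError(
            "%s udf cannot be used with %s" % (function_type.value, api))
```

Because the type is declared on the udf, a mismatch such as a `group_transform` udf passed to `withColumn` can be rejected before any data is processed.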

icexelloss · 2017-10-19T18:42:49Z

@cloud-fan asked:
"What's the difference between transform and group_transform? Seems we don't need to care about it both in usage and implementation."

My answer is:
transform defines a transformation that doesn't rely on grouping semantics. For instance, this is a wrong udf definition:

    @pandas_udf(DoubleType(), TRANSFORM)
    def foo(v):
        return (v - v.mean()) / v.std()

because the transformation relies on some kind of "grouping semantics"; otherwise v.mean() and v.std() have no meaning for an arbitrary grouping. Although Catalyst cannot detect this error, people reading this code can identify it more easily, because the declared type is not group transform while the user-defined function relies on grouping semantics. The transform type also allows users to test such a function by passing arbitrary groupings and verifying that the results are the same.

Also, Catalyst can throw an exception for the code example below:

    @pandas_udf(DoubleType(), GROUP_TRANSFORM)
    def foo(v):
        return (v - v.mean()) / v.std()

    # Should throw an exception here: withColumn should only take the `transform` type, not `group_transform`
    df = df.withColumn('v', foo(df.v))
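
The idea of testing a transform by passing arbitrary groupings can be sketched without Spark. In the snippet below, plain lists stand in for `pd.Series`, fixed-size chunks stand in for arbitrary groupings, and all helper names are hypothetical. A passing check does not prove independence; it only fails to find a counterexample for the chunk sizes tried.

```python
from statistics import mean


def plus_one(values):
    """Element-wise function: does not depend on how the data is chunked."""
    return [v + 1 for v in values]


def demean(values):
    """Depends on the chunk, because the mean is taken over the chunk."""
    m = mean(values)
    return [v - m for v in values]


def apply_by_chunks(func, values, chunk_size):
    """Apply func to consecutive chunks and concatenate the results."""
    out = []
    for start in range(0, len(values), chunk_size):
        out.extend(func(values[start:start + chunk_size]))
    return out


def is_grouping_independent(func, values, chunk_sizes=(1, 2, 3)):
    """Compare the whole-input result with the results for several chunkings."""
    whole = func(values)
    return all(apply_by_chunks(func, values, n) == whole for n in chunk_sizes)
```

With `values = [1, 2, 4, 8]`, `plus_one` gives the same result for every chunking, while `demean` does not, which is the signal that it should be declared as a group transform.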

viirya · 2017-10-20T00:52:00Z

Btw, I think the scope of this change is more than just a follow-up. Should we create another JIRA for it?

viirya · 2017-10-20T01:00:53Z

@icexelloss The summary and proposal 3 look great. To prevent confusion, can you also add the usage of each function type in proposal 3? E.g., group_map is for groupby().apply(), transform is for withColumn, etc. Thanks.

HyukjinKwon · 2017-10-20T01:25:40Z

+1 for a separate JIRA to clarify the proposal, and +0 for 3 out of those three, too.

viirya · 2017-10-20T03:08:07Z

The group_transform udfs look a bit weird to me. @icexelloss Can you explain the use case for them? When do we need these grouping semantics?

icexelloss · 2017-10-20T04:11:23Z

@viirya @cloud-fan I updated my original summary. I think it answers the group_transform question. I also added more examples for each type.

@HyukjinKwon @viirya I agree we can move this to a separate JIRA and merge the current PR of @ueshin. Maybe I can open another PR with just the proposal design doc? Not sure what the best way is.

gatorsmile · 2017-10-20T04:42:21Z

@ueshin Maybe close this PR?

ueshin · 2017-10-20T05:42:15Z

Sure, I'd close this.
@icexelloss Of course you can open a separate JIRA and another PR. Thanks!

Introduce @pandas_grouped_udf decorator for grouped vectorized UDF.

4d2bd95

ueshin mentioned this pull request Oct 16, 2017

[SPARK-20396][SQL][PySpark] groupby().apply() with pandas udf #18732

Closed

cloud-fan reviewed Oct 16, 2017

Use PythonUdfType instead of vectorized and grouped.

f096870

viirya reviewed Oct 16, 2017

ueshin added 3 commits October 16, 2017 22:42

Update an error message.

639af2c

Add a test to use data type string.

10512a6

Restrict the number of arguments for grouped udf to only 1.

789e642

cloud-fan reviewed Oct 16, 2017

ueshin added 2 commits October 17, 2017 01:24

Restrict checking the number of arguments.

122a7bc

Revert "Restrict checking the number of arguments."

fdafb35

This reverts commit 122a7bc.

gatorsmile reviewed Oct 16, 2017

ueshin added 2 commits October 17, 2017 03:10

Address comments.

94d05f4

Add tests for unsupported type.

7332969

icexelloss reviewed Oct 16, 2017

HyukjinKwon reviewed Oct 17, 2017

Address a comment.

85f250d

Update descriptions.

1ef25c3

ueshin changed the title ~~[SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().apply() with pandas udf~~ [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().apply() with pandas udf Oct 18, 2017

ueshin closed this Oct 20, 2017

icexelloss mentioned this pull request Oct 20, 2017

[SPARK-22323] Design doc for pandas_udf #19544

Closed


