[SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE pandas_udf with keyword args
#20900
Conversation
|
@HyukjinKwon the old pr: #20798 was a disaster from a git-cleanliness perspective so i've updated here. |
|
ok to test |
|
Test build #88566 has finished for PR 20900 at commit
|
|
@mstewart141, just to be clear, the error can be reproduced with the code below:

```
from functools import partial
from pyspark.sql.functions import pandas_udf

def test_func(a, b):
    return a + b

pandas_udf(partial(test_func, b='id'), "string")
```

I think this should work like a normal udf:

```
from functools import partial
from pyspark.sql.functions import udf

def test_func(a, b):
    return a + b

normal_udf = udf(partial(test_func, b='id'), "string")
df = spark.createDataFrame([["a"]])
df.select(normal_udf("_1")).show()
```

So, I think we should add support for callable objects / partial functions in Pandas UDFs in Python 2 too. Would you be interested in filing JIRA(s) and proceeding? If you are busy, I am willing to do it as well. cc @ueshin, @BryanCutler and @icexelloss FYI. |
python/pyspark/sql/functions.py (outdated diff):

```
in boolean expressions and it ends up with being executed all internally. If the functions
can fail on special rows, the workaround is to incorporate the condition into the functions.
.. note:: The user-defined functions do not take keyword arguments.
```
Let's add "on the calling side" (I mean the call site, not the function declaration side) at the end for clarification.
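Presumably (an assumption about the resulting wording, not a quote from the final diff) the note would then read something like:

```
.. note:: The user-defined functions do not take keyword arguments on the calling side.
```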
|
From a very quick look at the case "Try to be sneaky and don't use keywords with partial:", it seems to be due to a type mismatch, and it seems to work fine in Python 3. I think we can take this example out of the description. |
|
LGTM except #20900 (comment) |
|
Many (though not all, I don't think) of these cases are fixable. In the incremental commits of this PR you can see the kind of things I was investigating to try to fix that case. Indeed, my original PR was (ambitiously) titled something about enabling keyword args for all pandas_udfs. This is actually very easy to do for functions on Python 3 (and maybe for plain functions on Python 2 also, but I suspect that is rather tricky). My summary is that the change to allow keywords for functions is simple (at least in Python 3 -- indeed my incremental commit referenced above does this), but for partial fns maybe not so much. I'm pretty confident I'm most of the way to accomplishing the former, but not the latter. However, I have no substantial knowledge of the pyspark codebase, so you will likely have better luck there should you go down that route :) TL;DR: I could work on a PR to allow keyword arguments for Python 3 functions (not partials, and not Python 2), but that is likely too narrow a goal given the broader context. One general question: how do we tend to think about the Python 2/3 split for API quirks/features? Must everything that is added for Python 3 also be functional in Python 2? |
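To illustrate the "easy for plain functions" half of that claim, here is a minimal hypothetical sketch (not code from this PR or from pyspark): a generic `*args, **kwargs` wrapper forwards keyword arguments transparently, which is roughly what lifting the restriction would require on the wrapping side.

```
import functools

def forwarding_wrapper(func):
    # Hypothetical sketch: a wrapper that does not swallow keyword arguments.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

def ok(a, b):
    return a + b

wrapped = forwarding_wrapper(ok)
print(wrapped(1, b=2))  # 3 -- keywords pass straight through
```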
Ideally, is there something you have in mind that would not work in Python 2? (Aside from keywords in your example) |
|
Test build #88573 has finished for PR 20900 at commit
|
|
I think we should generally make everything work in both Python 2 and Python 3, but I want to know if there are special cases that I am missing, if there are any. |
…ord args
## What changes were proposed in this pull request?
Add documentation about the limitations of `pandas_udf` with keyword arguments and related concepts, like `functools.partial` fn objects.
NOTE: intermediate commits on this PR show some of the steps that can be taken to fix some (but not all) of these pain points.
### Survey of problems we face today:
(Initialize) Note: Python 3.6 and Spark 2.4-SNAPSHOT.
```
from pyspark.sql import SparkSession
import inspect, functools
from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit, udf
spark = SparkSession.builder.getOrCreate()
print(spark.version)
df = spark.range(1,6).withColumn('b', col('id') * 2)
def ok(a,b): return a+b
```
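For reference (not part of the original description), `df` here holds ids 1 through 5 with `b = 2 * id`; `df.show()` would print something like:

```
+---+---+
| id|  b|
+---+---+
|  1|  2|
|  2|  4|
|  3|  6|
|  4|  8|
|  5| 10|
+---+---+
```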
Using a keyword argument at the call site `b=...` (and yes, *full* stack trace below, haha):
```
---> 14 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id', b='id')).show() # no kwargs
TypeError: wrapper() got an unexpected keyword argument 'b'
```
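For contrast, a hedged sketch (reusing `ok` and `df` from the setup above; not part of the original description) of call patterns that do work: pass the columns positionally, or bind a constant with `functools.partial` positionally.

```
import functools
from pyspark.sql.functions import pandas_udf

# Positional arguments at the call site work fine.
df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id', 'id')).show()

# Binding a constant positionally via partial also works; binding the first
# parameter by keyword is the case that explodes, as shown below.
df.withColumn('ok', pandas_udf(f=functools.partial(ok, 7), returnType='bigint')('id')).show()
```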
Using partial with a keyword argument where the kw-arg is the first argument of the fn:
*(Aside: kind of interesting that lines 15,16 work great and then 17 explodes)*
```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-e9f31b8799c1> in <module>()
15 df.withColumn('ok', pandas_udf(f=functools.partial(ok, 7), returnType='bigint')('id')).show()
16 df.withColumn('ok', pandas_udf(f=functools.partial(ok, b=7), returnType='bigint')('id')).show()
---> 17 df.withColumn('ok', pandas_udf(f=functools.partial(ok, a=7), returnType='bigint')('id')).show()
/Users/stu/ZZ/spark/python/pyspark/sql/functions.py in pandas_udf(f, returnType, functionType)
2378 return functools.partial(_create_udf, returnType=return_type, evalType=eval_type)
2379 else:
-> 2380 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
2381
2382
/Users/stu/ZZ/spark/python/pyspark/sql/udf.py in _create_udf(f, returnType, evalType)
54 argspec.varargs is None:
55 raise ValueError(
---> 56 "Invalid function: 0-arg pandas_udfs are not supported. "
57 "Instead, create a 1-arg pandas_udf and ignore the arg in your function."
58 )
ValueError: Invalid function: 0-arg pandas_udfs are not supported. Instead, create a 1-arg pandas_udf and ignore the arg in your function.
```
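A plausible reading of why line 17 alone explodes (a sketch of an assumption, not something stated in this PR): when `functools.partial` binds the *first* positional parameter by keyword, the remaining parameter becomes keyword-only, so the argspec inspected in `_create_udf` reports zero positional arguments and the 0-arg check fires. This can be seen outside Spark with `inspect`:

```
import functools
import inspect

def ok(a, b):
    return a + b

# Binding positionally, or binding the *last* parameter by keyword, leaves a
# positional argument behind, so the argspec check passes.
print(inspect.getfullargspec(functools.partial(ok, 7)).args)    # ['b']
print(inspect.getfullargspec(functools.partial(ok, b=7)).args)  # ['a']

# Binding the *first* parameter by keyword turns `b` into a keyword-only
# argument, so `args` is empty -- which is what trips the 0-arg ValueError.
print(inspect.getfullargspec(functools.partial(ok, a=7)).args)  # []
```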
Author: Michael (Stu) Stewart <[email protected]>
Closes #20900 from mstewart141/udfkw2.
(cherry picked from commit 087fb31)
Signed-off-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
|
Merged to master and branch-2.3 anyway. |
|
Thank you @mstewart141 for looking into this. @HyukjinKwon, should we open a JIRA for supporting keyword args and partial functions in Python UDFs? If I understand correctly, this is related to both regular Python UDFs and Pandas UDFs, is that right? |
|
Partials (and callable objects) are supported in UDF but not in Pandas UDF. |
|
@icexelloss, yup, ^ is correct. IIRC, we have some tests for normal udfs with callable objects and partial functions separately, but it seems the problem is in Pandas UDF. I think the fix itself will be relatively minimal (just my wild guess). Would you be interested in doing this? |
|
To be clear, I think both function shapes below should work with Pandas UDF, but neither seems to work given my test in #20900 (comment):

```
class F(object):
    def __call__(self, *args):
        ...

func = F()
```

```
def naive_func(a, b):
    ...

func = partial(naive_func, 1)
```
|
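For contrast, a hedged sketch (the filled-in bodies are hypothetical) of the same two shapes with the plain `udf`, which the discussion above says does support them:

```
from functools import partial
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

class F(object):
    def __call__(self, x):
        # hypothetical body: identity
        return x

def naive_func(a, b):
    # hypothetical body: string concatenation
    return a + b

callable_udf = udf(F(), "string")                    # callable object
partial_udf = udf(partial(naive_func, "x"), "string")  # partial with first arg bound

df = spark.createDataFrame([["a"]])
df.select(callable_udf("_1"), partial_udf("_1")).show()
```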
|
The issue itself here (SPARK-23645) describes keyword-argument support in both UDF and Pandas UDF on the calling side. It seems not to work, but the fix looks like it would be quite invasive and big, so I suggested fixing the documentation for now. Maybe we should revisit it in the future; let's monitor the mailing list and JIRAs. #20900 (comment), with #20900 (comment), is a separate issue about partial functions and callable objects in Pandas UDF that I found during review. |
|
@HyukjinKwon Thanks for the explanation. I will create a JIRA for partial functions and callable objects in Pandas UDF. I am happy to take a look at it. |
|
@icexelloss: as a daily user of [...] (take this as just one data point, certainly; others may have differing opinions.) |