@@ -3236,6 +3236,58 @@ def pandas_udf(f=None, returnType=None, functionType=None):
32363236 | 1| 21|
32373237 +---+---+
32383238
3239+ 6. COGROUPED_MAP
3240+
3241+ A cogrouped map UDF defines transformation: (`pandas.DataFrame`, `pandas.DataFrame`) ->
3242+ `pandas.DataFrame`. The `returnType` should be a :class:`StructType` describing the schema
3243+ of the returned `pandas.DataFrame`. The column labels of the returned `pandas.DataFrame`
3244+ must either match the field names in the defined `returnType` schema if specified as strings,
3245+ or match the field data types by position if not strings, e.g. integer indices. The length
3246+ of the returned `pandas.DataFrame` can be arbitrary.
3247+
3248+ Cogrouped map UDFs are used with :meth:`pyspark.sql.CoGroupedData.apply`.
3249+
3250+ >>> import pandas as pd
3250+ >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
3251+ >>> df1 = spark.createDataFrame(
3252+ ... [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)],
3253+ ... ("time", "id", "v1"))
3254+ >>> df2 = spark.createDataFrame(
3255+ ... [(20000101, 1, "x"), (20000101, 2, "y")],
3256+ ... ("time", "id", "v2"))
3257+ >>> @pandas_udf("time int, id int, v1 double, v2 string",
3258+ ... PandasUDFType.COGROUPED_MAP) # doctest: +SKIP
3259+ ... def asof_join(l, r):
3260+ ... return pd.merge_asof(l, r, on="time", by="id")
3261+ >>> df1.groupby("id").cogroup(df2.groupby("id")).apply(asof_join).show() # doctest: +SKIP
3262+ +--------+---+---+---+
3263+ |    time| id| v1| v2|
3264+ +--------+---+---+---+
3265+ |20000101|  1|1.0|  x|
3266+ |20000102|  1|3.0|  x|
3267+ |20000101|  2|2.0|  y|
3268+ |20000102|  2|4.0|  y|
3269+ +--------+---+---+---+
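Because the doctest above is marked ``+SKIP`` (it needs a running Spark session), the following is a minimal, Spark-free sketch of what the UDF computes for a single cogroup: Spark hands the UDF the two pandas DataFrames for one key (here id=1), and `pd.merge_asof` does the as-of join. The literal frames `l` and `r` below are illustrative stand-ins for what Spark would pass, not part of the docstring.

```python
import pandas as pd

# The two pandas DataFrames Spark would pass to asof_join for the id=1
# cogroup (both must be sorted by the "on" key for merge_asof).
l = pd.DataFrame({"time": [20000101, 20000102], "id": [1, 1], "v1": [1.0, 3.0]})
r = pd.DataFrame({"time": [20000101], "id": [1], "v2": ["x"]})

# merge_asof matches each left row to the most recent right row whose
# "time" is <= the left row's "time", within the same "id" group.
joined = pd.merge_asof(l, r, on="time", by="id")
print(joined.to_string(index=False))
```

Both left rows pick up ``v2 == "x"``, since 20000101 is the latest right-side time at or before each left-side time; this matches the two id=1 rows in the expected output above.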
3270+
3271+ Alternatively, the user can define a function that takes three arguments. In this case,
3272+ the grouping key(s) will be passed as the first argument and the data will be passed as the
3273+ second and third arguments. The grouping key(s) will be passed as a tuple of numpy data
3274+ types, e.g., `numpy.int32` and `numpy.float64`. The data will still be passed in as two
3275+ `pandas.DataFrame` objects containing all columns from the original Spark DataFrames.
3275+
3276+ >>> @pandas_udf("time int, id int, v1 double, v2 string",
3277+ ... PandasUDFType.COGROUPED_MAP) # doctest: +SKIP
3278+ ... def asof_join(k, l, r):
3279+ ... if k == (1,):
3280+ ... return pd.merge_asof(l, r, on="time", by="id")
3281+ ... else:
3282+ ... return pd.DataFrame(columns=['time', 'id', 'v1', 'v2'])
3283+ >>> df1.groupby("id").cogroup(df2.groupby("id")).apply(asof_join).show() # doctest: +SKIP
3284+ +--------+---+---+---+
3285+ |    time| id| v1| v2|
3286+ +--------+---+---+---+
3287+ |20000101|  1|1.0|  x|
3288+ |20000102|  1|3.0|  x|
3289+ +--------+---+---+---+
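The three-argument variant can also be checked locally without Spark: the key tuple lets the function treat cogroups differently, and returning an empty DataFrame (with the right columns) drops a whole group from the result. The frames `l2` and `r2` below are hypothetical stand-ins for what Spark would pass for the id=2 cogroup.

```python
import pandas as pd

# Pure-pandas version of the three-argument UDF above: the key tuple k
# selects which cogroups produce output; all but id=1 are dropped.
def asof_join(k, l, r):
    if k == (1,):
        return pd.merge_asof(l, r, on="time", by="id")
    return pd.DataFrame(columns=["time", "id", "v1", "v2"])

l2 = pd.DataFrame({"time": [20000101, 20000102], "id": [2, 2], "v1": [2.0, 4.0]})
r2 = pd.DataFrame({"time": [20000101], "id": [2], "v2": ["y"]})

# The id=2 cogroup yields an empty frame with the schema preserved,
# which is why only id=1 rows appear in the expected output above.
out = asof_join((2,), l2, r2)
```

Returning an empty DataFrame that still carries the declared columns matters: the rows vanish, but the result stays consistent with the `returnType` schema.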
3290+
32393291 .. note:: The user-defined functions are considered deterministic by default. Due to
32403292 optimization, duplicate invocations may be eliminated or the function may even be invoked
32413293 more times than it is present in the query. If your function is not deterministic, call