[SPARK-29126][PYSPARK][DOC] Pandas Cogroup udf usage guide #26110

d80tb7 · 2019-10-14T12:01:39Z

This PR adds some extra documentation for the new Cogrouped map Pandas udfs. Specifically:

* Updated the usage guide for the new COGROUPED_MAP Pandas udfs added in [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs #24981
* Updated the docstring for pandas_udf to include the COGROUPED_MAP type as suggested by @HyukjinKwon in [SPARK-27463][PYTHON][FOLLOW-UP] Miscellaneous documentation and code cleanup of cogroup pandas UDF #25939

SparkQA · 2019-10-14T12:05:03Z

Test build #112022 has finished for PR 26110 at commit 0ecba8a.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-14T12:39:33Z

Test build #112025 has finished for PR 26110 at commit 1802cbd.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-14T13:45:59Z

Test build #112029 has finished for PR 26110 at commit da4f00b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-10-23T11:00:05Z

retest this please

HyukjinKwon · 2019-10-23T11:00:23Z

From a cursory look, seems fine. cc @icexelloss, @BryanCutler, @viirya

HyukjinKwon · 2019-10-23T11:00:47Z

docs/sql-pyspark-pandas-with-arrow.md


+### Cogrouped Map
+
+CoGrouped map Pandas UDFs allow two DataFrames to be cogrouped a by a common key and then a python function applied to


Is it CoGrouped or Cogrouped :-)?

HyukjinKwon · 2019-10-23T11:02:02Z

docs/sql-pyspark-pandas-with-arrow.md

+on how to label columns when constructing a `pandas.DataFrame`.
+
+Note that all data for a cogroup will be loaded into memory before the function is applied. This can lead to out of
+memory exceptions, especially if the group sizes are skewed. The configuration for[maxRecordsPerBatch](#setting-arrow-batch-size)


typo: for[maxRecordsPerBatch] -> for [maxRecordsPerBatch]

SparkQA · 2019-10-23T11:50:56Z

Test build #112538 has finished for PR 26110 at commit da4f00b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler

Thanks for doing this @d80tb7, I just had some minor comments, but it looks good overall.

BryanCutler · 2019-10-23T22:08:39Z

docs/sql-pyspark-pandas-with-arrow.md


+### Cogrouped Map
+
+CoGrouped map Pandas UDFs allow two DataFrames to be cogrouped a by a common key and then a python function applied to


cogrouped a by -> cogrouped by

BryanCutler · 2019-10-23T22:09:43Z

docs/sql-pyspark-pandas-with-arrow.md

+each cogroup.  They are used with `groupBy().cogroup().apply()` which consists of the following steps:
+
+* Shuffle the data such that the groups of each dataframe which share a key are cogrouped together.
+* Apply a function to each cogroup.  The input of of the function is two `pandas.DataFrame` (with an optional Tuple


duplicate "of" in "input of of"

BryanCutler · 2019-10-23T22:11:40Z

docs/sql-pyspark-pandas-with-arrow.md

+* Shuffle the data such that the groups of each dataframe which share a key are cogrouped together.
+* Apply a function to each cogroup.  The input of of the function is two `pandas.DataFrame` (with an optional Tuple
+representing the key).  The output of the function is a `pandas.DataFrame`.
+* Combine the results into a new `DataFrame`.


Maybe elaborate to explain results are pandas.DataFrames from all groups that are combined in a new pyspark.DataFrame
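
To make the three quoted steps concrete, here is a minimal pandas-only sketch of the semantics. It is an emulation for illustration, not Spark's implementation: `emulate_cogroup_apply` and `asof_join` are hypothetical names, the sample data is made up, and passing an empty frame for a key that is missing on one side is an assumption about the cogroup behaviour.

```python
import pandas as pd


def asof_join(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Per-cogroup function: (pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame."""
    return pd.merge_asof(left, right, on="time", by="id")


def emulate_cogroup_apply(left, right, key, func):
    """Emulate the shuffle / apply / combine steps locally (hypothetical helper)."""
    # Step 1: bring together the groups of each frame that share a key.
    left_groups = {k: g for k, g in left.groupby(key)}
    right_groups = {k: g for k, g in right.groupby(key)}

    results = []
    for k in sorted(set(left_groups) | set(right_groups)):
        # Assumption: a key missing on one side is passed as an empty frame.
        l = left_groups.get(k, left.iloc[0:0])
        r = right_groups.get(k, right.iloc[0:0])
        # Step 2: apply the function to each cogroup.
        results.append(func(l, r))

    # Step 3: combine the per-cogroup pandas.DataFrames into one result.
    return pd.concat(results, ignore_index=True)


df1 = pd.DataFrame({
    "time": [20000101, 20000101, 20000102, 20000102],
    "id": [1, 2, 1, 2],
    "v1": [1.0, 2.0, 3.0, 4.0],
})
df2 = pd.DataFrame({
    "time": [20000101, 20000101],
    "id": [1, 2],
    "v2": ["x", "y"],
})

result = emulate_cogroup_apply(df1, df2, "id", asof_join)
print(result)
```

In Spark the final step yields a distributed `DataFrame` rather than a local `pandas.DataFrame`, which is the distinction the comment above asks the guide to spell out.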

BryanCutler · 2019-10-23T22:19:25Z

python/pyspark/sql/functions.py


+    6. COGROUPED_MAP
+
+       A cogrouped map UDF defines transformation: two `pandas.DataFrame` -> `pandas.DataFrame`


I think instead of "two pandas.DataFrame", better to show "(pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame"
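
As an illustration of the suggested signature, the sketch below shows both function shapes: the two-argument form and the form that also receives the key as a leading tuple. The function names, column names, and schema string are hypothetical. The Spark wiring in `wire_into_spark` assumes the `applyInPandas` method available on cogrouped data in released Spark 3.0 and later, whereas the text under review describes `groupBy().cogroup().apply()`; it is left uncalled here so the block runs with pandas alone.

```python
import pandas as pd


def count_rows(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Two-argument form: (pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame."""
    return pd.DataFrame({"left_rows": [len(left)], "right_rows": [len(right)]})


def count_rows_with_key(key, left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Three-argument form: the grouping key arrives first, as a tuple."""
    return pd.DataFrame({
        "id": [key[0]],
        "left_rows": [len(left)],
        "right_rows": [len(right)],
    })


def wire_into_spark(df1, df2):
    """Hypothetical wiring for two Spark DataFrames that share an `id` column.

    Assumes Spark 3.0+, where cogrouped data exposes `applyInPandas`.
    """
    return df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
        count_rows_with_key, schema="id long, left_rows long, right_rows long"
    )


# Calling the per-cogroup function directly on one cogroup's worth of data.
left = pd.DataFrame({"id": [1, 1], "v1": [1.0, 3.0]})
right = pd.DataFrame({"id": [1], "v2": ["x"]})
out = count_rows_with_key((1,), left, right)
print(out)
```

Because each function is plain pandas, it can be unit-tested on small frames like this before being handed to Spark.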

BryanCutler · 2019-10-23T22:22:14Z

python/pyspark/sql/functions.py

+    6. COGROUPED_MAP
+
+       A cogrouped map UDF defines transformation: two `pandas.DataFrame` -> `pandas.DataFrame`
+       The returnType should be a :class:`StructType` describing the schema of the returned


returnType -> `returnType`

SparkQA · 2019-10-30T17:24:39Z

Test build #112939 has finished for PR 26110 at commit 81713b7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-30T17:43:40Z

Test build #112944 has finished for PR 26110 at commit f7b9b80.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-10-31T01:41:26Z

Merged to master.

d80tb7 added 2 commits September 29, 2019 21:27

updated user guide ofr cogroup pandas udf

695e0ec

Merge branch 'master' of https://github.com/apache/spark into SPARK-2…

0ecba8a

…9126-cogroup-udf-usage-guide # Conflicts: # python/pyspark/sql/cogroup.py

formatting fixes

1802cbd

stop sphinx thinking we need to substitute

da4f00b

dongjoon-hyun added the DOCUMENTATION label Oct 14, 2019

HyukjinKwon reviewed Oct 23, 2019

BryanCutler reviewed Oct 23, 2019

d80tb7 added 2 commits October 27, 2019 08:38

Merge branch 'master' of https://github.com/apache/spark into SPARK-2…

81713b7

…9126-cogroup-udf-usage-guide

code review comments

f7b9b80

HyukjinKwon closed this in c294943 Oct 31, 2019



5 participants