[SPARK-29126][PYSPARK][DOC] Pandas Cogroup udf usage guide #26110
…9126-cogroup-udf-usage-guide # Conflicts: # python/pyspark/sql/cogroup.py
| Test build #112022 has finished for PR 26110 at commit |
| Test build #112025 has finished for PR 26110 at commit |
| Test build #112029 has finished for PR 26110 at commit |
| retest this please | 
| From a cursory look, seems fine. cc @icexelloss, @BryanCutler, @viirya | 
|  | ||
| ### Cogrouped Map | ||
|  | ||
| CoGrouped map Pandas UDFs allow two DataFrames to be cogrouped a by a common key and then a python function applied to | 
Is it CoGrouped or Cogrouped :-)?
| on how to label columns when constructing a `pandas.DataFrame`. | ||
|  | ||
| Note that all data for a cogroup will be loaded into memory before the function is applied. This can lead to out of | ||
| memory exceptions, especially if the group sizes are skewed. The configuration for[maxRecordsPerBatch](#setting-arrow-batch-size) | 
typo: for[maxRecordsPerBatch] -> for [maxRecordsPerBatch]
| Test build #112538 has finished for PR 26110 at commit |
Thanks for doing this @d80tb7, I just had some minor comments, but it looks good overall.
|  | ||
| ### Cogrouped Map | ||
|  | ||
| CoGrouped map Pandas UDFs allow two DataFrames to be cogrouped a by a common key and then a python function applied to | 
cogrouped a by -> cogrouped by
| each cogroup. They are used with `groupBy().cogroup().apply()` which consists of the following steps: | ||
|  | ||
| * Shuffle the data such that the groups of each dataframe which share a key are cogrouped together. | ||
| * Apply a function to each cogroup. The input of of the function is two `pandas.DataFrame` (with an optional Tuple | 
duplicate "of" in "input of of"
| * Shuffle the data such that the groups of each dataframe which share a key are cogrouped together. | ||
| * Apply a function to each cogroup. The input of of the function is two `pandas.DataFrame` (with an optional Tuple | ||
| representing the key). The output of the function is a `pandas.DataFrame`. | ||
| * Combine the results into a new `DataFrame`. | 
Maybe elaborate to explain results are pandas.DataFrames from all groups that are combined in a new pyspark.DataFrame
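For readers following the quoted doc text, the three steps can be sketched in plain pandas, without Spark. This is only an illustration of the semantics, not the PySpark implementation: `cogroup_apply`, `summarize`, and the sample frames are hypothetical names made up here, the key-first argument order is an assumption, and the handling of keys present on only one side (an empty frame for the missing side) is likewise an assumption about how cogrouping behaves.

```python
import pandas as pd


def summarize(key, left, right):
    """Hypothetical per-cogroup function.

    Takes the key tuple plus two pandas.DataFrame objects and returns one
    pandas.DataFrame, i.e. (pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame.
    """
    return pd.DataFrame(
        {"id": [key[0]], "left_rows": [len(left)], "right_rows": [len(right)]}
    )


def cogroup_apply(left, right, key, func):
    """Emulate the three documented steps locally (no Spark involved)."""
    # Step 1: bring together the groups of each frame that share a key.
    # Spark does this with a shuffle; here it is a dictionary lookup.
    left_groups = {k: g for k, g in left.groupby(key)}
    right_groups = {k: g for k, g in right.groupby(key)}
    all_keys = sorted(set(left_groups) | set(right_groups))

    # Step 2: apply the function to each cogroup. A key missing on one
    # side gets an empty frame with that side's columns (assumption).
    results = []
    for k in all_keys:
        left_part = left_groups.get(k, left.iloc[0:0])
        right_part = right_groups.get(k, right.iloc[0:0])
        results.append(func((k,), left_part, right_part))

    # Step 3: combine the per-cogroup pandas.DataFrames into one result.
    return pd.concat(results, ignore_index=True)


left = pd.DataFrame({"id": [1, 1, 2], "v1": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"id": [1, 3], "v2": ["x", "y"]})

result = cogroup_apply(left, right, "id", summarize)
print(result)
```

In Spark the final combination yields a distributed `DataFrame` rather than a single in-memory pandas frame, which is the distinction the comment above asks the doc to spell out.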
python/pyspark/sql/functions.py (Outdated)
| 6. COGROUPED_MAP | ||
| A cogrouped map UDF defines transformation: two `pandas.DataFrame` -> `pandas.DataFrame` | 
I think instead of "two pandas.DataFrame", better to show "(pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame"
python/pyspark/sql/functions.py (Outdated)
| 6. COGROUPED_MAP | ||
| A cogrouped map UDF defines transformation: two `pandas.DataFrame` -> `pandas.DataFrame` | ||
| The returnType should be a :class:`StructType` describing the schema of the returned | 
returnType -> `returnType`
…9126-cogroup-udf-usage-guide
| Test build #112939 has finished for PR 26110 at commit |
| Test build #112944 has finished for PR 26110 at commit |
| Merged to master. | 
This PR adds some extra documentation for the new Cogrouped map Pandas udfs. Specifically:
- `COGROUPED_MAP` Pandas udfs added in [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs #24981