@d80tb7
Contributor

@d80tb7 d80tb7 commented Oct 14, 2019

This PR adds some extra documentation for the new cogrouped map Pandas UDFs. Specifically:

@SparkQA

SparkQA commented Oct 14, 2019

Test build #112022 has finished for PR 26110 at commit 0ecba8a.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 14, 2019

Test build #112025 has finished for PR 26110 at commit 1802cbd.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 14, 2019

Test build #112029 has finished for PR 26110 at commit da4f00b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@HyukjinKwon
Member

From a cursory look, seems fine. cc @icexelloss, @BryanCutler, @viirya


### Cogrouped Map

CoGrouped map Pandas UDFs allow two DataFrames to be cogrouped a by a common key and then a python function applied to
Member

Is it CoGrouped or Cogrouped :-)?

on how to label columns when constructing a `pandas.DataFrame`.

Note that all data for a cogroup will be loaded into memory before the function is applied. This can lead to out of
memory exceptions, especially if the group sizes are skewed. The configuration for[maxRecordsPerBatch](#setting-arrow-batch-size)
Member

typo: `for[maxRecordsPerBatch]` -> `for [maxRecordsPerBatch]`

@SparkQA

SparkQA commented Oct 23, 2019

Test build #112538 has finished for PR 26110 at commit da4f00b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@BryanCutler BryanCutler left a comment

Thanks for doing this @d80tb7 , I just had some minor comments but looks good overall.


### Cogrouped Map

CoGrouped map Pandas UDFs allow two DataFrames to be cogrouped a by a common key and then a python function applied to
Member

cogrouped a by -> cogrouped by

each cogroup. They are used with `groupBy().cogroup().apply()` which consists of the following steps:

* Shuffle the data such that the groups of each dataframe which share a key are cogrouped together.
* Apply a function to each cogroup. The input of of the function is two `pandas.DataFrame` (with an optional Tuple
Member

Duplicate "of" in "input of of".

* Shuffle the data such that the groups of each dataframe which share a key are cogrouped together.
* Apply a function to each cogroup. The input of of the function is two `pandas.DataFrame` (with an optional Tuple
representing the key). The output of the function is a `pandas.DataFrame`.
* Combine the results into a new `DataFrame`.
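
As a rough illustration of the three steps above, here is a minimal plain-Python sketch that uses lists of dicts in place of `pandas.DataFrame`s. It is not Spark's implementation and does not call any PySpark API: `cogroup_apply` and `count_rows` are hypothetical names invented for this example, the shuffle is replaced by in-memory bucketing, and passing an empty group for a key that appears on only one side is an assumption of the sketch.

```python
from collections import defaultdict


def cogroup_apply(left_rows, right_rows, key, func):
    """Plain-Python stand-in for the three steps; not Spark's implementation."""
    # Step 1: bucket the rows of each input by key (Spark does this with a shuffle).
    left_groups = defaultdict(list)
    right_groups = defaultdict(list)
    for row in left_rows:
        left_groups[row[key]].append(row)
    for row in right_rows:
        right_groups[row[key]].append(row)

    results = []
    # Step 2: call the function once per key seen on either side.
    # Assumption of this sketch: a key missing from one side gets an empty group.
    for k in sorted(set(left_groups) | set(right_groups)):
        output = func(k, left_groups.get(k, []), right_groups.get(k, []))
        # Step 3: concatenate the per-cogroup outputs into a single result.
        results.extend(output)
    return results


def count_rows(k, left, right):
    # Hypothetical per-cogroup function: emits one output row per key.
    return [{"id": k, "n_left": len(left), "n_right": len(right)}]


left = [{"id": 1, "v": 10}, {"id": 1, "v": 11}, {"id": 2, "v": 20}]
right = [{"id": 1, "w": "a"}, {"id": 3, "w": "c"}]
combined = cogroup_apply(left, right, "id", count_rows)
```

The function receives the key together with both groups, mirroring the optional key argument described above; each call here handles one complete cogroup at a time.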
Member

Maybe elaborate to explain that the results are the `pandas.DataFrame`s from all groups, which are combined into a new PySpark `DataFrame`.

6. COGROUPED_MAP
A cogrouped map UDF defines transformation: two `pandas.DataFrame` -> `pandas.DataFrame`
Member

I think instead of "two pandas.DataFrame", better to show "(pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame"

6. COGROUPED_MAP
A cogrouped map UDF defines transformation: two `pandas.DataFrame` -> `pandas.DataFrame`
The returnType should be a :class:`StructType` describing the schema of the returned
Member

returnType -> `returnType`

@SparkQA

SparkQA commented Oct 30, 2019

Test build #112939 has finished for PR 26110 at commit 81713b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 30, 2019

Test build #112944 has finished for PR 26110 at commit f7b9b80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

5 participants