@vectorijk
Contributor

What changes were proposed in this pull request?

Guide for

  • UDFs with dapply, dapplyCollect
  • spark.lapply for running parallel R functions

How was this patch tested?

build locally
screen shot 2016-06-14 at 03 12 56

@vectorijk
Contributor Author

cc @jkbradley @shivaram

@SparkQA

SparkQA commented Jun 14, 2016

Test build #60485 has finished for PR 13660 at commit 2611549.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

docs/sparkr.md Outdated
Contributor

It would be good to add an introduction here saying that there are two kinds of user-defined functions we support in SparkR. Something like:

In SparkR we support two kinds of user-defined functions
1. Run a given function on a large dataset using dapply. 
2. Run many functions in parallel using spark.lapply. 
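
For readers following the thread, here is a minimal sketch of the second kind, `spark.lapply`. It assumes a local Spark installation with the SparkR package available, and it uses the call form without a Spark context argument (the form in Spark 2.0 and later). The `fit_r2` function and the use of the built-in `mtcars` dataset are illustrative choices, not taken from the PR.

```r
library(SparkR)

# Assumption: Spark is installed locally and SparkR can start a session.
sparkR.session()

# Hypothetical task: fit polynomial regressions of increasing degree.
# Each element of `degrees` is handled by a separate invocation of fit_r2,
# which Spark distributes across the available workers.
degrees <- 1:3

fit_r2 <- function(d) {
  fit <- lm(mpg ~ poly(wt, d), data = mtcars)
  summary(fit)$r.squared
}

# Returns an ordinary R list with one result per input element,
# so the combined results must fit on the driver machine.
r2 <- spark.lapply(degrees, fit_r2)
print(r2)

sparkR.session.stop()
```

Unlike `dapply`, the input here is a plain R list or vector rather than a `SparkDataFrame`.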

@shivaram
Contributor

Thanks @vectorijk - I left some comments inline.

cc @felixcheung

docs/sparkr.md Outdated
Member

perhaps explain why the schema needs to be passed here?
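
A minimal sketch of the point in question, assuming a local Spark installation and the built-in `faithful` dataset. `dapply` returns a distributed `SparkDataFrame`, so the column names and types of the result are declared in `schema`, and they need to match what the function actually returns. The column name `waiting_secs` is an illustrative choice.

```r
library(SparkR)

# Assumption: Spark is installed locally and SparkR can start a session.
sparkR.session()

df <- createDataFrame(faithful)

# The schema describes the rows the function will produce:
# the two original columns plus one new double column.
schema <- structType(
  structField("eruptions", "double"),
  structField("waiting", "double"),
  structField("waiting_secs", "double")
)

# The function receives each partition as a local R data.frame
# and must return a data.frame whose columns match the schema.
df1 <- dapply(df, function(x) { cbind(x, x$waiting * 60) }, schema)

head(collect(df1))
```

Because the result stays distributed until `collect` is called, the schema is how the output's structure is known without inspecting the function's return value on the driver.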

@jkbradley
Member

Ping @vectorijk

@vectorijk force-pushed the spark-15672-R-guide-update branch from 063bc8e to 920c975 June 19, 2016 00:45
@vectorijk
Contributor Author

@jkbradley @shivaram @felixcheung addressed comments.

@SparkQA

SparkQA commented Jun 19, 2016

Test build #60788 has finished for PR 13660 at commit 920c975.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 19, 2016

Test build #60787 has finished for PR 13660 at commit 063bc8e.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@vectorijk
Contributor Author

Jenkins test this again.

@felixcheung
Member

Great! Please see pending PR #13752, which removes the sc parameter from spark.lapply.

@NarineK
Contributor

NarineK commented Jun 20, 2016

Hi @vectorijk, @felixcheung, @sun-rui, @shivaram
While looking at the documentation generated in R for dapply, I noticed that some information is duplicated. I'm not sure if this is the right place to ask about it, but I thought you might have seen it.
In the help I see the following:

Arguments

x   
A SparkDataFrame
func    
A function to be applied to each partition of the SparkDataFrame. func should have only one parameter, to which a data.frame corresponds to each partition will be passed. The output of func should be a data.frame.
schema  
The schema of the resulting SparkDataFrame after the function is applied. It must match the output of func.
x   
A SparkDataFrame
func    
A function to be applied to each partition of the SparkDataFrame. func should have only one parameter, to which a data.frame corresponds to each partition will be passed. The output of func should be a data.frame.

See Also

Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, select, showDF, show, str, take, unionAll, unpersist, withColumn, with, write.df, write.jdbc, write.json, write.parquet, write.text

Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, select, showDF, show, str, take, unionAll, unpersist, withColumn, with, write.df, write.jdbc, write.json, write.parquet, write.text

Is this on purpose?

@felixcheung
Member

@NarineK That is sort of unrelated to this PR since this PR is about the programming guide?

But in short, this happens because in the R code both dapply and dapplyCollect have the @rdname tag set to "dapply". I'm not sure if we need to do that. The first copy of "x ..." and "func ..." comes from "dapply" and the second from "dapplyCollect".

@shivaram
Contributor

Yeah we can remove the duplication by having separate rd files or by just removing documentation for the overlapping arguments (I think in this case x and func are the same for dapply and dapplyCollect).

@NarineK feel free to open a separate JIRA/PR for this

@SparkQA

SparkQA commented Jun 20, 2016

Test build #60863 has finished for PR 13660 at commit 3f2aea9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

docs/sparkr.md Outdated
</div>

##### dapplyCollect
Like `dapply`, apply a function to each partition of `SparkDataFrame` and collect the result back.
Contributor

I think it's good to say a couple of things here. First, we don't require any schema to be passed in to dapplyCollect (unlike dapply). Second, it's good to remind users that this should be used only if the output of the UDF run on all the partitions can fit in driver memory.
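
A minimal sketch of both points, assuming a local Spark installation and the built-in `faithful` dataset. No schema is passed, and the result is a local R `data.frame` on the driver, which is why the combined output of all partitions has to fit in driver memory. The column name `waiting_secs` is an illustrative choice.

```r
library(SparkR)

# Assumption: Spark is installed locally and SparkR can start a session.
sparkR.session()

df <- createDataFrame(faithful)

# The function runs on each partition (a local data.frame) and the
# per-partition results are combined into one local data.frame.
# No schema argument is needed because nothing stays distributed.
ldf <- dapplyCollect(
  df,
  function(x) {
    cbind(x, waiting_secs = x$waiting * 60)
  }
)

head(ldf, 3)
```

For outputs too large for the driver, `dapply` with an explicit schema keeps the result as a `SparkDataFrame` instead.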

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60918 has finished for PR 13660 at commit ae26233.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui
Contributor

sun-rui commented Jun 21, 2016

Can you add documentation for gapply() and gapplyCollect() here as well? Or will @NarineK do that in another PR?

docs/sparkr.md Outdated
</div>

### Applying User-defined Function
In SparkR, we support several kinds for User-defined Functions:
Contributor

several kinds of?

docs/sparkr.md Outdated
</div>

##### dapplyCollect
Like `dapply`, apply a function to each partition of `SparkDataFrame` and collect the result back. The output of function
Contributor

apply a function to each partition of a SparkDataFrame

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60957 has finished for PR 13660 at commit 8d4f163.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

@felixcheung @jkbradley any more comments on this?

@jkbradley
Member

LGTM. I'll merge this with master and branch-2.0
Thanks!

@asfgit closed this in 43b04b7 Jun 22, 2016
asfgit pushed a commit that referenced this pull request Jun 22, 2016
## What changes were proposed in this pull request?
Guide for
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions

## How was this patch tested?
build locally
<img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">

Author: Kai Jiang <[email protected]>

Closes #13660 from vectorijk/spark-15672-R-guide-update.

(cherry picked from commit 43b04b7)
Signed-off-by: Joseph K. Bradley <[email protected]>