[SPARK-16053][R] Add spark_partition_id in SparkR
#13768
Conversation
Test build #60803 has finished for PR 13768 at commit

Test build #60806 has finished for PR 13768 at commit
Hi, @davies.
```r
#' @rdname spark_partition_id
#' @export
setGeneric("spark_partition_id", function(x) { standardGeneric("spark_partition_id") })
```
Shouldn't this go to L1080? This should be sorted.
Do you mean before `sd`?
Currently, it is sorted already, isn't it?
`soundex` -> `spark_partition_id` -> `stddev`?
LGTM

Thank you for the review, @davies!
R/pkg/R/functions.R (outdated)

```r
    column(jc)
  })

#' spark_partition_id
```
Minor nit: the convention we are using in SparkR is to have a descriptive title for the function. So in this case it would be something like "Return the partition ID as a column". (There might be other places which need to be fixed to match this convention as well -- we discussed this in #13394.)
Oh, I see. I'll fix them.
R/pkg/R/functions.R (outdated)

```r
#' Return the partition ID as a column
#'
#' Return the column for partition ID of the Spark task.
```
A couple of minor nits:
- To be consistent with the title, this line can be "Return the partition ID of the Spark task as a SparkDataFrame column".
- I think `nondeterministic` is more suitable than `indeterministic`.
Thanks. That seems better.
Test build #60861 has finished for PR 13768 at commit

Test build #60867 has finished for PR 13768 at commit
Thanks for the updates. LGTM. Merging this to master and branch-2.0.
## What changes were proposed in this pull request?
This PR adds `spark_partition_id` virtual column function in SparkR for API parity.
The following example illustrates SparkR usage on a partitioned Parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.
```r
> collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
id SPARK_PARTITION_ID()
1 3 0
2 4 0
3 8 1
4 9 1
5 0 2
6 1 3
7 2 4
8 5 5
9 6 6
10 7 7
```
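For context, the R-side wiring for such a column function can be sketched as follows. This mirrors the generic and roxygen snippets quoted in the review above, but the exact method body is an assumption modeled on other SparkR column functions, not a verbatim copy of the merged patch:

```r
# Sketch (assumed shape) of how spark_partition_id is wired in SparkR.
# The S4 generic is declared once, then a method calls into the JVM-side
# org.apache.spark.sql.functions helper and wraps the result as a Column.

#' @rdname spark_partition_id
#' @export
setGeneric("spark_partition_id", function(x) { standardGeneric("spark_partition_id") })

#' Return the partition ID as a column
#'
#' Return the partition ID of the Spark task as a SparkDataFrame column.
#' Note that this is nondeterministic because it depends on data
#' partitioning and task scheduling.
#'
#' @rdname spark_partition_id
#' @export
setMethod("spark_partition_id",
          signature("missing"),
          function() {
            # Delegate to the JVM-side Spark SQL function of the same name
            jc <- callJStatic("org.apache.spark.sql.functions", "spark_partition_id")
            column(jc)
          })
```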
## How was this patch tested?
Pass the Jenkins tests (including a new test case).
Author: Dongjoon Hyun <[email protected]>
Closes #13768 from dongjoon-hyun/SPARK-16053.
(cherry picked from commit b0f2fb5)
Signed-off-by: Shivaram Venkataraman <[email protected]>
Thank you for merging, @shivaram!
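As a quick illustration of the merged function, one could count rows per partition in SparkR. This is a hedged sketch: the session setup and sample data are illustrative and not part of the PR, and it assumes a working local Spark installation:

```r
library(SparkR)
sparkR.session()

# Illustrative data: 10 rows spread across 3 partitions
df <- repartition(createDataFrame(data.frame(id = 1:10)), 3L)

# Tag each row with the ID of the partition that produced it
tagged <- withColumn(df, "pid", spark_partition_id())

# Count rows per partition; the exact distribution is nondeterministic
# because it depends on Spark's partitioning and task scheduling
collect(count(groupBy(tagged, "pid")))

sparkR.session.stop()
```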