
Conversation

@dongjoon-hyun (Member)

What changes were proposed in this pull request?

This PR adds the `spark_partition_id` virtual column function to SparkR for API parity.

The following example illustrates SparkR usage on a partitioned Parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.

```r
> collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
   id SPARK_PARTITION_ID()
1   3                    0
2   4                    0
3   8                    1
4   9                    1
5   0                    2
6   1                    3
7   2                    4
8   5                    5
9   6                    6
10  7                    7
```

How was this patch tested?

Passes the Jenkins tests (including a new test case).

@dongjoon-hyun dongjoon-hyun changed the title Add spark_partition_id in SparkR [SPARK-16053][R] Add spark_partition_id in SparkR Jun 19, 2016
@SparkQA

SparkQA commented Jun 19, 2016

Test build #60803 has finished for PR 13768 at commit 26b9781.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 19, 2016

Test build #60806 has finished for PR 13768 at commit e7e471c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Hi, @davies. Could you review this PR?

```r
#' @rdname spark_partition_id
#' @export
setGeneric("spark_partition_id", function(x) { standardGeneric("spark_partition_id") })
```

Member

Shouldn't this go to L1080? This should be kept sorted.

Member Author

Do you mean before `sd`?
Currently it is already sorted, isn't it?
`soundex` -> `spark_partition_id` -> `stddev`

@davies (Contributor)

davies commented Jun 20, 2016

LGTM

@dongjoon-hyun (Member Author)

Thank you for the review, @davies!

```r
column(jc)
})

#' spark_partition_id
```
Contributor

Minor nit: the convention we are using in SparkR is to have a descriptive title for the function. So in this case it would be something like "Return the partition ID as a column". (There might be other places which need to be fixed to match this convention as well -- we discussed this in #13394.)

Member Author

Oh, I see. I'll fix them.

@dongjoon-hyun (Member Author)

dongjoon-hyun commented Jun 20, 2016

Thank you, @shivaram .
According to your advice and #13394 , I fixed the title convention.
It seems that's all for this PR.


```r
#' Return the partition ID as a column
#'
#' Return the column for partition ID of the Spark task.
```
Contributor

A couple of minor nits:

  • To be consistent with the title, this line can be "Return the partition ID of the Spark task as a SparkDataFrame column".
  • I think "nondeterministic" is more suitable than "indeterministic".

Member Author

Thanks. That seems better.
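Putting the review suggestions together, the documented method would look roughly like this. This is a sketch reconstructed from the snippets quoted above, following the usual SparkR pattern of delegating to the Scala function via `callJStatic`; the exact roxygen text and `@examples` in the merged commit may differ. It is SparkR package-internal code and needs a running Spark JVM backend to execute.

```r
#' Return the partition ID as a column
#'
#' Return the partition ID of the Spark task as a SparkDataFrame column.
#' Note that this is nondeterministic because it depends on data partitioning
#' and task scheduling.
#'
#' @rdname spark_partition_id
#' @export
#' @examples
#' \dontrun{select(df, spark_partition_id())}
setMethod("spark_partition_id",
          signature("missing"),
          function() {
            # Delegate to org.apache.spark.sql.functions.spark_partition_id
            # on the JVM side and wrap the returned Java Column object.
            jc <- callJStatic("org.apache.spark.sql.functions", "spark_partition_id")
            column(jc)
          })
```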

@SparkQA

SparkQA commented Jun 20, 2016

Test build #60861 has finished for PR 13768 at commit cbd54b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 20, 2016

Test build #60867 has finished for PR 13768 at commit ddb2102.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor)

Thanks for the updates. LGTM. Merging this to master and branch-2.0.

@asfgit asfgit closed this in b0f2fb5 Jun 20, 2016
asfgit pushed a commit that referenced this pull request Jun 20, 2016
Author: Dongjoon Hyun <[email protected]>

Closes #13768 from dongjoon-hyun/SPARK-16053.

(cherry picked from commit b0f2fb5)
Signed-off-by: Shivaram Venkataraman <[email protected]>
@dongjoon-hyun (Member Author)

Thank you for merging, @shivaram!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-16053 branch July 20, 2016 07:39