
Conversation

@dongjoon-hyun (Member)

What changes were proposed in this pull request?

This PR adds the `spark_partition_id` virtual column function to SparkR for API parity.

The following example illustrates SparkR usage on a partitioned Parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.

```r
> collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
   id SPARK_PARTITION_ID()
1   3                    0
2   4                    0
3   8                    1
4   9                    1
5   0                    2
6   1                    3
7   2                    4
8   5                    5
9   6                    6
10  7                    7
```

How was this patch tested?

Passes the Jenkins tests (including a new test case).

@dongjoon-hyun dongjoon-hyun changed the title Add spark_partition_id in SparkR [SPARK-16053][R] Add spark_partition_id in SparkR Jun 19, 2016
@SparkQA

SparkQA commented Jun 19, 2016

Test build #60803 has finished for PR 13768 at commit 26b9781.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 19, 2016

Test build #60806 has finished for PR 13768 at commit e7e471c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Hi, @davies. Could you review this PR?

```r
#' @rdname spark_partition_id
#' @export
setGeneric("spark_partition_id", function(x) { standardGeneric("spark_partition_id") })
```

Member

Shouldn't this go to L1080? This should be kept sorted.

Member Author

Do you mean before `sd`?
Currently it is already sorted, isn't it?
`soundex` -> `spark_partition_id` -> `stddev`

@davies (Contributor)

davies commented Jun 20, 2016

LGTM

@dongjoon-hyun (Member Author)

Thank you for the review, @davies!

```r
column(jc)
})

#' spark_partition_id
```
Contributor

Minor nit: the convention we are using in SparkR is to have a descriptive title for the function. So in this case it would be something like "Return the partition ID as a column". (There might be other places which need to be fixed to match this convention as well -- we discussed this in #13394.)

Member Author

Oh, I see. I'll fix them.

@dongjoon-hyun (Member Author)

dongjoon-hyun commented Jun 20, 2016

Thank you, @shivaram .
According to your advice and #13394 , I fixed the title convention.
It seems that's all for this PR.


```r
#' Return the partition ID as a column
#'
#' Return the column for partition ID of the Spark task.
```
Contributor

A couple of minor nits:

  • To be consistent with the title, this line can be "Return the partition ID of the Spark task as a SparkDataFrame column".
  • I think "nondeterministic" is more suitable than "indeterministic".

Member Author

Thanks. That seems better.
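Putting the review suggestions together, the documented method would look roughly like this. This is a sketch reconstructed from the snippets quoted above, following the usual SparkR pattern of delegating to the Scala function via `callJStatic`; the exact roxygen text and `@examples` in the merged commit may differ. It is SparkR package-internal code and needs a running Spark JVM backend to execute.

```r
#' Return the partition ID as a column
#'
#' Return the partition ID of the Spark task as a SparkDataFrame column.
#' Note that this is nondeterministic because it depends on data partitioning
#' and task scheduling.
#'
#' @rdname spark_partition_id
#' @export
#' @examples
#' \dontrun{select(df, spark_partition_id())}
setMethod("spark_partition_id",
          signature("missing"),
          function() {
            # Delegate to org.apache.spark.sql.functions.spark_partition_id
            # on the JVM side and wrap the returned Java Column object.
            jc <- callJStatic("org.apache.spark.sql.functions", "spark_partition_id")
            column(jc)
          })
```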

@SparkQA

SparkQA commented Jun 20, 2016

Test build #60861 has finished for PR 13768 at commit cbd54b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 20, 2016

Test build #60867 has finished for PR 13768 at commit ddb2102.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor)

Thanks for the updates. LGTM. Merging this to master and branch-2.0.

@asfgit asfgit closed this in b0f2fb5 Jun 20, 2016
asfgit pushed a commit that referenced this pull request Jun 20, 2016
Author: Dongjoon Hyun <[email protected]>

Closes #13768 from dongjoon-hyun/SPARK-16053.

(cherry picked from commit b0f2fb5)
Signed-off-by: Shivaram Venkataraman <[email protected]>
@dongjoon-hyun (Member Author)

Thank you for merging, @shivaram!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-16053 branch July 20, 2016 07:39