[SPARK-19366][SQL] add getNumPartitions to Dataset #16708

felixcheung · 2017-01-26T06:06:31Z

What changes were proposed in this pull request?

As suggested by @cloud-fan, adding a simple wrapper in Scala helps avoid inefficiency in the non-JVM (Python, R) cases.

How was this patch tested?

unit tests

rxin · 2017-01-26T06:07:40Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+   * Returns the number of partitions of this Dataset.
+   * @group basic
+   * @since 2.2.0
+   */


why is this not just numPartitions?

rxin · 2017-01-26T06:08:26Z

Actually - why do we need this? I worry it can be a confusing API due to optimizer behavior.

SparkQA · 2017-01-26T06:13:10Z

Test build #72011 has finished for PR 16708 at commit 68cb3e2.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-26T07:28:40Z

Test build #72016 has started for PR 16708 at commit 048759b.

AmplabJenkins · 2017-01-26T08:05:09Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72016/
Test FAILed.

felixcheung · 2017-01-26T20:32:59Z

@rxin this was suggested by @cloud-fan; there is more detail in the linked thread.

The original concerns were around the overhead with the extra conversion needed in Python and R (to PythonRDD, RRDD) and that it would be much lighter weight to have a method in Scala for this.

Now that we have a simple workaround in R (by just calling the Scala method without conversion), I'm not feeling strongly about this so I'm ok to close this.

I do agree about the optimizer behavior, but this has been a very frequently requested method, and its use as x.rdd.getNumPartitions is all over PySpark code and documentation.

Perhaps it is worthwhile to document that this is the number to expect, but that the optimizer may change it.
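
For readers following along, here is a minimal sketch of the kind of thin Scala wrapper being discussed. It is not the code from this patch: the object name, the implicit class, and the method name numPartitionsSketch are hypothetical, and it simply delegates to the existing Dataset.rdd and RDD.getNumPartitions calls, assuming Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object NumPartitionsSketch {
  // Hypothetical extension method, for illustration only; not part of the Spark API.
  implicit class DatasetPartitionOps[T](private val ds: Dataset[T]) extends AnyVal {
    // Delegates to the underlying RDD, as x.rdd.getNumPartitions does today.
    def numPartitionsSketch: Int = ds.rdd.getNumPartitions
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("num-partitions-sketch")
      .getOrCreate()
    try {
      // Explicitly repartition so there is a shuffle boundary to inspect.
      val ds = spark.range(0, 100).repartition(4)
      println(s"partitions: ${ds.numPartitionsSketch}")
    } finally {
      spark.stop()
    }
  }
}
```

A wrapper like this lets Python or R callers make one JVM call rather than first converting to a PythonRDD or RRDD, which is the overhead described above.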

rxin · 2017-01-26T21:16:51Z

Basically I want to push back against exposing this as a public API ...

shivaram · 2017-01-26T23:04:45Z

@rxin do you think this will be confusing because the results might change over time?

rxin · 2017-01-26T23:33:14Z

Yes.

shivaram · 2017-01-26T23:46:09Z

Isn't the Dataset immutable? That is, the optimizer is called once when the RDD is materialized, and the RDD doesn't change after that?

cloud-fan · 2017-02-23T01:35:38Z

@shivaram yea, once the Dataset is materialized, the partition number won't change. I think rxin's concern is that when people try to get the partition number, the result is unpredictable because too many factors can affect it.
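
As a hedged illustration of the "too many factors" point, the sketch below runs the same aggregation under two values of spark.sql.shuffle.partitions and prints the partition count each time. The object name and the bucket column are made up for the example. No expected output is shown, because other settings (adaptive query execution, for instance) can change the result further.

```scala
import org.apache.spark.sql.SparkSession

object PartitionCountFactors {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("partition-count-factors")
      .getOrCreate()
    import spark.implicits._
    try {
      val base = spark.range(0, 1000)
      Seq("4", "16").foreach { n =>
        // The shuffle partition setting is one of the factors that affect the count.
        spark.conf.set("spark.sql.shuffle.partitions", n)
        // Same logical query each time; only the session configuration differs.
        val grouped = base.groupBy(($"id" % 10).as("bucket")).count()
        println(s"shuffle.partitions=$n -> ${grouped.rdd.getNumPartitions} partitions")
      }
    } finally {
      spark.stop()
    }
  }
}
```

Each Dataset's count is stable once its plan is fixed, yet the same query text can report different numbers depending on session configuration, which is the source of the API confusion raised in this thread.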

add getNumPartitions and test

68cb3e2

rxin reviewed Jan 26, 2017


commit the right stuff

048759b

felixcheung mentioned this pull request Jan 26, 2017

[SPARK-18788][SPARKR] Add API for getNumPartitions #16668

Closed

felixcheung closed this Feb 17, 2017

zhengruifeng mentioned this pull request Sep 4, 2023

[SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of coalesce/repartition/repartitionByRange #42770

Closed

6 participants
