Skip to content

Conversation

@felixcheung
Copy link
Member

What changes were proposed in this pull request?

As suggested by @cloud-fan here, adding a simple wrapper in Scala can help avoid inefficiency with non-JVM cases

How was this patch tested?

unit tests

* Returns the number of partitions of this Dataset.
* @group basic
* @since 2.2.0
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this not just numPartitions?

@rxin
Copy link
Contributor

rxin commented Jan 26, 2017

Actually - why do we need this? I worry it can be a confusing API due to optimizer behavior.

@SparkQA
Copy link

SparkQA commented Jan 26, 2017

Test build #72011 has finished for PR 16708 at commit 68cb3e2.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 26, 2017

Test build #72016 has started for PR 16708 at commit 048759b.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72016/
Test FAILed.

@felixcheung
Copy link
Member Author

felixcheung commented Jan 26, 2017

@rxin this was as suggested by @cloud-fan in more details in this thread here

The original concerns were around the overhead with the extra conversion needed in Python and R (to PythonRDD, RRDD) and that it would be much lighter weight to have a method in Scala for this.

Now that we have a simple workaround in R (by just calling the Scala method without conversion), I'm not feeling strongly about this so I'm ok to close this.

I do agree with the optimizer behavior but this has been a very frequently requested method and its uses as x.rdd.getNumPartitions is all over PySpark code and documentation.

Perhaps it is worthwhile to explain this is a number to expect but can be optimized out.

@rxin
Copy link
Contributor

rxin commented Jan 26, 2017

Basically I want to push back against exposing this as a public API ...

@shivaram
Copy link
Contributor

@rxin you think this will be confusing as the results might change over time ?

@rxin
Copy link
Contributor

rxin commented Jan 26, 2017

Yes.

@shivaram
Copy link
Contributor

Isnt the dataset immutable ? i.e. the optimizer is called once when the RDD is materialized and the RDD doesn't change after that ?

@cloud-fan
Copy link
Contributor

@shivaram yea, once the Dataset is materialized, the partition number won't change. I think rxin's concern is, when people trying to get the partition number, the result is unpredictable as too many factors can affect it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants