[SPARK-19366][SQL] add getNumPartitions to Dataset #16708
 * Returns the number of partitions of this Dataset.
 * @group basic
 * @since 2.2.0
 */
Why is this not just numPartitions?
|
Actually, why do we need this? I worry it can be a confusing API due to optimizer behavior.
|
Test build #72011 has finished for PR 16708 at commit
|
|
Test build #72016 has started for PR 16708 at commit
|
Test FAILed.
|
@rxin this was suggested by @cloud-fan; more details are in this thread here. The original concerns were around the overhead of the extra conversion needed in Python and R (to PythonRDD, RRDD), and that it would be much lighter weight to have a method in Scala for this. Now that we have a simple workaround in R (just calling the Scala method without conversion), I don't feel strongly about this, so I'm OK to close it. I do agree about the optimizer behavior, but this has been a very frequently requested method and its uses as
Perhaps it is worthwhile to explain that this is a number to expect, but one that can be optimized out.
|
Basically I want to push back against exposing this as a public API ...
|
@rxin do you think this will be confusing because the results might change over time?
|
Yes.
|
Isn't the Dataset immutable? I.e., the optimizer is called once when the RDD is materialized, and the RDD doesn't change after that?
|
@shivaram yea, once the Dataset is materialized, the partition number won't change. I think rxin's concern is that when people try to get the partition number, the result is unpredictable because too many factors can affect it.
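The distinction drawn in this exchange can be illustrated with a small toy model. The sketch below is plain Python, not Spark: `ToyConf`, `ToyDataset`, and `shuffle_partitions` are hypothetical names chosen for illustration. It assumes only that the partition count is decided once, when the plan is first materialized, from whatever settings are in effect at that moment, and stays fixed afterwards.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyConf:
    # Hypothetical setting standing in for the many factors that affect planning.
    shuffle_partitions: int = 200


@dataclass
class ToyDataset:
    """Toy stand-in for a lazily planned dataset (not Spark's Dataset)."""
    conf: ToyConf
    needs_shuffle: bool
    input_partitions: int
    _materialized: Optional[int] = field(default=None, init=False)

    def get_num_partitions(self) -> int:
        # Plan once; later calls reuse the materialized result.
        if self._materialized is None:
            if self.needs_shuffle:
                self._materialized = self.conf.shuffle_partitions
            else:
                self._materialized = self.input_partitions
        return self._materialized


conf = ToyConf()
first = ToyDataset(conf, needs_shuffle=True, input_partitions=8)
n_first = first.get_num_partitions()        # planned under the initial setting

conf.shuffle_partitions = 50
n_first_again = first.get_num_partitions()  # already materialized, so unchanged

second = ToyDataset(conf, needs_shuffle=True, input_partitions=8)
n_second = second.get_num_partitions()      # same query shape, different answer
```

In this model a single dataset never changes its answer, which matches the immutability argument, yet two datasets built from the same query report different counts because an outside setting changed in between. That is the kind of unpredictability the objection is about.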
What changes were proposed in this pull request?
As suggested by @cloud-fan here, adding a simple wrapper in Scala can help avoid inefficiency with non-JVM cases
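To make the inefficiency concrete, here is a toy Python model of the two call paths. It is a sketch under stated assumptions, not Spark or PySpark code: `ToyJvmDataset`, `ToyLanguageRDD`, `to_language_rdd`, and the `conversions` counter are all hypothetical. It only mimics the shape of the problem, where asking for the partition count through a language-specific RDD forces a conversion step, while delegating to a method on the JVM-side object does not.

```python
class ToyLanguageRDD:
    """Stand-in for a language-specific RDD wrapper; not a real Spark class."""

    def __init__(self, num_partitions: int) -> None:
        self._num_partitions = num_partitions

    def get_num_partitions(self) -> int:
        return self._num_partitions


class ToyJvmDataset:
    """Stand-in for the JVM-side object; not a real Spark class."""

    def __init__(self, num_partitions: int) -> None:
        self._num_partitions = num_partitions
        self.conversions = 0  # how many language-specific RDDs were built

    def to_language_rdd(self) -> ToyLanguageRDD:
        # Models the extra conversion step described in the discussion.
        self.conversions += 1
        return ToyLanguageRDD(self._num_partitions)

    def get_num_partitions(self) -> int:
        # Models the proposed wrapper: answer directly, with no conversion.
        return self._num_partitions


jvm_ds = ToyJvmDataset(num_partitions=4)

# Path 1: convert first, then ask the converted object.
via_rdd = jvm_ds.to_language_rdd().get_num_partitions()
conversions_after_rdd_path = jvm_ds.conversions

# Path 2: ask the JVM-side object directly through the wrapper.
via_wrapper = jvm_ds.get_num_partitions()
conversions_after_wrapper = jvm_ds.conversions
```

Both paths return the same number; only the first one increments the conversion counter, which is the cost the wrapper is meant to avoid.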
How was this patch tested?
unit tests