[SPARK-17817][PySpark][FOLLOWUP] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes #15445
Conversation
Test build #66789 has finished for PR 15445 at commit
Do you have a benchmark for this change?
A performance-related change should be verified by a benchmark, unless it's obvious.
Just saw the results from that PR (posting them here would be great); we may not need this PR if there is no noticeable difference (even for complicated types).
@felixcheung I posted the benchmark in #15389. Now posting it here too.
@davies @felixcheung I ran another benchmark as follows: `_to_java_object_rdd()`: 424.308749914. The time difference is not obvious. However, when I ran another benchmark with numpy arrays, I did find a difference. Considering the cost of pickling Python objects when converting to a Java RDD, I think this PR might be the better solution.
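The numpy snippet itself isn't reproduced above; a minimal sketch of such a benchmark, with the array shape and element count as purely illustrative assumptions, would look like:

```python
# Hypothetical numpy variant of the benchmark above; array shape and
# count are assumptions, not the original values. Each element is a
# numpy array, which is costlier to pickle than a plain int, so the
# serialization overhead of each approach shows up more clearly.
import time
import numpy as np

num_partitions = 20000
a = sc.parallelize([np.arange(100) for _ in range(100000)], 2)

start = time.time()
l = a.repartition(num_partitions).glom().map(len).collect()
end = time.time()
print(end - start)
```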
ping @davies @felixcheung Could you take a look to see if we want to apply this? Thanks! |
ping @davies @felixcheung Could you review this again? Thanks.
@viirya That's a good point: javaToPython can only be used for known types that can be deserialized in Java (for example, some types in sql and ml), so this PR makes better sense. Merging this into master.
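For context, the #15389 alternative worked around the skew by round-tripping through a Java object RDD. A rough sketch of that idea, leaning on PySpark's private `_to_java_object_rdd()` and `SerDeUtil` helpers (so the details here are assumptions for illustration, not the exact #15389 diff):

```python
# Rough sketch of the #15389-style workaround (illustrative only):
# unpickle the Python data into Java objects, repartition on the Java
# side, then pickle the results back for Python. The unpickle/pickle
# round trip is the cost being weighed in the benchmarks above, and it
# only works for types Java can actually deserialize.
from pyspark.rdd import RDD
from pyspark.serializers import AutoBatchedSerializer, PickleSerializer

def repartition_via_java(rdd, num_partitions):
    jrdd = rdd._to_java_object_rdd().repartition(num_partitions)
    jpickled = rdd.ctx._jvm.SerDeUtil.javaToPython(jrdd)
    return RDD(jpickled, rdd.ctx, AutoBatchedSerializer(PickleSerializer()))
```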
@davies @felixcheung Thanks! |
What changes were proposed in this pull request?

This change is a followup for #15389, which calls `_to_java_object_rdd()` to solve this issue. Given the concern that this call may be expensive, we can instead decrease the serialization batch size to solve the issue. Simple benchmark:

```python
import time

num_partitions = 20000
a = sc.parallelize(range(int(1e6)), 2)

start = time.time()
l = a.repartition(num_partitions).glom().map(len).collect()
end = time.time()
print(end - start)
```

Before: 419.447577953
`_to_java_object_rdd()`: 421.916361094
decreasing the batch size: 423.712255955

How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <[email protected]>
Closes apache#15445 from viirya/repartition-batch-size.
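For reference, a minimal sketch of what decreasing the batch size looks like inside PySpark's `RDD.coalesce` (which `repartition` delegates to). The helper names and the batch size of 10 reflect one plausible reading of this change, not necessarily the verbatim merged diff:

```python
# Approximate sketch of the change in pyspark/rdd.py (a method of RDD);
# illustrative, not guaranteed to match the merged diff line-for-line.
from pyspark.serializers import BatchedSerializer, PickleSerializer

def coalesce(self, numPartitions, shuffle=False):
    if shuffle:
        # Decrease the batch size so elements are distributed evenly
        # across output partitions. With large pickled batches, the
        # Java-side round-robin shuffle moves whole batches as single
        # elements, which is what produces highly skewed partitions.
        batchSize = min(10, self.ctx._batchSize or 1024)
        ser = BatchedSerializer(PickleSerializer(), batchSize)
        selfCopy = self._reserialize(ser)
        jrdd = selfCopy._jrdd.coalesce(numPartitions, shuffle)
    else:
        jrdd = self._jrdd.coalesce(numPartitions, shuffle)
    return RDD(jrdd, self.ctx, self._jrdd_deserializer)
```

The trade-off discussed in this thread: re-serializing with a smaller batch size costs a little extra pickling work, but avoids the full Python-to-Java object conversion of `_to_java_object_rdd()`, and it works for arbitrary Python objects rather than only types Java can deserialize.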