-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-32079][PYTHON] Remove namedtuple hack by replacing built-in pickle to cloudpickle #34688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a7c71a2 to
7fe5438
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
7fe5438 to
8962685
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
I added a couple of micro benchmarks to the PR description. |
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we kept same test coverage for two serializers?
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good if we've not missed current test coverage.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
…le to cloudpickle
5fb45a2 to
1c3c50a
Compare
|
Rebased. |
|
I'll merge this one in few days if there aren't any comment. |
|
Test build #145607 has finished for PR 34688 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Thanks, @viirya. Merged to master. |
…ple hack ### What changes were proposed in this pull request? This PR is a followup of #34688 that adds a switch to turn on and off the namedtuple hack. ### Why are the changes needed? There are still behaviour differences between regular pickle and Cloudpickle e.g., bug fixes from the upstream. It's safer to have a switch to turn on and off for the time being. ### Does this PR introduce _any_ user-facing change? This remains as an internal environment so ideally no. In fact the main change itself was the internal change too. ### How was this patch tested? Manually tested. Closes #38700 from HyukjinKwon/SPARK-41189. Lead-authored-by: Hyukjin Kwon <[email protected]> Co-authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…ple hack ### What changes were proposed in this pull request? This PR is a followup of apache#34688 that adds a switch to turn on and off the namedtuple hack. ### Why are the changes needed? There are still behaviour differences between regular pickle and Cloudpickle e.g., bug fixes from the upstream. It's safer to have a switch to turn on and off for the time being. ### Does this PR introduce _any_ user-facing change? This remains as an internal environment so ideally no. In fact the main change itself was the internal change too. ### How was this patch tested? Manually tested. Closes apache#38700 from HyukjinKwon/SPARK-41189. Lead-authored-by: Hyukjin Kwon <[email protected]> Co-authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…ple hack ### What changes were proposed in this pull request? This PR is a followup of apache#34688 that adds a switch to turn on and off the namedtuple hack. ### Why are the changes needed? There are still behaviour differences between regular pickle and Cloudpickle e.g., bug fixes from the upstream. It's safer to have a switch to turn on and off for the time being. ### Does this PR introduce _any_ user-facing change? This remains as an internal environment so ideally no. In fact the main change itself was the internal change too. ### How was this patch tested? Manually tested. Closes apache#38700 from HyukjinKwon/SPARK-41189. Lead-authored-by: Hyukjin Kwon <[email protected]> Co-authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…ple hack ### What changes were proposed in this pull request? This PR is a followup of apache#34688 that adds a switch to turn on and off the namedtuple hack. ### Why are the changes needed? There are still behaviour differences between regular pickle and Cloudpickle e.g., bug fixes from the upstream. It's safer to have a switch to turn on and off for the time being. ### Does this PR introduce _any_ user-facing change? This remains as an internal environment so ideally no. In fact the main change itself was the internal change too. ### How was this patch tested? Manually tested. Closes apache#38700 from HyukjinKwon/SPARK-41189. Lead-authored-by: Hyukjin Kwon <[email protected]> Co-authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
This PR proposes to replace Python's built-in CPickle to CPickle-based cloudpickle (requires Python 3.8+).
For Python 3.7 and below, it still uses the legacy built-in CPickle for the performance matter.
I did a bit of benchmark with basic cases, and I have seen no performance penalty (attached one of the benchmarks below).
Why are the changes needed?
To remove named tuple hack for the issues such as: SPARK-32079, SPARK-22674 and SPARK-27810.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Micro benchmark:
E2E benchmark:
Before:
2.3216118812561035 (sec)
After:
2.279919147491455 (sec)
Existing test cases should cover all test cases.