[SPARK-18161] [Python] Update cloudpickle to v0.6.1 #20691
Conversation
@holdenk review needed

holdenk left a comment:
Jenkins, ok to test.
Hmmm - Jenkins seems not to be playing ball.

ok to test

Have you tried serializing an array larger than 2 GB? There is a pretty big chance that we do not support that on the Spark side.

Good point; it would be good to add a test case for a > 4 GB object.

I am not sure that adding such a test is very good for test stability, but we could disable it by default (see the sketch below).
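A minimal sketch of what such an opt-in test could look like (the class, method, and environment-variable names here are hypothetical, not from this PR):

```python
import os
import pickle
import unittest

# Opt-in sketch: objects over 4 GiB only pickle with protocol 4+, and
# allocating them is expensive, so the test stays disabled by default.
@unittest.skipUnless(os.environ.get("ENABLE_HUGE_PICKLE_TESTS"),
                     "huge-object serialization test is disabled by default")
class HugeObjectSerializationTests(unittest.TestCase):
    def test_pickle_over_4gb(self):
        data = bytes(4 * 1024 ** 3 + 1)  # just over 4 GiB of zero bytes
        blob = pickle.dumps(data, pickle.HIGHEST_PROTOCOL)
        self.assertEqual(len(pickle.loads(blob)), len(data))

if __name__ == "__main__":
    unittest.main()
```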
Test build #87777 has finished for PR 20691 at commit
Well, actually I just wanted to merge an older, seemingly straightforward PR, #15670 :) And @holdenk warned me that "it should just be fixing the merge conflicts".
Force-pushed from e08fcae to e15eb63.
Test build #87852 has finished for PR 20691 at commit

ok to test
Let's give this a shot in 3.0.0. Cloudpickle also changed its protocol from 2 to highest a long time ago, and it looks like there has been no notable regression so far.
Test build #101136 has finished for PR 20691 at commit

retest this please

Test build #101164 has finished for PR 20691 at commit
Looks like the test failures are related. In the current master, everything passes fine.

Yes, my bad. I will change cloudpickle as you advised above. Thanks.
Force-pushed from e15eb63 to 85def5f.
Test build #101205 has finished for PR 20691 at commit

Test build #101241 has finished for PR 20691 at commit
@HyukjinKwon, can you review the changes, please?
Force-pushed from 27d3a85 to 5eca93d.

Force-pushed from 5eca93d to 654ed03.
Test build #101407 has finished for PR 20691 at commit
I decided to remove it: I picked this PR up nearly two years ago to solve that problem, which eventually lost its importance for me. So now I have nothing to add.

Done.
+1 on doing this for Spark 3.0.0; on a quick glance the changes seem OK to me.
python/pyspark/broadcast.py (outdated)

```diff
 def dump(self, value, f):
     try:
-        pickle.dump(value, f, 2)
+        pickle.dump(value, f, pickle.HIGHEST_PROTOCOL)
```
Mind if I ask about the context? Why did we always use protocol 2 previously?
Is this change related to upgrading cloudpickle?
Ah, yeah. This PR previously set the protocol to the highest one to support 4 GB+ pickles in the regular pickle alone (not including cloudpickle). So I suggested targeting the cloudpickle upgrade, because newer cloudpickle has that change to use the highest protocol, even though upgrading cloudpickle is slightly orthogonal.
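For illustration (not from this PR): protocols below 4 only have 32-bit length fields, so a 4 GiB+ payload fails outright, while the highest protocol handles it.

```python
import pickle

big = bytes(4 * 1024 ** 3 + 1)  # just over 4 GiB; needs several GiB of RAM

try:
    pickle.dumps(big, protocol=2)
except OverflowError as err:
    # Protocols below 4 cannot express lengths over 4 GiB.
    print("protocol 2 failed:", err)

# Protocol 4 (pickle.HIGHEST_PROTOCOL on Python 3.4+) adds 8-byte lengths
# and framing, so the same payload round-trips.
blob = pickle.dumps(big, protocol=pickle.HIGHEST_PROTOCOL)
assert len(pickle.loads(blob)) == len(big)
```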
Yeah, it would be great if we knew the context for why it was set to 2 previously. I suspect there's no particular reason, but it would be good to double-check and record the reason if we can find it.

The highest pickle protocol is 2 in Python 2 and 4 in Python 3.4+, so we are changing it from 2 to 4 on Python 3.4+.

One possibility is that it was set to 2 out of concern about writing and reading across different Python versions, but I don't think that's guaranteed in PySpark anyway. Maybe we should explicitly note this somewhere as well.
It happened here: 6cf5076#diff-bb67501acde415576c589b478e16c60aR82, and it has never changed since.

I agree that there was no particular reason for it, since pickle.HIGHEST_PROTOCOL in Python 2 has been 2 for ages, not 3 or 4. Using pickle.HIGHEST_PROTOCOL consistently should be safe for that reason.
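A quick way to see the values in question on any interpreter (illustrative sketch):

```python
import pickle

# pickle.HIGHEST_PROTOCOL depends on the interpreter:
#   Python 2.x      -> 2
#   Python 3.0-3.3  -> 3
#   Python 3.4-3.7  -> 4
print(pickle.HIGHEST_PROTOCOL)

# Protocol 4 (PEP 3154) is the first with 64-bit framing, i.e. the first
# that can serialize objects larger than 4 GiB.
```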
Test build #101426 has finished for PR 20691 at commit
Looks fine to me. I'm going to take a few more looks. It would be great if other people took a look as well.
retest this please
@inpefess, mind if I ask you to double-check together? We should compare:

```python
import pickle
import pickletools
print(pickletools.dis(pickle.dumps(obj, protocol=3)))
```

vs

```python
import pickle
import pickletools
print(pickletools.dis(pickle.dumps(obj, protocol=4)))
```
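For a small concrete case (illustrative, output abbreviated), here is roughly what that comparison surfaces:

```python
import pickle
import pickletools

pickletools.dis(pickle.dumps("spark", protocol=3))
# roughly:
#     0: \x80 PROTO      3
#     2: X    BINUNICODE 'spark'
#    12: q    BINPUT     0
#    14: .    STOP

pickletools.dis(pickle.dumps("spark", protocol=4))
# roughly:
#     0: \x80 PROTO      4
#     2: \x95 FRAME      9
#    11: \x8c SHORT_BINUNICODE 'spark'
#    18: \x94 MEMOIZE    (as 0)
#    19: .    STOP

# Protocol 4 adds the FRAME opcode (8-byte framing) and compact opcodes
# such as SHORT_BINUNICODE and MEMOIZE.
```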
Adding @JoshRosen as well.

Test build #101615 has finished for PR 20691 at commit
This looks reasonable for Spark 3; is the comment at #20691 (comment) still pending?
It's okay. I roughly checked it and wanted someone to double-check. I guess it's okay to try and go ahead in Spark 3.
Test build #4545 has finished for PR 20691 at commit
OK. I checked it again and it looks good. BTW, @inpefess, #20691 (comment) should have been checked together with the PR proposal, strictly speaking, since PySpark has faced related issues before. Let's be clear when we add fixes to a core path next time. Nevertheless, thanks for your efforts to get this in. I was almost about to forget to do it in Spark 3.x. Merged to master.
BTW, cloudpickle 0.7.0 came out 9 days ago, and it looks like the next release will be a major version bump. It might be better to match it to 0.7.0. I am going to backport some important bug fixes into the 0.7.x branches at cloudpickle/cloudpickle.
Lastly, @inpefess, can you leave a comment on the JIRA? I cannot find your user ID to assign the JIRA to.

@HyukjinKwon, thanks for your guidance. Sorry for failing to double-check; I added a comment to the JIRA.
## What changes were proposed in this pull request?

In this PR we've done two things:

1) Updated Spark's copy of cloudpickle to 0.6.1 (the current stable release). The main reason Spark stayed with cloudpickle 0.4.x was that the default pickle protocol was changed in later versions.

2) Started using pickle.HIGHEST_PROTOCOL for both Python 2 and Python 3 for serializers and broadcast. [Pyrolite](https://github.com/irmen/Pyrolite) has the following pickle protocol support: reading: 0, 1, 2, 3, 4; writing: 2.

## How was this patch tested?

Jenkins tests.

Authors: Sloane Simmons, Boris Shminke

This contribution is original work of Sloane Simmons and Boris Shminke, and they licensed it to the project under the project's open source license.

Closes apache#20691 from inpefess/pickle_protocol_4.

Lead-authored-by: Boris Shminke <[email protected]>
Co-authored-by: singularperturbation <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Has anybody tested this new change in Python 2?

Yes, it was tested in Python 2, but it looks like cloudpickle had a critical regression/fix. I can file a JIRA related to that cloudpickle fix, and a fix with a test on the PySpark side.
## What changes were proposed in this pull request?

After upgrading cloudpickle to 0.6.1 at #20691, one regression was found. Cloudpickle had a critical fix for it, cloudpipe/cloudpickle#240. Basically, it currently looks like existing globals override the globals shipped with a function, meaning:

**Before:**

```python
>>> def hey():
...     return "Hi"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
>>> def hey():
...     return "Yeah"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
```

**After:**

```python
>>> def hey():
...     return "Hi"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
>>> def hey():
...     return "Yeah"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Yeah']
```

Therefore, this PR upgrades cloudpickle to 0.8.0. Note that cloudpickle's release cycle is quite short: between 0.6.1 and 0.7.0 it contains only minor bug fixes, and I don't see notable changes to double-check and/or avoid. There is virtually only this fix between 0.7.0 and 0.8.1; other fixes are about testing.

## How was this patch tested?

Manually tested, and tests were added. Verified that unit tests were added in cloudpickle.

Closes #23904 from HyukjinKwon/SPARK-27000.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…t protocol

## What changes were proposed in this pull request?

This PR partially reverts #20691.

After we changed the Python pickle protocol to the highest one, it seems to have introduced a correctness bug. This potentially affects all Python-related code paths. I suspect a bug related to Pyrolite (maybe the opcodes `MEMOIZE`, `FRAME`, and/or our `RowPickler`). I would like to stick to the default protocol for now and investigate the issue separately; I will investigate later to bring the highest protocol back.

## How was this patch tested?

A unit test was added.

```bash
./run-tests --python-executables=python3.7 --testname "pyspark.sql.tests.test_serde SerdeTests.test_int_array_serialization"
```

Closes #24519 from HyukjinKwon/SPARK-27612.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
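A rough sketch of the kind of round-trip that test exercises (hypothetical body; the actual test lives in `pyspark.sql.tests.test_serde`):

```python
import array

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("serde-check").getOrCreate()

# createDataFrame pickles rows on the Python side and Pyrolite unpickles
# them on the JVM side -- the code path suspected above.
df = spark.createDataFrame([(array.array("i", [1, 2, 3]),)], "arr: array<int>")
[row] = df.collect()
assert list(row.arr) == [1, 2, 3]

spark.stop()
```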