[SPARK-18161] [Python] Update cloudpickle to v0.6.1 #20691
Conversation
@holdenk review needed

holdenk left a comment:
Jenkins, ok to test.
Hmmm - Jenkins seems not to be playing ball.

ok to test

Have you tried serializing an array larger than 2 GB? There is a pretty big chance that we do not support that on the Spark side.

Good point; it would be good to add a test case for a > 4 GB object.

I am not sure that adding such a test is very good for test stability, but we could disable it by default (see the sketch below).
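A minimal sketch of what such an opt-in test could look like (the class, method, and environment-variable names here are hypothetical, not from this PR):

```python
import os
import pickle
import unittest

# Opt-in sketch: objects over 4 GiB only pickle with protocol 4+, and
# allocating them is expensive, so the test stays disabled by default.
@unittest.skipUnless(os.environ.get("ENABLE_HUGE_PICKLE_TESTS"),
                     "huge-object serialization test is disabled by default")
class HugeObjectSerializationTests(unittest.TestCase):
    def test_pickle_over_4gb(self):
        data = bytes(4 * 1024 ** 3 + 1)  # just over 4 GiB of zero bytes
        blob = pickle.dumps(data, pickle.HIGHEST_PROTOCOL)
        self.assertEqual(len(pickle.loads(blob)), len(data))

if __name__ == "__main__":
    unittest.main()
```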
Test build #87777 has finished for PR 20691 at commit
Well, actually I just wanted to merge an older, seemingly straightforward PR, #15670 :) And @holdenk warned me that "it should just be fixing the merge conflicts".
Force-pushed from e08fcae to e15eb63.
Test build #87852 has finished for PR 20691 at commit

ok to test
Let's give this a shot in 3.0.0. Cloudpickle also changed its protocol from 2 to highest a long time ago, and it looks like there has been no notable regression so far.
Test build #101136 has finished for PR 20691 at commit

retest this please

Test build #101164 has finished for PR 20691 at commit
Looks like the test failures are related. In the current master, everything passes fine.

Yes, my bad. I will change cloudpickle as you advised above. Thanks.
Force-pushed from e15eb63 to 85def5f.
Test build #101205 has finished for PR 20691 at commit

Test build #101241 has finished for PR 20691 at commit
@HyukjinKwon, can you review the changes, please?
Force-pushed from 27d3a85 to 5eca93d.

Force-pushed from 5eca93d to 654ed03.
Test build #101407 has finished for PR 20691 at commit
I decided to remove it: I picked this PR up nearly two years ago to solve that problem, which eventually lost its importance for me. So now I have nothing to add.

Done.
+1 on doing this for Spark 3.0.0; on a quick glance the changes seem OK to me.
python/pyspark/broadcast.py (outdated)

```diff
 def dump(self, value, f):
     try:
-        pickle.dump(value, f, 2)
+        pickle.dump(value, f, pickle.HIGHEST_PROTOCOL)
```
Mind if I ask about the context? Why did we always use protocol 2 previously?
Is this change related to upgrading cloudpickle?
Ah, yeah. This PR previously set the protocol to the highest one to support 4 GB+ pickles in the regular pickle alone (not including cloudpickle). So I suggested targeting the cloudpickle upgrade, because newer cloudpickle has that change to use the highest protocol, even though upgrading cloudpickle is slightly orthogonal.
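For illustration (not from this PR): protocols below 4 only have 32-bit length fields, so a 4 GiB+ payload fails outright, while the highest protocol handles it.

```python
import pickle

big = bytes(4 * 1024 ** 3 + 1)  # just over 4 GiB; needs several GiB of RAM

try:
    pickle.dumps(big, protocol=2)
except OverflowError as err:
    # Protocols below 4 cannot express lengths over 4 GiB.
    print("protocol 2 failed:", err)

# Protocol 4 (pickle.HIGHEST_PROTOCOL on Python 3.4+) adds 8-byte lengths
# and framing, so the same payload round-trips.
blob = pickle.dumps(big, protocol=pickle.HIGHEST_PROTOCOL)
assert len(pickle.loads(blob)) == len(big)
```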
Yeah, it would be great if we knew the context for why it was set to 2 previously. I suspect there's no particular reason, but it would be good to double-check and record the reason if we can find it.

The highest pickle protocol is 2 in Python 2 and 4 in Python 3.4+, so we are changing it from 2 to 4 on Python 3.4+.

One possibility is that it was set to 2 out of concern about writing and reading across different Python versions, but I don't think that's guaranteed in PySpark anyway. Maybe we should explicitly note this somewhere as well.
It happened here: 6cf5076#diff-bb67501acde415576c589b478e16c60aR82, and it has never changed since.

I agree that there was no particular reason for it, since pickle.HIGHEST_PROTOCOL in Python 2 has been 2 for ages, not 3 or 4. Using pickle.HIGHEST_PROTOCOL consistently should be safe for that reason.
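A quick way to see the values in question on any interpreter (illustrative sketch):

```python
import pickle

# pickle.HIGHEST_PROTOCOL depends on the interpreter:
#   Python 2.x      -> 2
#   Python 3.0-3.3  -> 3
#   Python 3.4-3.7  -> 4
print(pickle.HIGHEST_PROTOCOL)

# Protocol 4 (PEP 3154) is the first with 64-bit framing, i.e. the first
# that can serialize objects larger than 4 GiB.
```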
Test build #101426 has finished for PR 20691 at commit
Looks fine to me. I'm going to take a few more looks. It would be great if other people took a look as well.
retest this please
@inpefess, mind if I ask you to double-check together? We should compare:

```python
import pickle
import pickletools
print(pickletools.dis(pickle.dumps(obj, protocol=3)))
```

vs

```python
import pickle
import pickletools
print(pickletools.dis(pickle.dumps(obj, protocol=4)))
```
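For a small concrete case (illustrative, output abbreviated), here is roughly what that comparison surfaces:

```python
import pickle
import pickletools

pickletools.dis(pickle.dumps("spark", protocol=3))
# roughly:
#     0: \x80 PROTO      3
#     2: X    BINUNICODE 'spark'
#    12: q    BINPUT     0
#    14: .    STOP

pickletools.dis(pickle.dumps("spark", protocol=4))
# roughly:
#     0: \x80 PROTO      4
#     2: \x95 FRAME      9
#    11: \x8c SHORT_BINUNICODE 'spark'
#    18: \x94 MEMOIZE    (as 0)
#    19: .    STOP

# Protocol 4 adds the FRAME opcode (8-byte framing) and compact opcodes
# such as SHORT_BINUNICODE and MEMOIZE.
```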
Adding @JoshRosen as well.

Test build #101615 has finished for PR 20691 at commit
This looks reasonable for Spark 3; is the comment at #20691 (comment) still pending?
It's okay. I roughly checked it and wanted someone to double-check. I guess it's okay to try and go ahead in Spark 3.
Test build #4545 has finished for PR 20691 at commit
OK. I checked it again and it looks good. BTW, @inpefess, #20691 (comment) should have been checked together with the PR proposal, strictly speaking, since PySpark has faced related issues before. Let's be clear when we add fixes to a core path next time. Nevertheless, thanks for your efforts to get this in. I was almost about to forget to do it in Spark 3.x. Merged to master.
BTW, cloudpickle 0.7.0 came out 9 days ago, and it looks like the next release will be a major version bump. It might be better to match it to 0.7.0. I am going to backport some important bug fixes into the 0.7.x branches at cloudpickle/cloudpickle.
Lastly, @inpefess, can you leave a comment on the JIRA? I cannot find your user ID to assign the JIRA to.

@HyukjinKwon, thanks for your guidance. Sorry for failing to double-check; I added a comment to the JIRA.
## What changes were proposed in this pull request?

In this PR we've done two things:

1) Updated Spark's copy of cloudpickle to 0.6.1 (the current stable release). The main reason Spark stayed with cloudpickle 0.4.x was that the default pickle protocol was changed in later versions.

2) Started using pickle.HIGHEST_PROTOCOL for both Python 2 and Python 3 for serializers and broadcast. [Pyrolite](https://github.com/irmen/Pyrolite) has the following pickle protocol support: reading: 0, 1, 2, 3, 4; writing: 2.

## How was this patch tested?

Jenkins tests.

Authors: Sloane Simmons, Boris Shminke

This contribution is original work of Sloane Simmons and Boris Shminke, and they licensed it to the project under the project's open source license.

Closes apache#20691 from inpefess/pickle_protocol_4.

Lead-authored-by: Boris Shminke <[email protected]>
Co-authored-by: singularperturbation <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Has anybody tested this new change in Python 2?

Yes, it was tested in Python 2, but it looks like cloudpickle had a critical regression/fix. I can file a JIRA related to that cloudpickle fix, and a fix with a test on the PySpark side.
## What changes were proposed in this pull request?

After upgrading cloudpickle to 0.6.1 at #20691, one regression was found. Cloudpickle had a critical fix for it, cloudpipe/cloudpickle#240. Basically, it currently looks like existing globals override the globals shipped with a function, meaning:

**Before:**

```python
>>> def hey():
...     return "Hi"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
>>> def hey():
...     return "Yeah"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
```

**After:**

```python
>>> def hey():
...     return "Hi"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
>>> def hey():
...     return "Yeah"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Yeah']
```

Therefore, this PR upgrades cloudpickle to 0.8.0. Note that cloudpickle's release cycle is quite short: between 0.6.1 and 0.7.0 it contains only minor bug fixes, and I don't see notable changes to double-check and/or avoid. There is virtually only this fix between 0.7.0 and 0.8.1; other fixes are about testing.

## How was this patch tested?

Manually tested, and tests were added. Verified that unit tests were added in cloudpickle.

Closes #23904 from HyukjinKwon/SPARK-27000.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…t protocol

## What changes were proposed in this pull request?

This PR partially reverts #20691.

After we changed the Python pickle protocol to the highest one, it seems to have introduced a correctness bug. This potentially affects all Python-related code paths. I suspect a bug related to Pyrolite (maybe the opcodes `MEMOIZE`, `FRAME`, and/or our `RowPickler`). I would like to stick to the default protocol for now and investigate the issue separately; I will investigate later to bring the highest protocol back.

## How was this patch tested?

A unit test was added.

```bash
./run-tests --python-executables=python3.7 --testname "pyspark.sql.tests.test_serde SerdeTests.test_int_array_serialization"
```

Closes #24519 from HyukjinKwon/SPARK-27612.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
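A rough sketch of the kind of round-trip that test exercises (hypothetical body; the actual test lives in `pyspark.sql.tests.test_serde`):

```python
import array

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("serde-check").getOrCreate()

# createDataFrame pickles rows on the Python side and Pyrolite unpickles
# them on the JVM side -- the code path suspected above.
df = spark.createDataFrame([(array.array("i", [1, 2, 3]),)], "arr: array<int>")
[row] = df.collect()
assert list(row.arr) == [1, 2, 3]

spark.stop()
```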