Skip to content

Conversation

@inpefess
Copy link

@inpefess inpefess commented Feb 28, 2018

What changes were proposed in this pull request?

In this PR we've done two things:

  1. updated the Spark's copy of cloudpickle to 0.6.1 (current stable)
    The main reason Spark stayed with cloudpickle 0.4.x was that the default pickle protocol was changed in later versions.

  2. started using pickle.HIGHEST_PROTOCOL for both Python 2 and Python 3 for serializers and broadcast
    Pyrolite has such Pickle protocol version support: reading: 0,1,2,3,4; writing: 2.

How was this patch tested?

Jenkins tests.

Authors: Sloane Simmons, Boris Shminke

This contribution is original work of Sloane Simmons and Boris Shminke and they licensed it to the project under the project's open source license.

@inpefess
Copy link
Author

@holdenk review needed

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jenkins, ok to test.

@hvanhovell
Copy link
Contributor

hmmm - jenkins seems not to be playing ball

@hvanhovell
Copy link
Contributor

ok to test

@hvanhovell
Copy link
Contributor

Have you tried serializing an array larger than 2GB? There is a pretty big chance that we do not support on the Spark side.

@kiszk
Copy link
Member

kiszk commented Feb 28, 2018

good point, it would be good to add test case for > 4GB object.

@hvanhovell
Copy link
Contributor

I am not sure that adding such a test is very good for test stability, but we could disable it by default.

@SparkQA
Copy link

SparkQA commented Feb 28, 2018

Test build #87777 has finished for PR 20691 at commit e08fcae.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@inpefess
Copy link
Author

Well, actually I just wanted to simply merge an older seemingly straightforward PR #15670 :) And @holdenk warned me that "it should just be fixing the merge conflicts".
So now I will fix this unit-tests failure and add a (disabled by default) test that @hvanhovell suggested.

@inpefess inpefess force-pushed the pickle_protocol_4 branch from e08fcae to e15eb63 Compare March 1, 2018 19:57
@SparkQA
Copy link

SparkQA commented Mar 1, 2018

Test build #87852 has finished for PR 20691 at commit e15eb63.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

ok to test

@HyukjinKwon
Copy link
Member

Let's give a shot for this in 3.0.0. Cloudpickle also changed its protocol a long ago from 2 to highest as well and looks it doesn't have notable regression so far.

@SparkQA
Copy link

SparkQA commented Jan 13, 2019

Test build #101136 has finished for PR 20691 at commit e15eb63.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Jan 14, 2019

Test build #101164 has finished for PR 20691 at commit e15eb63.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Looks the test failures related. In the current master, it all passes fine.

@inpefess
Copy link
Author

Looks the test failures related. In the current master, it all passes fine.

Yes, my bad. Will change cloudpickle as you advised above. Thanks.

@SparkQA
Copy link

SparkQA commented Jan 14, 2019

Test build #101205 has finished for PR 20691 at commit 85def5f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 15, 2019

Test build #101241 has finished for PR 20691 at commit 27d3a85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@inpefess
Copy link
Author

@HyukjinKwon can you review the changes please?

@HyukjinKwon
Copy link
Member

Just for clarification, @inpefess, this is kind of core fix that we all should put a lot of efforts to check each corner case. For instance, see when we initially upgraded Cloudpickle - #20373.

@inpefess
Copy link
Author

Just for clarification, @inpefess, this is kind of core fix that we all should put a lot of efforts to check each corner case. For instance, see when we initially upgraded Cloudpickle - #20373.

Yes, I understand that playing with pickle protocol versions can have disastrous consequences:)

@SparkQA
Copy link

SparkQA commented Jan 18, 2019

Test build #101407 has finished for PR 20691 at commit 654ed03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _DynamicModuleFuncGlobals(dict):

@inpefess
Copy link
Author

inpefess commented Jan 18, 2019

@inpefess, also do you mind if I ask to elaborate how highest protocol solves 4GB problem? I think it's good to leave a link related with that if there is in the PR description.

I decided to remove it since I picked this PR up nearly two years ago for solving that problem which eventually lost its importance for me. So now I have nothing to add.

Also, it would be great if leave the link that says Pyrolite supports to read all protocols.

Done.

@BryanCutler
Copy link
Member

+1 on doing this for Spark 3.0.0, on a quick glance the changes seem ok to me.

def dump(self, value, f):
try:
pickle.dump(value, f, 2)
pickle.dump(value, f, pickle.HIGHEST_PROTOCOL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind I ask about the context? why we always use protocol 2 previously?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change related to upgrading cloudpickle?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yea. this PR was previously setting the protocol to highest one to support 4gb+ pickle alone in the regular pickle (not including cloudpickle).

So I suggested to target upgrade Cloudpickle because upper Cloudpickle has that change to use highest protocol even though upgrading Cloudpickle is slightly orthogonal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it should be great if we know the context about why it was set 2 previously. I suspect there's no particular reason but should be good to double check and leave the reason if it's able to find.

The highest pickle protocol will be 2 in Python 2 and 4 in Python 3.4+. So, we're changing it from 2 to 4 when Python 3.4+.

One possibility is that it was set to 2 for the worry about writing and reading even in different Python versions but I don't think that's not guranteed in PySpark. Maybe we should explicitly note this somewhere as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It happened here: 6cf5076#diff-bb67501acde415576c589b478e16c60aR82
Since then it never changed.
I agree that there was no particular reason for that since pickle.HIGHEST_PROTOCOL in Python 2 versions is 2 for ages, not 3 or 4. Using pickle.HIGHEST_PROTOCOL consistently should be safe for that reason.

@SparkQA
Copy link

SparkQA commented Jan 19, 2019

Test build #101426 has finished for PR 20691 at commit b0df927.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Looks fine to me. I'm gonna take a look few times more. Would be great if other people take a look as well.

@HyukjinKwon
Copy link
Member

retest this please

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jan 24, 2019

@inpefess, mind if I ask to double check together? We should take a look at:

  1. https://github.com/irmen/Pyrolite/blob/55941dbf5d8e03381a5393a190062ca4447e21d0/java/src/main/java/net/razorvine/pickle/Unpickler.java#L289-L327 (4.13 that's we're currently using)
  2. https://www.python.org/dev/peps/pep-3154/
import pickle
import pickletools
print(pickletools.dis(pickle.dumps(obj, protocol=3)))

vs

import pickle
import pickletools
print(pickletools.dis(pickle.dumps(obj, protocol=4)))
  1. If we need, we should upgrade this library (if that's related with protocol 3 -> 4 change). https://github.com/irmen/Pyrolite/releases

@HyukjinKwon
Copy link
Member

Adding @JoshRosen as well.

@SparkQA
Copy link

SparkQA commented Jan 24, 2019

Test build #101615 has finished for PR 20691 at commit b0df927.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jan 31, 2019

This looks reasonable for Spark 3; is the comment at #20691 (comment) still pending?

@HyukjinKwon
Copy link
Member

It's okay. I roughly checked and wanted someone to double check. I guess it's okay to try and go ahead in Spark 3.

@SparkQA
Copy link

SparkQA commented Feb 2, 2019

Test build #4545 has finished for PR 20691 at commit b0df927.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Feb 2, 2019

OK. I checked it again and looks good. BTW, @inpefess, #20691 (comment) should have been checked together with the PR proposal strictly since PySpark faced some issues related with that before. Let's be clear when we add some fixes to core path next time. Nevertheless, thanks for your efforts to get this in. I was almost about to forget to do it in Spark 3.x.

Merged to master.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Feb 2, 2019

BTW, cloudpickle 0.7.0 came out 9 days ago, and looks the next release will be major version bump up. Might be better to match it to 0.7.0.

I am going to backport some important bug fixes into 0.7.x branches at cloudpickle/cloudpickle.

@HyukjinKwon
Copy link
Member

Lastly, @inpefess, can you leave a comment on the JIRA? I cannot find your user ID to assign the JIRA to

@asfgit asfgit closed this in 75ea89a Feb 2, 2019
@inpefess
Copy link
Author

inpefess commented Feb 2, 2019

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

In this PR we've done two things:
1) updated the Spark's copy of cloudpickle to 0.6.1 (current stable)
The main reason Spark stayed with cloudpickle 0.4.x was that the default pickle protocol was changed in later versions.

2) started using pickle.HIGHEST_PROTOCOL for both Python 2 and Python 3 for serializers and broadcast
[Pyrolite](https://github.com/irmen/Pyrolite) has such Pickle protocol version support: reading: 0,1,2,3,4; writing: 2.

## How was this patch tested?

Jenkins tests.

Authors: Sloane Simmons, Boris Shminke

This contribution is original work of Sloane Simmons and Boris Shminke and they licensed it to the project under the project's open source license.

Closes apache#20691 from inpefess/pickle_protocol_4.

Lead-authored-by: Boris Shminke <[email protected]>
Co-authored-by: singularperturbation <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@gatorsmile
Copy link
Member

Has anybody tested this new change in Python2?

@gatorsmile
Copy link
Member

cc @inpefess @HyukjinKwon @srowen

@HyukjinKwon
Copy link
Member

Yes, it was tested in Python 2 but looks cloudpickle had a critical regression/fix (https://github.com/cloudpipe/cloudpickle/pull/240).

I can file a JIRA related with that cloudpickle fix, and a fix with a test in PySpark side.

HyukjinKwon added a commit that referenced this pull request Feb 27, 2019
## What changes were proposed in this pull request?

After upgrading cloudpickle to 0.6.1 at #20691, one regression was found. Cloudpickle had a critical cloudpipe/cloudpickle#240 for that.

Basically, it currently looks existing globals would override globals shipped in a function's, meaning:

**Before:**

```python
>>> def hey():
...     return "Hi"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
>>> def hey():
...     return "Yeah"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
```

**After:**

```python
>>> def hey():
...     return "Hi"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
>>>
>>> def hey():
...     return "Yeah"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Yeah']
```

Therefore, this PR upgrades cloudpickle to 0.8.0.

Note that cloudpickle's release cycle is quite short.

Between 0.6.1 and 0.7.0, it contains minor bug fixes. I don't see notable changes to double check and/or avoid.

There is virtually only this fix between 0.7.0 and 0.8.1 - other fixes are about testing.

## How was this patch tested?

Manually tested, tests were added. Verified unit tests were added in cloudpickle.

Closes #23904 from HyukjinKwon/SPARK-27000.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request May 3, 2019
…t protocol

## What changes were proposed in this pull request?

This PR partially reverts #20691

After we changed the Python protocol to highest ones, seems like it introduced a correctness bug. This potentially affects all Python related code paths.

I suspect a bug related to Pryolite (maybe opcodes `MEMOIZE`, `FRAME` and/or our `RowPickler`). I would like to stick to default protocol for now and investigate the issue separately.

I will separately investigate later to bring highest protocol back.

## How was this patch tested?

Unittest was added.

```bash
./run-tests --python-executables=python3.7 --testname "pyspark.sql.tests.test_serde SerdeTests.test_int_array_serialization"
```

Closes #24519 from HyukjinKwon/SPARK-27612.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.