Conversation

@superbobry
Contributor

What changes were proposed in this pull request?

Prior to this PR, PySpark patched collections.namedtuple to make
namedtuple instances serializable even if the namedtuple class was
defined outside of globals(), e.g.

def do_something():
    Foo = namedtuple("Foo", ["foo"])
    sc.parallelize(range(1)).map(lambda _: Foo(42))

The patch changed the pickled representation of each namedtuple instance
to include the structure of the namedtuple class, and recreated the class
on every unpickling. This behaviour causes hard-to-diagnose failures both
in user code that uses namedtuples and in third-party libraries that rely
on them. See 1 and 2 for details.

The PR changes the default serializer to CloudPickleSerializer, which natively supports pickling namedtuples and does not require the aforementioned patch. To the best of my knowledge, this is not a breaking change.
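
For context, the reason the patch existed at all is that plain pickle serializes classes by reference, so it cannot serialize instances of a namedtuple class defined in a local scope. A minimal stdlib-only sketch of the failure (the names here are illustrative, not from PySpark):

```python
import pickle
from collections import namedtuple


def make_foo():
    # The class only exists in this function's local scope, so pickle
    # cannot look it up as a module-level attribute when serializing.
    Foo = namedtuple("Foo", ["foo"])
    return Foo(42)


obj = make_foo()
try:
    pickle.dumps(obj)
    pickling_failed = False
except (pickle.PicklingError, AttributeError):
    # Plain pickle raises PicklingError: the attribute lookup of "Foo"
    # on its defining module fails.
    pickling_failed = True

print(pickling_failed)  # True
```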

How was this patch tested?

PySpark test suite.

This is a followup of the discussion in apache#21157. See the PR and the
linked JIRA ticket for context and motivation.
@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Nov 12, 2018

Test build #98710 has finished for PR 23008 at commit 9a81879.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@superbobry
Contributor Author

Is there a benchmark suite for PySpark?

@HyukjinKwon
Member

Nope, it has to be done manually. It would be great to have one, FWIW.

I am not yet sure how we're going to measure the performance. I think you can show the performance diff for namedtuple for now - that will at the very least give us some numbers to compare.

@HyukjinKwon
Member

HyukjinKwon commented Nov 12, 2018

If the perf diff is big, let's discuss with other people the option of not changing the default but documenting that CloudPickleSerializer() can be used instead, to avoid a breaking change.

If the perf diff is rather trivial, let's check if we can keep this change. I will help check the perf in this case as well.

@HyukjinKwon
Member

BTW, let's test this end-to-end. For instance, spark.range(10000).rdd.map(lambda blabla).count()

@superbobry
Contributor Author

Interestingly, cloudpickle adds overhead even if the namedtuple is importable:

$ cat a.py 
from collections import namedtuple
A = namedtuple("A", ["foo", "bar"])
$ python -c "from a import A; import cloudpickle; print(len(cloudpickle.dumps(A(42, 24))))"
30
$ python -c "from a import A; import pickle; print(len(pickle.dumps(A(42, 24))))"
20

If the namedtuple is not importable, the size of the result explodes, because cloudpickle includes the full class definition, docstrings and all, in every pickled object:

>>> from collections import namedtuple
>>> A = namedtuple("A", ["foo", "bar"])
>>> import cloudpickle
>>> len(cloudpickle.dumps(A(42, 24)))
3836
>>> import pickle
>>> len(pickle.dumps(A(42, 24)))
27

Note that this is well over an order of magnitude more than what PySpark's current patch produces:

>>> import pyspark
>>> A = namedtuple("A", ["foo", "bar"])
>>> len(pickle.dumps(A(42, 24)))
79
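
The 79 bytes come from the patched representation: instead of a full class definition, the hack embeds only the type name, the field names, and the values, and rebuilds the class on every unpickling. A rough stdlib-only sketch of the idea (the helper names are illustrative, not PySpark's actual internals):

```python
import pickle
from collections import namedtuple


def _rebuild(name, fields, values):
    # Recreate the namedtuple class from its structure on every unpickling.
    cls = namedtuple(name, fields)
    return cls(*values)


def _hacked_reduce(self):
    # Pickle the class *structure* (name + fields) together with the values,
    # instead of a reference to an importable class.
    return (_rebuild, (type(self).__name__, list(self._fields), tuple(self)))


A = namedtuple("A", ["foo", "bar"])
A.__reduce__ = _hacked_reduce

restored = pickle.loads(pickle.dumps(A(42, 24)))
print(restored.foo, restored.bar)  # 42 24
print(type(restored) is A)         # False: a freshly recreated class
```

The last line shows the downside this PR is about: every unpickled instance belongs to a brand-new class, which breaks isinstance checks and confuses third-party libraries.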

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

Can we add a Spark configuration to control this?

@HyukjinKwon
Member

I mean a flag to switch the hack on and off.

@HyukjinKwon
Member

Hm, I can pick up your commit and open a PR as well. Let me take a look when I have some time too.

@SparkQA

SparkQA commented Jan 3, 2019

Test build #100664 has finished for PR 23008 at commit 9a81879.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@superbobry
Contributor Author

Can we add a Spark configuration to control this?

Sure, do you mean an option to choose between cloudpickle and plain pickle? I think pyspark.serializer does exactly that, right?

@HyukjinKwon
Member

Correct, but there are other deltas; for instance, the normal pickler uses a C implementation, which is faster in general, whereas cloudpickle is likely to be slower.

I was thinking of just adding one flag, because people are worried about the behaviour change. If we can explicitly switch the hack itself on and off, I think I can sign off, since the switch preserves the previous behaviour 100% as is.

@superbobry
Contributor Author

Oh, sorry, I missed that you propose to keep the hack but make it opt-in. I suspect that serializability of REPL-defined namedtuples affects only a small fraction of users. Therefore, removing the hack is an acceptable behaviour change (cc @holdenk). We could clearly document this in the 3.X migration document and potentially enable "cloudpickle" by default when PySpark is running in interactive mode.

Keeping the hack and adding a flag on top does not fix the problematic behavior and does not make the failures any easier to diagnose.

@HyukjinKwon
Member

HyukjinKwon commented Jan 4, 2019

Ah, I meant we keep the switch for 3.0.

If not a lot of users complain about the behaviour change, we can remove the hack completely in the release after 3.0.
If they do complain, we can tell them to switch the hack on in 3.0.

I think this is the most conservative approach.

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jan 4, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 4, 2020
@superbobry
Contributor Author

@HyukjinKwon I think you might still want to merge this eventually. Closing the PR will only make the issue harder to discover.

@github-actions github-actions bot closed this Jan 5, 2020
@casassg

casassg commented Jun 24, 2020

This is affecting Beam <> PySpark compatibility: https://issues.apache.org/jira/browse/SPARK-32079. Wondering if this can be reopened.

@superbobry
Contributor Author

I suspect it might be too late now that 3.X is out, but perhaps @HyukjinKwon could comment?

@casassg

casassg commented Jun 25, 2020

I agree. But maybe 3.1 or something like that. It's a bit difficult to debug as well.

@superbobry
Contributor Author

It's a bit difficult to debug as well.

I know :) It's a shame the PR was not merged in time for 3.0.
