[SPARK-22674][PYTHON] Removed the namedtuple pickling patch #21157
Conversation
Why don't we try to fix it rather than removing it? Does the test even pass?
ok to test
Test build #89865 has finished for PR 21157 at commit
Solid -1 on the complete removal if it breaks.
Let's think about other ways to fix this until 3.0.0. I think the complete removal is a last resort we could consider for 3.0.0.
I think the tests should pass, modulo the tests specifically checking the behaviour being removed. I think the failing RDD test is in this group as well.
I might be overly pessimistic, but I don't see how we can make the patch work in all cases without making the implementation more magical and, as a result, producing even more confusing error messages when things go wrong. Consider, for instance, a widespread pattern:

```python
class Foo(namedtuple("Foo", [])):
    def foo(self):
        return 42
```

If the outer class is not importable, the patch cannot faithfully recreate it on the workers. What can we do about this? We somehow need to serialize the full definition of the outer class. That said, I think an alternative to completely removing the patch might be deprecating it and advertising cloudpickle instead, e.g.

```python
>>> len(pickle.dumps(Foo()))
23
>>> len(cloudpickle.dumps(Foo()))
3538
```

or, even more extreme,

```python
>>> class A: pass
...
>>> len(cloudpickle.dumps(A()))
177
```

What do you think?
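The subclass concern can be demonstrated with the standard library alone. This is a hedged sketch, not Spark's actual code: `rebuild` below is a hypothetical stand-in for what the patch effectively did for non-importable classes, namely shipping only the namedtuple structure and rebuilding a plain namedtuple on unpickling, which silently drops any methods defined on the subclass:

```python
from collections import namedtuple

class Foo(namedtuple("Foo", ["x"])):
    def double(self):
        return self.x * 2

# Hypothetical stand-in for the patch's reconstruction step: only the
# namedtuple name, field names, and values survive the round trip.
def rebuild(name, fields, values):
    return namedtuple(name, fields)(*values)

restored = rebuild("Foo", ["x"], (21,))
print(restored.x)                   # -> 21: the data survives
print(hasattr(restored, "double"))  # -> False: the subclass method is gone
```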
I don't like the hack either, but the complete removal basically means we are going to drop namedtuple support in RDDs without, for example, any deprecation warnings. Spark is being super conservative, and this is going to break compatibility. So, I was thinking we could do this for Spark 3.0. We already started to talk about this. This should probably be something we discuss on the mailing list since it's a breaking change. One thing that is clear is that the complete removal should target 3.0.0, even if we are going ahead with it.
Test build #89878 has finished for PR 21157 at commit
Yes, we can backport some of the cloudpickle code to make the patch less fragile. This would be a nontrivial change to already complex code, but I'd be happy to sketch it if there's a consensus on the ML. Also, note that even without the patch it is possible to have an RDD of namedtuples, as long as the namedtuple classes are defined inside an importable module, i.e. NOT inside a function or the REPL.
Test build #89883 has finished for PR 21157 at commit
Yea, my point is that it breaks other code without any warning at all, in cases that were perfectly reasonable before. We already have a copy of cloudpickle. The best fix should be a deduplicated one, shouldn't it? I am still a solid -1 on the complete removal for Spark 2.x. We should find another way first for now. Removal is the last resort. I would consider the complete removal in Spark 3.x after sufficient discussion.
One improvement we can make is to change the patch to bypass namedtuples which are importable. This would resolve the issues with namedtuples coming from third-party libraries. I can open a new PR doing this, wdyt?
Please go ahead if there's another approach that avoids the removal but fixes it.
Agree, we should avoid removing test code.
Closing in favour of #21180.
This is a breaking change.

Prior to this commit PySpark patched ``collections.namedtuple`` to make namedtuple instances serializable even if the namedtuple class has been defined outside of ``globals()``, e.g.

```python
def do_something():
    Foo = namedtuple("Foo", ["foo"])
    sc.parallelize(range(1)).map(lambda _: Foo(42))
```

The patch changed the pickled representation of the namedtuple instance to include the structure of the namedtuple class, and to recreate the class on each unpickling. This behaviour causes hard-to-diagnose failures both in user code with namedtuples and in third-party libraries relying on them. See [1] and [2] for details.
[1]: https://superbobry.github.io/pyspark-silently-breaks-your-namedtuples.html
[2]: https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html
Reopened and rebased to be merged into the 3.X branch. See discussion in #21180.
Force-pushed from c67ce29 to 7f2ad87.
ok to test
Woah. Okay. Let me add some guys interested in this again (@felixcheung looks already here): @ueshin, @BryanCutler, @holdenk and @JoshRosen. Additionally @rxin too. Reynold, here's my understanding of what's going on: this is about the namedtuple hack we added a long, long while ago. This hack isn't super crucial now, since cloudpickle can handle namedtuples on its own without it. If we remove it, then for normal RDD operations the namedtuple has to be defined at an importable, module-level scope. If it is defined in a local scope or the REPL, it fails to pickle with the normal pickler (unlike cloudpickle, which the SQL code path uses).

@superbobry, wanna add some more words?
Importable namedtuples and their subclasses can still be used inside an RDD. Only namedtuples defined in the REPL would fail to pickle once this PR is merged.
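This distinction can be checked with the stdlib pickler alone; the names below are illustrative, not from the Spark codebase:

```python
import pickle
from collections import namedtuple

# Module-level ("importable") namedtuple: the stdlib pickler serializes
# the class by reference, so the round trip works.
Point = namedtuple("Point", ["x", "y"])
assert pickle.loads(pickle.dumps(Point(1, 2))) == Point(1, 2)

def make_local_point():
    # Defined inside a function (or the REPL): the class is not reachable
    # via import, so the stdlib pickler cannot serialize its instances.
    LocalPoint = namedtuple("LocalPoint", ["x", "y"])
    return LocalPoint(1, 2)

try:
    pickle.dumps(make_local_point())
except pickle.PicklingError as exc:
    print("stdlib pickle failed as expected:", exc)
```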
Test build #96706 has finished for PR 21157 at commit
Test build #96815 has finished for PR 21157 at commit
Is it possible to keep the current hack for things which can't be pickled, but remove it in the situation where the namedtuple is well behaved and could be pickled directly by cloudpickle? That way we don't have a functionality regression, but we also improve the handling of namedtuples more generally. Even so, it would probably be best to wait for 3.0, since this is a pretty core change for PySpark. Before you put in the work, though, let's see if that's the consensus approach (if possible).
@holdenk yes, this has been proposed in #21180, and later rejected in favour of this one. I would vote for complete removal of the hack (even though #21180 makes it much more usable), as …
Ok, it looks like it was @HyukjinKwon who suggested that we remove this hack in general rather than using the partial workaround. Can I get your thoughts on why? It seems like the partial workaround would give us the best of both worlds (e.g. we don't break people's existing Spark code and we handle Python namedtuples better).
Do you have the code for demonstrating the 2x speedup, @superbobry?
Nope, the job I was referring to is not open source; but I guess the speedup is easy to justify: much less payload and faster deserialization.
Makes sense. But if we only hijack the ones that need it, then wouldn't we get the speedup in the ones where we don't need the hijacking?
Yes, that is correct. That is why I think the hijacking behaviour should be removed: it silently slows down the job and does not tell the user that a trivial change, such as making the namedtuple importable, could result in a speedup.
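The payload difference is easy to illustrate with the stdlib alone. The `rebuild` function below is a hypothetical approximation of the old hack's by-value representation, not Spark's actual code:

```python
import pickle
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

# By reference: essentially just the module and class name plus the values.
by_reference = pickle.dumps(Point(1, 2))

# Hypothetical approximation of the old hack: ship the class structure
# (name + field names) along with the values and rebuild on unpickling.
def rebuild(name, fields, values):
    return namedtuple(name, fields)(*values)

by_value = pickle.dumps((rebuild, ("Point", ("x", "y"), (1, 2))))

print(len(by_reference), len(by_value))
assert len(by_value) > len(by_reference)
```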
But that would break both IPython notebooks and the REPL, right? That's a pretty significant breaking change.
I mean, we could warn if we are doing the hijacking and not break people's pipelines?
Yes, it will break IPython notebooks as well. I wonder how often people actually define namedtuples in a notebook? Emitting a warning is a less extreme option, yes.
Sorry for the late response. Yes, I spent some time taking a look at this namedtuple hack, and my impression was that we should not have added such a fix just to allow namedtuple pickling. At first I thought we shouldn't break compatibility, of course, but after taking a close look a few times, I started to support removing it. The namedtuple hack was introduced for both cloudpickle (the SQL path) and the normal pickle path, if I am not mistaken. Cloudpickle on the PySpark side now supports namedtuple pickling, so the workaround to allow the cases above should be to use CloudPickler when possible. I think the PySpark API exposes this pickler (see the …).
To keep the current behaviour without the workaround above (using CloudPickler), a weird fix is required (#21180) where some private methods have to be used. I also gave it a few quick tries, but it does not look easy to fix. It is a hack to remove, and it looks difficult to remove without any behaviour change, but a rough workaround (CloudPickler) still looks possible, so I am inclined to get rid of it in Spark 3.0 for now.
If removing the hack entirely is going to break namedtuples defined in the REPL, I'm a -1 on that change. While we certainly are freer to make breaking API changes in a major version release, we still have to think through the scope of the change we're pushing onto users, and that's pretty large.
Yes, but it might be OK for two reasons: people rarely define namedtuples in the REPL (hypothesis), and non-namedtuple classes do not work in the REPL even with the hack.
The workaround is to use CloudPickler, btw. Technically, we have many cases that the normal pickler does not support. This one specific case (namedtuple) was allowed by this weird hack for the normal pickler.
I think people do define namedtuples in notebooks, so I'm going to stick with -1.
Yea, so to avoid breaking things, we could change the default pickler to CloudPickler or document this workaround. @superbobry, can you check if the case can be preserved if we use CloudPickler instead?
You can just replace it with CloudPickler, remove the changes to the tests, and push that commit here to show that no case is broken.
And you can also run the profiler to show the performance effect. See #19246 (comment) for how to run the profiler.
Adding @gatorsmile and @cloud-fan as well, since this might be a potentially breaking change for the 3.0 release (it only affects RDD operations with namedtuples in certain cases, though).
@HyukjinKwon do you mean change the default serializer to cloudpickle and remove _hack_namedtuple? |
@holdenk I understand your point, but there are still things we could do without breaking existing code that relies on namedtuple serialization. Option 1: switch to cloudpickle, as suggested by @HyukjinKwon. Option 2: #21180. What would be your choice between the two?
I meant to use spark/python/pyspark/serializers.py line 583 (at a97001d) instead of spark/python/pyspark/serializers.py line 561 (at a97001d).

Yup
@HyukjinKwon done in #23008.
Closing this PR to continue the discussion in the new one. |
This is a followup of the discussion in apache#21157. See the PR and the linked JIRA ticket for context and motivation.
What changes were proposed in this pull request?
This is a breaking change.

Prior to this commit PySpark patched ``collections.namedtuple`` to make namedtuple instances serializable even if the namedtuple class has been defined outside of ``globals()`` (see the example in the PR description above). The patch changed the pickled representation of the namedtuple instance to include the structure of the namedtuple class, and to recreate the class on each unpickling. This behaviour causes hard-to-diagnose failures both in user code with namedtuples and in third-party libraries relying on them. See [1] and [2] for details.
How was this patch tested?
PySpark test suite.