[SPARK-22674][PYTHON] Removed the namedtuple pickling patch #21157
Conversation
Why don't we try to fix it rather than removing it? Does the test even pass?
ok to test
Test build #89865 has finished for PR 21157 at commit
Solid -1 on the complete removal if it breaks.
Let's think about other ways to fix this until 3.0.0. I think the complete removal is a last resort we could consider for 3.0.0.
I think the tests should pass, modulo the tests specifically checking the behaviour being removed. I think the failing RDD test is in this group as well.
I might be overly pessimistic, but I don't see how we can make the patch work in all cases without making the implementation more magical and, as a result, producing even more confusing error messages when things go wrong. Consider, for instance, a widespread pattern:

```python
class Foo(namedtuple("Foo", [])):
    def foo(self):
        return 42
```

If the outer class is not importable, the patch cannot faithfully recreate it on the workers. What can we do about this? We somehow need to serialize the full definition of the outer class. That said, I think an alternative to completely removing the patch might be deprecating it and advertising cloudpickle instead, e.g.

```python
>>> len(pickle.dumps(Foo()))
23
>>> len(cloudpickle.dumps(Foo()))
3538
```

or, even more extreme,

```python
>>> class A: pass
...
>>> len(cloudpickle.dumps(A()))
177
```

What do you think?
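The subclass concern can be demonstrated with the standard library alone. This is a hedged sketch, not Spark's actual code: `rebuild` below is a hypothetical stand-in for what the patch effectively did for non-importable classes, namely shipping only the namedtuple structure and rebuilding a plain namedtuple on unpickling, which silently drops any methods defined on the subclass:

```python
from collections import namedtuple

class Foo(namedtuple("Foo", ["x"])):
    def double(self):
        return self.x * 2

# Hypothetical stand-in for the patch's reconstruction step: only the
# namedtuple name, field names, and values survive the round trip.
def rebuild(name, fields, values):
    return namedtuple(name, fields)(*values)

restored = rebuild("Foo", ["x"], (21,))
print(restored.x)                   # -> 21: the data survives
print(hasattr(restored, "double"))  # -> False: the subclass method is gone
```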
I don't like the hack either, but the complete removal basically means we are going to drop namedtuple support in RDDs without, for example, any deprecation warnings. Spark is being super conservative, and this is going to break compatibility. So, I was thinking we could do this for Spark 3.0. We already started to talk about this. This should probably be something we discuss on the mailing list since it's a breaking change. One thing that is clear is that the complete removal should target 3.0.0, even if we are going ahead with it.
Test build #89878 has finished for PR 21157 at commit
Yes, we can backport some of the cloudpickle code to make the patch less fragile. This would be a nontrivial change to already complex code, but I'd be happy to sketch it if there's a consensus on the ML. Also, note that even without the patch it is possible to have an RDD of namedtuples, as long as the namedtuple classes are defined inside an importable module, i.e. NOT inside a function or the REPL.
Test build #89883 has finished for PR 21157 at commit
Yea, my point is that it breaks other code without any warning at all, in cases that were perfectly reasonable before. We already have a copy of cloudpickle. The best fix should be a deduplicated one, shouldn't it? I am still a solid -1 on the complete removal for Spark 2.x. We should find another way first for now. Removal is the last resort. I would consider the complete removal in Spark 3.x after sufficient discussion.
One improvement we can make is to change the patch to bypass namedtuples which are importable. This would resolve the issues with namedtuples coming from third-party libraries. I can open a new PR doing this, wdyt?
Please go ahead if there's another approach that avoids the removal but fixes it.
Agree, we should avoid removing test code.
Closing in favour of #21180.
This is a breaking change.

Prior to this commit PySpark patched ``collections.namedtuple`` to make namedtuple instances serializable even if the namedtuple class has been defined outside of ``globals()``, e.g.

```python
def do_something():
    Foo = namedtuple("Foo", ["foo"])
    sc.parallelize(range(1)).map(lambda _: Foo(42))
```

The patch changed the pickled representation of the namedtuple instance to include the structure of the namedtuple class, and to recreate the class on each unpickling. This behaviour causes hard-to-diagnose failures both in user code with namedtuples and in third-party libraries relying on them. See [1] and [2] for details.
[1]: https://superbobry.github.io/pyspark-silently-breaks-your-namedtuples.html
[2]: https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html
Reopened and rebased to be merged into the 3.X branch. See discussion in #21180.
Force-pushed from c67ce29 to 7f2ad87.
ok to test
Woah. Okay. Let me add some guys interested in this again (@felixcheung looks already here): @ueshin, @BryanCutler, @holdenk and @JoshRosen. Additionally @rxin too. Reynold, here's my understanding of what's going on: this is about the namedtuple hack we added a long, long while ago. This hack isn't super crucial now, since cloudpickle can handle namedtuples on its own without it. If we remove it, then for normal RDD operations the namedtuple has to be defined at an importable, module-level scope. If it is defined in a local scope or the REPL, it fails to pickle with the normal pickler (unlike cloudpickle, which the SQL code path uses).

@superbobry, wanna add some more words?
Importable namedtuples and their subclasses can still be used inside an RDD. Only namedtuples defined in the REPL would fail to pickle once this PR is merged.
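This distinction can be checked with the stdlib pickler alone; the names below are illustrative, not from the Spark codebase:

```python
import pickle
from collections import namedtuple

# Module-level ("importable") namedtuple: the stdlib pickler serializes
# the class by reference, so the round trip works.
Point = namedtuple("Point", ["x", "y"])
assert pickle.loads(pickle.dumps(Point(1, 2))) == Point(1, 2)

def make_local_point():
    # Defined inside a function (or the REPL): the class is not reachable
    # via import, so the stdlib pickler cannot serialize its instances.
    LocalPoint = namedtuple("LocalPoint", ["x", "y"])
    return LocalPoint(1, 2)

try:
    pickle.dumps(make_local_point())
except pickle.PicklingError as exc:
    print("stdlib pickle failed as expected:", exc)
```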
Test build #96706 has finished for PR 21157 at commit
Test build #96815 has finished for PR 21157 at commit
Is it possible to keep the current hack for things which can't be pickled, but remove it in the situation where the namedtuple is well behaved and could be pickled directly by cloudpickle? That way we don't have a functionality regression, but we also improve the handling of namedtuples more generally. Even so, it would probably be best to wait for 3.0, since this is a pretty core change for PySpark. Before you put in the work, though, let's see if that's the consensus approach (if possible).
@holdenk yes, this has been proposed in #21180, and later rejected in favour of this one. I would vote for complete removal of the hack (even though #21180 makes it much more usable), as …
Ok, it looks like it was @HyukjinKwon who suggested that we remove this hack in general rather than using the partial workaround. Can I get your thoughts on why? It seems like the partial workaround would give us the best of both worlds (e.g. we don't break people's existing Spark code and we handle Python namedtuples better).
Do you have the code for demonstrating the 2x speedup, @superbobry?
Nope, the job I was referring to is not open source; but I guess the speedup is easy to justify: much less payload and faster deserialization.
Makes sense. But if we only hijack the ones that need it, then wouldn't we get the speedup in the ones where we don't need the hijacking?
Yes, that is correct. That is why I think the hijacking behaviour should be removed: it silently slows down the job and does not tell the user that a trivial change, such as making the namedtuple importable, could result in a speedup.
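The payload difference is easy to illustrate with the stdlib alone. The `rebuild` function below is a hypothetical approximation of the old hack's by-value representation, not Spark's actual code:

```python
import pickle
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

# By reference: essentially just the module and class name plus the values.
by_reference = pickle.dumps(Point(1, 2))

# Hypothetical approximation of the old hack: ship the class structure
# (name + field names) along with the values and rebuild on unpickling.
def rebuild(name, fields, values):
    return namedtuple(name, fields)(*values)

by_value = pickle.dumps((rebuild, ("Point", ("x", "y"), (1, 2))))

print(len(by_reference), len(by_value))
assert len(by_value) > len(by_reference)
```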
But that would break both IPython notebooks and the REPL, right? That's a pretty significant breaking change.
I mean, we could warn if we are doing the hijacking and not break people's pipelines?
Yes, it will break IPython notebooks as well. I wonder how often people actually define namedtuples in a notebook? Emitting a warning is a less extreme option, yes.
Sorry for the late response. Yes, I spent some time taking a look at this namedtuple hack, and my impression was that we should not have added such a fix just to allow namedtuple pickling. At first I thought we shouldn't break compatibility, of course, but after taking a close look a few times, I started to support removing it. The namedtuple hack was introduced for both cloudpickle (the SQL path) and the normal pickle path, if I am not mistaken. Cloudpickle on the PySpark side now supports namedtuple pickling, so the workaround to allow the cases above should be to use CloudPickler when possible. I think the PySpark API exposes this pickler (see the …).
To keep the current behaviour without the workaround above (using CloudPickler), a weird fix is required (#21180) where some private methods have to be used. I also gave it a few quick tries, but it does not look easy to fix. It is a hack to remove, and it looks difficult to remove without any behaviour change, but a rough workaround (CloudPickler) still looks possible, so I am inclined to get rid of it in Spark 3.0 for now.
If removing the hack entirely is going to break namedtuples defined in the REPL, I'm a -1 on that change. While we certainly are freer to make breaking API changes in a major version release, we still have to think through the scope of the change we're pushing onto users, and that's pretty large.
Yes, but it might be OK for two reasons: people rarely define namedtuples in the REPL (hypothesis), and non-namedtuple classes do not work in the REPL even with the hack.
The workaround is to use CloudPickler, btw. Technically, we have many cases that the normal pickler does not support. This one specific case (namedtuple) was allowed by this weird hack for the normal pickler.
I think people do define namedtuples in notebooks, so I'm going to stick with -1.
Yea, so to avoid breaking things, we could change the default pickler to CloudPickler or document this workaround. @superbobry, can you check if the case can be preserved if we use CloudPickler instead?
You can just replace it with CloudPickler, remove the changes to the tests, and push that commit here to show that no case is broken.
And you can also run the profiler to show the performance effect. See #19246 (comment) for how to run the profiler.
Adding @gatorsmile and @cloud-fan as well, since this might be a potentially breaking change for the 3.0 release (it only affects RDD operations with namedtuples in certain cases, though).
@HyukjinKwon do you mean change the default serializer to cloudpickle and remove _hack_namedtuple? |
@holdenk I understand your point, but there are still things we could do without breaking existing code that relies on namedtuple serialization. Option 1: switch to cloudpickle, as suggested by @HyukjinKwon. Option 2: #21180. What would be your choice between the two?
I meant to use spark/python/pyspark/serializers.py line 583 (at a97001d) instead of spark/python/pyspark/serializers.py line 561 (at a97001d).

Yup
@HyukjinKwon done in #23008.
Closing this PR to continue the discussion in the new one. |
This is a followup of the discussion in apache#21157. See the PR and the linked JIRA ticket for context and motivation.
What changes were proposed in this pull request?
This is a breaking change.

Prior to this commit PySpark patched ``collections.namedtuple`` to make namedtuple instances serializable even if the namedtuple class has been defined outside of ``globals()`` (see the example in the PR description above). The patch changed the pickled representation of the namedtuple instance to include the structure of the namedtuple class, and to recreate the class on each unpickling. This behaviour causes hard-to-diagnose failures both in user code with namedtuples and in third-party libraries relying on them. See [1] and [2] for details.
How was this patch tested?
PySpark test suite.