[SPARK-22674][PYTHON] Removed the namedtuple pickling patch #23008
Conversation
This is a follow-up to the discussion in apache#21157. See that PR and the linked JIRA ticket for context and motivation.
|
ok to test |
|
Test build #98710 has finished for PR 23008 at commit
|
|
Is there a benchmark suite for PySpark? |
|
Nope, it has to be done manually; it would be great to have one, FWIW. I am not yet sure how we're going to measure the performance. I think you can show the performance diff for namedtuple for now - that will at the very least give us some numbers to compare. |
|
If the perf diff is big, let's discuss with other people the option of not making this change but documenting what we can use instead. If the perf diff is rather trivial, let's check whether we can keep this change. I will help to check the perf in this case as well. |
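A rough micro-benchmark sketch of the kind of comparison suggested above (illustrative only, not from the thread; assumes cloudpickle is installed, and the data shape and iteration counts are arbitrary):

```python
# Illustrative micro-benchmark (not from the PR): compare serialization
# round-trip time for namedtuple records with pickle vs cloudpickle.
import pickle
import timeit
from collections import namedtuple

import cloudpickle  # assumes cloudpickle is available

Row = namedtuple("Row", ["foo", "bar"])
rows = [Row(i, i * 2) for i in range(10_000)]

for name, dumps in [("pickle", pickle.dumps), ("cloudpickle", cloudpickle.dumps)]:
    # cloudpickle output is a standard pickle stream, so pickle.loads works for both
    elapsed = timeit.timeit(lambda: pickle.loads(dumps(rows)), number=100)
    print(f"{name:11s}: {elapsed:.3f}s for 100 round trips")
```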
|
BTW, let's test them end-to-end. For instance, |
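The concrete example that followed "for instance" is not preserved in this extract; as one possible illustration (my own sketch, not the original example), an end-to-end timing of a small local job whose records are namedtuples might look like this:

```python
# Illustrative end-to-end check only (my sketch, not the example from the
# thread): time a small local job whose records are namedtuple instances,
# so both closure and data serialization paths are exercised.
import time
from collections import namedtuple

from pyspark import SparkContext

Point = namedtuple("Point", ["x", "y"])

sc = SparkContext("local[2]", "namedtuple-e2e")
try:
    start = time.time()
    count = (sc.parallelize(range(100_000))
               .map(lambda i: Point(i, i * 2))
               .filter(lambda p: p.x % 2 == 0)
               .count())
    print(count, "records in", time.time() - start, "seconds")
finally:
    sc.stop()
```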
|
Interestingly,

```
$ cat a.py
from collections import namedtuple
A = namedtuple("A", ["foo", "bar"])
$ python -c "from a import A; import cloudpickle; print(len(cloudpickle.dumps(A(42, 24))))"
30
$ python -c "from a import A; import pickle; print(len(pickle.dumps(A(42, 24))))"
20
```

If the namedtuple is not importable, the size of the result explodes, because cloudpickle then has to serialize the class definition itself:

```
>>> from collections import namedtuple
>>> A = namedtuple("A", ["foo", "bar"])
>>> import cloudpickle
>>> len(cloudpickle.dumps(A(42, 24)))
3836
>>> import pickle
>>> len(pickle.dumps(A(42, 24)))
27
```

Note that the order of magnitude is incomparable to what PySpark does currently:

```
>>> import pyspark
>>> A = namedtuple("A", ["foo", "bar"])
>>> len(pickle.dumps(A(42, 24)))
79
```
|
|
ok to test |
|
Can we add a Spark configuration to control this? |
|
I mean switching the hack on and off. |
|
Hm, I can pick up your commit and open a PR as well. Let me take a look when I have some time, too. |
|
Test build #100664 has finished for PR 23008 at commit
|
Sure, do you mean an option to use cloudpickle or just pickle? I think |
|
Correct, but there are other deltas; for instance, the normal pickler uses a C implementation which is faster in general, whereas cloudpickle can be slower. I was thinking of just adding one flag because people are worried about the behaviour change. If we can explicitly switch the hack itself on and off, I think I can sign off, since the switch preserves the previous behaviour 100% as-is. |
|
Oh, sorry, I missed that you propose to keep the hack but make it opt-in. I suspect that serializability of REPL-defined namedtuples affects only a small fraction of users. Therefore, removing the hack is an acceptable behaviour change (cc @holdenk). We could clearly document this in the 3.X migration document and potentially enable "cloudpickle" by default when PySpark is running in interactive mode. Keeping the hack and adding a flag on top does not fix the problematic behavior and does not make the failures any easier to diagnose. |
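As an illustrative aside (not a change made in this PR): PySpark already exposes a serializer argument on SparkContext and ships pyspark.serializers.CloudPickleSerializer, so explicitly opting in to cloudpickle for data serialization could look like the sketch below; note this does not by itself remove the namedtuple patch:

```python
# Sketch only: explicitly selecting cloudpickle for data serialization.
# Uses pyspark.serializers.CloudPickleSerializer and the serializer
# argument of SparkContext; this does not, by itself, remove the
# namedtuple patch discussed in this PR.
from pyspark import SparkContext
from pyspark.serializers import CloudPickleSerializer

sc = SparkContext("local", "cloudpickle-demo", serializer=CloudPickleSerializer())
rdd = sc.parallelize([("a", 1), ("b", 2)]).mapValues(lambda v: v * 10)
print(rdd.collect())
sc.stop()
```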
|
Ah, I meant we keep the switch for 3.0. If not many users complain about the behaviour change, we could completely remove the hack in the release after 3.0. I think this is the most conservative approach. |
|
Can one of the admins verify this patch? |
|
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
|
@HyukjinKwon I think you might still want to merge this eventually. Closing the PR will only make the issue harder to discover. |
|
This is affecting Beam <> PySpark compatibility (https://issues.apache.org/jira/browse/SPARK-32079). Wondering if this can be reopened?
|
I suspect it might be too late now that 3.X is out, but perhaps @HyukjinKwon could comment? |
|
I agree. But maybe 3.1 or something like that. It's a bit difficult to debug as well. |
I know :) It's a shame the PR was not merged in time for 3.0. |
What changes were proposed in this pull request?
Prior to this PR, PySpark patched collections.namedtuple to make namedtuple instances serializable even if the namedtuple class has been defined outside of globals(), e.g. in a REPL session. The patch changed the pickled representation of a namedtuple instance to include the structure of the namedtuple class and to recreate the class on each unpickling. This behaviour causes hard-to-diagnose failures both in user code using namedtuples and in third-party libraries relying on them. See 1 and 2 for details.
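As a hedged illustration of the failure mode (a standalone reproduction of my own, not the elided example from the description): a namedtuple defined in __main__ pickles by reference, so a fresh interpreter that cannot resolve the class fails at unpickling time, which is analogous to what a Spark worker hits.

```python
# Illustration only (hypothetical reproduction, not taken from the PR):
# pickling a __main__-defined namedtuple succeeds locally but fails to
# unpickle in a fresh interpreter that cannot resolve __main__.Point,
# analogous to a Spark worker receiving the pickled record.
import pickle
import subprocess
import sys
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])   # defined in __main__, as in a REPL

payload = pickle.dumps(Point(1, 2))       # pickled by reference to __main__.Point

result = subprocess.run(
    [sys.executable, "-c", "import sys, pickle; pickle.loads(sys.stdin.buffer.read())"],
    input=payload,
    capture_output=True,
)
print(result.returncode)                                 # non-zero: unpickling failed
print(result.stderr.decode().strip().splitlines()[-1])   # AttributeError about 'Point'
```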
The PR changes the default serializer to CloudPickleSerializer, which natively supports pickling namedtuples and does not require the aforementioned patch. To the best of my knowledge, this is not a breaking change.
How was this patch tested?
PySpark test suite.