-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-22674][PYTHON] Disabled _hack_namedtuple for picklable namedtuples #21180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-22674][PYTHON] Disabled _hack_namedtuple for picklable namedtuples #21180
Conversation
…ples
Prior to this PR ``_hack_namedtuple`` was applied to all namedtuples,
regardless if they were defined on the top level of some module, and
are therefore picklable using the default ``__reduce__`` implementation,
or not.
The PR ensures that only the non-picklable namedtuples are hacked, i.e.
the ones defined in the REPL or locally in a function or method.
Note that the namedtuple might be defined locally but still be picklable
without the hack applied.
def define():
global Foo
Foo = namedtuple("Foo", [])
The current implementation do not cover such cases, and will apply
the hack anyway.
Sidenote: the PR also fixes the module name of the hacked namedtuples.
Due to an extra layer of indirection added by ``_hijack_namedtuple``,
all hacked namedtuples had "collections" as ``__module__``. This
behaviour is no longer the case.
SerializationTestCase and RDDTests.
|
ok to test |
|
Test build #89951 has finished for PR 21180 at commit
|
python/pyspark/serializers.py
Outdated
| return _hack_namedtuple(cls) | ||
|
|
||
| import sys | ||
| f = sys._getframe(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm .. https://docs.python.org/2/library/sys.html#sys._getframe
CPython implementation detail: This function should be used for internal and specialized purposes only. It is not guaranteed to exist in all implementations of Python.
Is it safe to use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. collections.nametuple has a fix for Jython and IronPython. I can backport it for completeness, but realistically, the probability of someone running PySpark on these implementations is not very high.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea but thing is, that the doc says this not guaranteed although most of Python implementations look having it - there's a risk here we should take (it could be broken in a specific implementation of Python although it sounds unlikely). Is there any other way to avoid this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way to fix the module name is to get rid of the extra stack frame. This can be done by e.g. modifying the bytecode of collections.namedtuple so that it redirected to _hack_namedtuple when needed. However, I think that the current implementation (despite being magical) is better than bytecode hacking since it is as magical as collections.namedtuple itself.
I can change the PR to do the same as collections.namedtuple for cross-interpreter compatibility, wdyt?
This provides the same level of cross-interpreter compatibility guarantees as the CPython implementation of ``namedtuple``.
|
Test build #89992 has finished for PR 21180 at commit
|
|
Hi @HyukjinKwon, could you kindly have a look at the new version? It is backwards compatible and has the same robustness guarantees as |
|
Hey @HyukjinKwon and @felixcheung, do you think the PR is good to be merged as-is, or would you like me to think further about how to make it more robust? |
|
@HyukjinKwon, @felixcheung did you have the time to look into this? If you disagree with the proposed implementation, I think we should at least document the current behaviour somewhere and include debugging tips for the users. |
|
Sorry, will take a look. Another release is close so I kind of happened to be busy. Particularly this case needs closer looks. I think you know the current approach doesn't look safe, right? We need to check other possibilities and see if this is the last resort to solve this issue. Usually it takes longer to review and check such cases. |
|
Sure, I'd be happy if we could find a better/safer solution. I don't quite see why the proposed one is unsafe. Could you elaborate, please? |
|
CPython-compatible implementations only and accessing a "private" method such as |
|
I meant a bit hacky. Sorry for confusion by what I meant by safe. |
|
I agree that in theory, the use of A possible alternative to |
|
Friendly ping @HyukjinKwon. |
|
ok to test |
|
Test build #91611 has finished for PR 21180 at commit
|
|
+1 for fixing this. A workaround (in this case for the example given on Jira) has been something like: class SparkTuple:
def __reduce__(self):
return self.__class__, tuple(self)
class PointSubclass(SparkTuple, Point):
def sum(self):
return self.x + self.yeffectively manually marking classes to un-hack but it would be great to avoid this if possible as the existing behavior is really unexpected. |
|
Hey @HyukjinKwon, apologies for yet another ping. Should I ask someone else to review the PR on the spark-dev mailing list? |
|
Sorry to bug you @HyukjinKwon, but I would really like for this patch to make it into the next PySpark release. Would you have time in the following weeks to have another look at this? |
|
@superbobry, terribly sorry that I have been late .. can we target 3.0.0 for this? I don't have enough nerve to push this in to fix this hack within 2.4.0 since the code freeze is super close and Spark's is being super conservative. |
|
retest this please |
|
Test build #94413 has finished for PR 21180 at commit
|
|
Thanks for getting back and no worries about the delay.
Do you mean that Spark 2.4 is the last release in 2.X branch? If yes, is there already a release date for 3.0.0? |
|
@HyukjinKwon do you think asking for feedback on the dev mailing list might help decide, whether it is OK to merge it in 2.X? I want to point out that the PR
|
|
That's an option too to listen to other people's thoughts. I wouldn't probably have some time to take a look for this one within 2.4.x timeline. |
|
@HyukjinKwon could you reference other PySpark maintainers who could do a review and share their thoughts? |
|
Usually @holdenk, @BryanCutler, @ueshin and @icexelloss (non committer). Can we just target to remove this weird existing namedtuple hack in 3.0.0, and document the workaround (making it global)? The fix itself looks not the clean as well anyway for now. Since 2.4.0 is being closed, we will probably avoid risky changes. |
|
The only workaround I know is to manually remove the
Could you kindly elaborate on the parts you'd like me to improve? A proper fix, I think, is to just remove the patch altogether, but this looks like a bigger change than making it work for more use-cases.
Will 2.4.0 be the last release in 2.X branch? If yes, then targeting 3.0.0 sounds reasonable, otherwise, I'd very much prefer to have it in the release following 2.4.0. |
|
Yea the next will likely be 3.0.0 and actually many fixes are being targeted to 3.0.0 (even one of mine). I thought this only affects globally defined named tuple, no? |
Awesome, I'm OK to wait until 3.0.0 then. Is there a release date already?
It does, but it is very likely that 99% of the namedtuples used in Python programs are defined globally. The ones defined in the PySpark shell would work as well, assuming they are proper namedtuples and not classes inheriting namedtuples. |
|
@superbobry, sorry for back and forth. branch-2.4 is cut out. Can we open another PR that removes the hack? Let's make sure to leave the potential impact, workaround and strong justification (your blog) in the PR description. I am still hesitant but good to get rid of it 3.0.0. From a cursory look so far, I also think we should eventually get rid of it. |
|
@HyukjinKwon sure, should I just open a new PR against master or some other branch? |
|
Master please |
|
@HyukjinKwon I've reopened the PR doing this and rebased it on the latest master, see #21157. |
What changes were proposed in this pull request?
Prior to this PR
_hack_namedtuplewas applied to all namedtuples,regardless if they were defined on the top level of some module, and
are therefore picklable using the default
__reduce__implementation,or not.
The PR ensures that only the non-picklable namedtuples are hacked, i.e.
the ones defined in the REPL or locally in a function or method.
Note that the namedtuple might be defined locally but still be picklable
without the hack applied.
The current implementation does not cover such cases and will apply
the hack anyway.
Sidenote: the PR also fixes the module name of the hacked namedtuples.
Due to an extra layer of indirection added by
_hijack_namedtuple,all hacked namedtuples had "collections" as
__module__. Thisbehaviour is no longer the case.
How was this patch tested?
SerializationTestCase and RDDTests.