[SPARK-22674][PYTHON] Disabled _hack_namedtuple for picklable namedtuples #21180

superbobry · 2018-04-27T14:45:13Z

What changes were proposed in this pull request?

Prior to this PR _hack_namedtuple was applied to all namedtuples,
regardless if they were defined on the top level of some module, and
are therefore picklable using the default __reduce__ implementation,
or not.

The PR ensures that only the non-picklable namedtuples are hacked, i.e.
the ones defined in the REPL or locally in a function or method.

Note that the namedtuple might be defined locally but still be picklable
without the hack applied.

def define():
    global Foo
    Foo = namedtuple("Foo", [])

The current implementation does not cover such cases and will apply
the hack anyway.

Sidenote: the PR also fixes the module name of the hacked namedtuples.
Due to an extra layer of indirection added by _hijack_namedtuple,
all hacked namedtuples had "collections" as __module__. This
behaviour is no longer the case.

How was this patch tested?

SerializationTestCase and RDDTests.

…ples Prior to this PR ``_hack_namedtuple`` was applied to all namedtuples, regardless if they were defined on the top level of some module, and are therefore picklable using the default ``__reduce__`` implementation, or not. The PR ensures that only the non-picklable namedtuples are hacked, i.e. the ones defined in the REPL or locally in a function or method. Note that the namedtuple might be defined locally but still be picklable without the hack applied. def define(): global Foo Foo = namedtuple("Foo", []) The current implementation do not cover such cases, and will apply the hack anyway. Sidenote: the PR also fixes the module name of the hacked namedtuples. Due to an extra layer of indirection added by ``_hijack_namedtuple``, all hacked namedtuples had "collections" as ``__module__``. This behaviour is no longer the case. SerializationTestCase and RDDTests.

felixcheung · 2018-04-28T06:01:21Z

ok to test

SparkQA · 2018-04-28T06:37:23Z

Test build #89951 has finished for PR 21180 at commit 37b0f6d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-04-28T13:40:51Z

python/pyspark/serializers.py

-        return _hack_namedtuple(cls)
+
+        import sys
+        f = sys._getframe(1)


Hm .. https://docs.python.org/2/library/sys.html#sys._getframe

CPython implementation detail: This function should be used for internal and specialized purposes only. It is not guaranteed to exist in all implementations of Python.

Is it safe to use it?

Good point. collections.nametuple has a fix for Jython and IronPython. I can backport it for completeness, but realistically, the probability of someone running PySpark on these implementations is not very high.

Yea but thing is, that the doc says this not guaranteed although most of Python implementations look having it - there's a risk here we should take (it could be broken in a specific implementation of Python although it sounds unlikely). Is there any other way to avoid this?

Another way to fix the module name is to get rid of the extra stack frame. This can be done by e.g. modifying the bytecode of collections.namedtuple so that it redirected to _hack_namedtuple when needed. However, I think that the current implementation (despite being magical) is better than bytecode hacking since it is as magical as collections.namedtuple itself.

I can change the PR to do the same as collections.namedtuple for cross-interpreter compatibility, wdyt?

This provides the same level of cross-interpreter compatibility guarantees as the CPython implementation of ``namedtuple``.

SparkQA · 2018-05-01T18:45:11Z

Test build #89992 has finished for PR 21180 at commit ec33738.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

superbobry · 2018-05-06T22:03:32Z

Hi @HyukjinKwon, could you kindly have a look at the new version? It is backwards compatible and has the same robustness guarantees as collections.namedtuple from CPython.

superbobry · 2018-05-14T20:14:49Z

Hey @HyukjinKwon and @felixcheung, do you think the PR is good to be merged as-is, or would you like me to think further about how to make it more robust?

superbobry · 2018-05-28T08:44:56Z

@HyukjinKwon, @felixcheung did you have the time to look into this? If you disagree with the proposed implementation, I think we should at least document the current behaviour somewhere and include debugging tips for the users.

HyukjinKwon · 2018-05-28T08:58:35Z

Sorry, will take a look. Another release is close so I kind of happened to be busy. Particularly this case needs closer looks. I think you know the current approach doesn't look safe, right? We need to check other possibilities and see if this is the last resort to solve this issue. Usually it takes longer to review and check such cases.

superbobry · 2018-05-28T09:00:47Z

Sure, I'd be happy if we could find a better/safer solution. I don't quite see why the proposed one is unsafe. Could you elaborate, please?

HyukjinKwon · 2018-05-28T09:03:08Z

CPython-compatible implementations only and accessing a "private" method such as _getframe and also this code bit is basically within a hack.

HyukjinKwon · 2018-05-28T09:04:39Z

I meant a bit hacky. Sorry for confusion by what I meant by safe.

superbobry · 2018-05-28T11:13:03Z

I agree that in theory, the use of _getframe is somewhat limiting, but I doubt it would be a major concern in practice. Python does not have a standard, so the best we can hope for, as far as portability goes, is for other interpreters to behave identically to CPython. In any case, on non-CPython compliant interpreters, the worst we'd get is the current behaviour (recreate the namedtuple class on executors).

A possible alternative to _getframe is to use traceback.extract_stack which seems to behave identically on CPython/PyPy/Jython/IronPython. It'd work just fine for the original purpose of the PR. We'd still need _getframe for fixing the module value, though since f_globals is not available in the output of extract_stack. Wdyt?

superbobry · 2018-06-06T19:44:03Z

Friendly ping @HyukjinKwon.

HyukjinKwon · 2018-06-09T09:03:20Z

ok to test

SparkQA · 2018-06-09T09:44:32Z

Test build #91611 has finished for PR 21180 at commit ec33738.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

a-johnston · 2018-07-02T17:59:01Z

+1 for fixing this. A workaround (in this case for the example given on Jira) has been something like:

class SparkTuple:
    def __reduce__(self):
        return self.__class__, tuple(self)

class PointSubclass(SparkTuple, Point):
    def sum(self):
        return self.x + self.y

effectively manually marking classes to un-hack but it would be great to avoid this if possible as the existing behavior is really unexpected.

superbobry · 2018-07-12T10:48:56Z

Hey @HyukjinKwon, apologies for yet another ping. Should I ask someone else to review the PR on the spark-dev mailing list?

superbobry · 2018-08-07T20:18:52Z

Sorry to bug you @HyukjinKwon, but I would really like for this patch to make it into the next PySpark release. Would you have time in the following weeks to have another look at this?

HyukjinKwon · 2018-08-08T08:21:15Z

@superbobry, terribly sorry that I have been late .. can we target 3.0.0 for this? I don't have enough nerve to push this in to fix this hack within 2.4.0 since the code freeze is super close and Spark's is being super conservative.

HyukjinKwon · 2018-08-08T08:21:30Z

retest this please

SparkQA · 2018-08-08T09:05:22Z

Test build #94413 has finished for PR 21180 at commit ec33738.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

superbobry · 2018-08-08T09:06:19Z

Thanks for getting back and no worries about the delay.

[...] can we target 3.0.0 for this? I don't have enough nerve to push this in to fix this hack within 2.4.0 since the code freeze is super close and Spark's is being super conservative.

Do you mean that Spark 2.4 is the last release in 2.X branch? If yes, is there already a release date for 3.0.0?

superbobry · 2018-08-14T12:53:11Z

@HyukjinKwon do you think asking for feedback on the dev mailing list might help decide, whether it is OK to merge it in 2.X?

I want to point out that the PR

does not break any of the currently working usages of namedtuple,
removes unhelpful error messages from code using namedtuple inheritance,
removes the serialization overhead for importable namedtuples.

HyukjinKwon · 2018-08-14T13:11:30Z

That's an option too to listen to other people's thoughts. I wouldn't probably have some time to take a look for this one within 2.4.x timeline.

superbobry · 2018-08-20T08:41:36Z

@HyukjinKwon could you reference other PySpark maintainers who could do a review and share their thoughts?

HyukjinKwon · 2018-08-20T08:50:27Z

Usually @holdenk, @BryanCutler, @ueshin and @icexelloss (non committer).

Can we just target to remove this weird existing namedtuple hack in 3.0.0, and document the workaround (making it global)? The fix itself looks not the clean as well anyway for now. Since 2.4.0 is being closed, we will probably avoid risky changes.

superbobry · 2018-08-20T09:44:24Z

The only workaround I know is to manually remove the __reduce__ method from all namedtuple classes in use. This has to be done on both the driver and the executors.

The fix itself looks not the clean as well anyway for now.

Could you kindly elaborate on the parts you'd like me to improve? A proper fix, I think, is to just remove the patch altogether, but this looks like a bigger change than making it work for more use-cases.

Since 2.4.0 is being closed, we will probably avoid risky changes.

Will 2.4.0 be the last release in 2.X branch? If yes, then targeting 3.0.0 sounds reasonable, otherwise, I'd very much prefer to have it in the release following 2.4.0.

HyukjinKwon · 2018-08-20T12:14:10Z

Yea the next will likely be 3.0.0 and actually many fixes are being targeted to 3.0.0 (even one of mine). I thought this only affects globally defined named tuple, no?

superbobry · 2018-08-20T21:01:16Z

Yea the next will likely be 3.0.0 and actually many fixes are being targeted to 3.0.0 (even one of mine).

Awesome, I'm OK to wait until 3.0.0 then. Is there a release date already?

I thought this only affects globally defined named tuple, no?

It does, but it is very likely that 99% of the namedtuples used in Python programs are defined globally. The ones defined in the PySpark shell would work as well, assuming they are proper namedtuples and not classes inheriting namedtuples.

HyukjinKwon · 2018-09-08T05:02:04Z

@superbobry, sorry for back and forth. branch-2.4 is cut out. Can we open another PR that removes the hack? Let's make sure to leave the potential impact, workaround and strong justification (your blog) in the PR description. I am still hesitant but good to get rid of it 3.0.0. From a cursory look so far, I also think we should eventually get rid of it.

superbobry · 2018-09-09T19:55:29Z

@HyukjinKwon sure, should I just open a new PR against master or some other branch?

HyukjinKwon · 2018-09-10T01:03:27Z

Master please

superbobry · 2018-09-27T16:28:13Z

@HyukjinKwon I've reopened the PR doing this and rebased it on the latest master, see #21157.

superbobry mentioned this pull request Apr 27, 2018

[SPARK-22674][PYTHON] Removed the namedtuple pickling patch #21157

Closed

HyukjinKwon reviewed Apr 28, 2018

View reviewed changes

Wrapped the sys._getframe call in a try-catch

ec33738

This provides the same level of cross-interpreter compatibility guarantees as the CPython implementation of ``namedtuple``.

superbobry closed this Sep 27, 2018

Willymontaz deleted the hijack-non-importable-namedtuple branch April 2, 2019 15:07

[SPARK-22674][PYTHON] Disabled _hack_namedtuple for picklable namedtuples #21180

[SPARK-22674][PYTHON] Disabled _hack_namedtuple for picklable namedtuples #21180

Uh oh!

Conversation

superbobry commented Apr 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

felixcheung commented Apr 28, 2018

Uh oh!

SparkQA commented Apr 28, 2018

Uh oh!

HyukjinKwon Apr 28, 2018

Choose a reason for hiding this comment

Uh oh!

superbobry Apr 28, 2018

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon Apr 29, 2018

Choose a reason for hiding this comment

Uh oh!

superbobry Apr 29, 2018

Choose a reason for hiding this comment

Uh oh!

SparkQA commented May 1, 2018

Uh oh!

superbobry commented May 6, 2018

Uh oh!

superbobry commented May 14, 2018

Uh oh!

superbobry commented May 28, 2018

Uh oh!

HyukjinKwon commented May 28, 2018

Uh oh!

superbobry commented May 28, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HyukjinKwon commented May 28, 2018

Uh oh!

HyukjinKwon commented May 28, 2018

Uh oh!

superbobry commented May 28, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

superbobry commented Jun 6, 2018

Uh oh!

HyukjinKwon commented Jun 9, 2018

Uh oh!

SparkQA commented Jun 9, 2018

Uh oh!

a-johnston commented Jul 2, 2018

Uh oh!

superbobry commented Jul 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

superbobry commented Aug 7, 2018

Uh oh!

HyukjinKwon commented Aug 8, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HyukjinKwon commented Aug 8, 2018

Uh oh!

SparkQA commented Aug 8, 2018

Uh oh!

superbobry commented Aug 8, 2018

Uh oh!

superbobry commented Aug 14, 2018

Uh oh!

HyukjinKwon commented Aug 14, 2018

Uh oh!

superbobry commented Aug 20, 2018

Uh oh!

HyukjinKwon commented Aug 20, 2018

Uh oh!

superbobry commented Aug 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HyukjinKwon commented Aug 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

superbobry commented Apr 27, 2018 •

edited

Loading

superbobry commented May 28, 2018 •

edited

Loading

superbobry commented May 28, 2018 •

edited

Loading

superbobry commented Jul 12, 2018 •

edited

Loading

HyukjinKwon commented Aug 8, 2018 •

edited

Loading

superbobry commented Aug 20, 2018 •

edited

Loading

HyukjinKwon commented Aug 20, 2018 •

edited

Loading