Skip to content

Conversation

@superbobry
Copy link
Contributor

@superbobry superbobry commented Apr 27, 2018

What changes were proposed in this pull request?

Prior to this PR _hack_namedtuple was applied to all namedtuples,
regardless if they were defined on the top level of some module, and
are therefore picklable using the default __reduce__ implementation,
or not.

The PR ensures that only the non-picklable namedtuples are hacked, i.e.
the ones defined in the REPL or locally in a function or method.

Note that the namedtuple might be defined locally but still be picklable
without the hack applied.

def define():
    global Foo
    Foo = namedtuple("Foo", [])

The current implementation does not cover such cases and will apply
the hack anyway.

Sidenote: the PR also fixes the module name of the hacked namedtuples.
Due to an extra layer of indirection added by _hijack_namedtuple,
all hacked namedtuples had "collections" as __module__. This
behaviour is no longer the case.

How was this patch tested?

SerializationTestCase and RDDTests.

…ples

Prior to this PR ``_hack_namedtuple`` was applied to all namedtuples,
regardless if they were defined on the top level of some module, and
are therefore picklable using the default ``__reduce__`` implementation,
or not.

The PR ensures that only the non-picklable namedtuples are hacked, i.e.
the ones defined in the REPL or locally in a function or method.

Note that the namedtuple might be defined locally but still be picklable
without the hack applied.

    def define():
        global Foo
        Foo = namedtuple("Foo", [])

The current implementation do not cover such cases, and will apply
the hack anyway.

Sidenote: the PR also fixes the module name of the hacked namedtuples.
Due to an extra layer of indirection added by ``_hijack_namedtuple``,
all hacked namedtuples had "collections" as ``__module__``. This
behaviour is no longer the case.

SerializationTestCase and RDDTests.
@felixcheung
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Apr 28, 2018

Test build #89951 has finished for PR 21180 at commit 37b0f6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

return _hack_namedtuple(cls)

import sys
f = sys._getframe(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm .. https://docs.python.org/2/library/sys.html#sys._getframe

CPython implementation detail: This function should be used for internal and specialized purposes only. It is not guaranteed to exist in all implementations of Python.

Is it safe to use it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. collections.nametuple has a fix for Jython and IronPython. I can backport it for completeness, but realistically, the probability of someone running PySpark on these implementations is not very high.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea but thing is, that the doc says this not guaranteed although most of Python implementations look having it - there's a risk here we should take (it could be broken in a specific implementation of Python although it sounds unlikely). Is there any other way to avoid this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to fix the module name is to get rid of the extra stack frame. This can be done by e.g. modifying the bytecode of collections.namedtuple so that it redirected to _hack_namedtuple when needed. However, I think that the current implementation (despite being magical) is better than bytecode hacking since it is as magical as collections.namedtuple itself.

I can change the PR to do the same as collections.namedtuple for cross-interpreter compatibility, wdyt?

This provides the same level of cross-interpreter compatibility
guarantees as the CPython implementation of ``namedtuple``.
@SparkQA
Copy link

SparkQA commented May 1, 2018

Test build #89992 has finished for PR 21180 at commit ec33738.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@superbobry
Copy link
Contributor Author

Hi @HyukjinKwon, could you kindly have a look at the new version? It is backwards compatible and has the same robustness guarantees as collections.namedtuple from CPython.

@superbobry
Copy link
Contributor Author

Hey @HyukjinKwon and @felixcheung, do you think the PR is good to be merged as-is, or would you like me to think further about how to make it more robust?

@superbobry
Copy link
Contributor Author

@HyukjinKwon, @felixcheung did you have the time to look into this? If you disagree with the proposed implementation, I think we should at least document the current behaviour somewhere and include debugging tips for the users.

@HyukjinKwon
Copy link
Member

Sorry, will take a look. Another release is close so I kind of happened to be busy. Particularly this case needs closer looks. I think you know the current approach doesn't look safe, right? We need to check other possibilities and see if this is the last resort to solve this issue. Usually it takes longer to review and check such cases.

@superbobry
Copy link
Contributor Author

superbobry commented May 28, 2018

Sure, I'd be happy if we could find a better/safer solution. I don't quite see why the proposed one is unsafe. Could you elaborate, please?

@HyukjinKwon
Copy link
Member

CPython-compatible implementations only and accessing a "private" method such as _getframe and also this code bit is basically within a hack.

@HyukjinKwon
Copy link
Member

I meant a bit hacky. Sorry for confusion by what I meant by safe.

@superbobry
Copy link
Contributor Author

superbobry commented May 28, 2018

I agree that in theory, the use of _getframe is somewhat limiting, but I doubt it would be a major concern in practice. Python does not have a standard, so the best we can hope for, as far as portability goes, is for other interpreters to behave identically to CPython. In any case, on non-CPython compliant interpreters, the worst we'd get is the current behaviour (recreate the namedtuple class on executors).

A possible alternative to _getframe is to use traceback.extract_stack which seems to behave identically on CPython/PyPy/Jython/IronPython. It'd work just fine for the original purpose of the PR. We'd still need _getframe for fixing the module value, though since f_globals is not available in the output of extract_stack. Wdyt?

@superbobry
Copy link
Contributor Author

Friendly ping @HyukjinKwon.

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Jun 9, 2018

Test build #91611 has finished for PR 21180 at commit ec33738.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@a-johnston
Copy link

+1 for fixing this. A workaround (in this case for the example given on Jira) has been something like:

class SparkTuple:
    def __reduce__(self):
        return self.__class__, tuple(self)

class PointSubclass(SparkTuple, Point):
    def sum(self):
        return self.x + self.y

effectively manually marking classes to un-hack but it would be great to avoid this if possible as the existing behavior is really unexpected.

@superbobry
Copy link
Contributor Author

superbobry commented Jul 12, 2018

Hey @HyukjinKwon, apologies for yet another ping. Should I ask someone else to review the PR on the spark-dev mailing list?

@superbobry
Copy link
Contributor Author

Sorry to bug you @HyukjinKwon, but I would really like for this patch to make it into the next PySpark release. Would you have time in the following weeks to have another look at this?

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Aug 8, 2018

@superbobry, terribly sorry that I have been late .. can we target 3.0.0 for this? I don't have enough nerve to push this in to fix this hack within 2.4.0 since the code freeze is super close and Spark's is being super conservative.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Aug 8, 2018

Test build #94413 has finished for PR 21180 at commit ec33738.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@superbobry
Copy link
Contributor Author

Thanks for getting back and no worries about the delay.

[...] can we target 3.0.0 for this? I don't have enough nerve to push this in to fix this hack within 2.4.0 since the code freeze is super close and Spark's is being super conservative.

Do you mean that Spark 2.4 is the last release in 2.X branch? If yes, is there already a release date for 3.0.0?

@superbobry
Copy link
Contributor Author

@HyukjinKwon do you think asking for feedback on the dev mailing list might help decide, whether it is OK to merge it in 2.X?

I want to point out that the PR

  • does not break any of the currently working usages of namedtuple,
  • removes unhelpful error messages from code using namedtuple inheritance,
  • removes the serialization overhead for importable namedtuples.

@HyukjinKwon
Copy link
Member

That's an option too to listen to other people's thoughts. I wouldn't probably have some time to take a look for this one within 2.4.x timeline.

@superbobry
Copy link
Contributor Author

@HyukjinKwon could you reference other PySpark maintainers who could do a review and share their thoughts?

@HyukjinKwon
Copy link
Member

Usually @holdenk, @BryanCutler, @ueshin and @icexelloss (non committer).

Can we just target to remove this weird existing namedtuple hack in 3.0.0, and document the workaround (making it global)? The fix itself looks not the clean as well anyway for now. Since 2.4.0 is being closed, we will probably avoid risky changes.

@superbobry
Copy link
Contributor Author

superbobry commented Aug 20, 2018

The only workaround I know is to manually remove the __reduce__ method from all namedtuple classes in use. This has to be done on both the driver and the executors.

The fix itself looks not the clean as well anyway for now.

Could you kindly elaborate on the parts you'd like me to improve? A proper fix, I think, is to just remove the patch altogether, but this looks like a bigger change than making it work for more use-cases.

Since 2.4.0 is being closed, we will probably avoid risky changes.

Will 2.4.0 be the last release in 2.X branch? If yes, then targeting 3.0.0 sounds reasonable, otherwise, I'd very much prefer to have it in the release following 2.4.0.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Aug 20, 2018

Yea the next will likely be 3.0.0 and actually many fixes are being targeted to 3.0.0 (even one of mine). I thought this only affects globally defined named tuple, no?

@superbobry
Copy link
Contributor Author

Yea the next will likely be 3.0.0 and actually many fixes are being targeted to 3.0.0 (even one of mine).

Awesome, I'm OK to wait until 3.0.0 then. Is there a release date already?

I thought this only affects globally defined named tuple, no?

It does, but it is very likely that 99% of the namedtuples used in Python programs are defined globally. The ones defined in the PySpark shell would work as well, assuming they are proper namedtuples and not classes inheriting namedtuples.

@HyukjinKwon
Copy link
Member

@superbobry, sorry for back and forth. branch-2.4 is cut out. Can we open another PR that removes the hack? Let's make sure to leave the potential impact, workaround and strong justification (your blog) in the PR description. I am still hesitant but good to get rid of it 3.0.0. From a cursory look so far, I also think we should eventually get rid of it.

@superbobry
Copy link
Contributor Author

@HyukjinKwon sure, should I just open a new PR against master or some other branch?

@HyukjinKwon
Copy link
Member

Master please

@superbobry
Copy link
Contributor Author

@HyukjinKwon I've reopened the PR doing this and rebased it on the latest master, see #21157.

@superbobry superbobry closed this Sep 27, 2018
@Willymontaz Willymontaz deleted the hijack-non-importable-namedtuple branch April 2, 2019 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants