Conversation

@superbobry
Contributor

What changes were proposed in this pull request?

Prior to this PR, PySpark patched collections.namedtuple to make
namedtuple instances serializable even if the namedtuple class was
defined outside of globals(), e.g.

def do_something():
    Foo = namedtuple("Foo", ["foo"])
    sc.parallelize(range(1)).map(lambda _: Foo(42))

The patch changed the pickled representation of each namedtuple instance
to include the structure of the namedtuple class, and recreated the class
on every unpickling. This behaviour causes hard-to-diagnose failures both
in user code that uses namedtuples and in third-party libraries that rely
on them. See 1 and 2 for details.

The PR changes the default serializer to CloudPickleSerializer, which natively supports pickling namedtuples and does not require the aforementioned patch. To the best of my knowledge, this is not a breaking change.
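
For context, the reason the patch existed at all is that plain pickle serializes classes by reference, so it cannot serialize instances of a namedtuple class defined in a local scope. A minimal stdlib-only sketch of the failure (the names here are illustrative, not from PySpark):

```python
import pickle
from collections import namedtuple


def make_foo():
    # The class only exists in this function's local scope, so pickle
    # cannot look it up as a module-level attribute when serializing.
    Foo = namedtuple("Foo", ["foo"])
    return Foo(42)


obj = make_foo()
try:
    pickle.dumps(obj)
    pickling_failed = False
except (pickle.PicklingError, AttributeError):
    # Plain pickle raises PicklingError: the attribute lookup of "Foo"
    # on its defining module fails.
    pickling_failed = True

print(pickling_failed)  # True
```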

How was this patch tested?

PySpark test suite.

This is a followup of the discussion in apache#21157. See the PR and the
linked JIRA ticket for context and motivation.
@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Nov 12, 2018

Test build #98710 has finished for PR 23008 at commit 9a81879.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@superbobry
Contributor Author

Is there a benchmark suite for PySpark?

@HyukjinKwon
Member

Nope, it has to be done manually. It would be great to have one, FWIW.

I am not yet sure how we're going to measure the performance. I think you can show the performance diff for namedtuple for now - that will at the very least give us some numbers to compare.

@HyukjinKwon
Member

HyukjinKwon commented Nov 12, 2018

If the perf diff is big, let's discuss with other people the option of not changing the default but documenting that CloudPickleSerializer() can be used instead, to avoid a breaking change.

If the perf diff is rather trivial, let's check if we can keep this change. I will help check the perf in this case as well.

@HyukjinKwon
Member

BTW, let's test this end-to-end. For instance, spark.range(10000).rdd.map(lambda blabla).count()

@superbobry
Contributor Author

Interestingly, cloudpickle adds overhead even if the namedtuple is importable:

$ cat a.py 
from collections import namedtuple
A = namedtuple("A", ["foo", "bar"])
$ python -c "from a import A; import cloudpickle; print(len(cloudpickle.dumps(A(42, 24))))"
30
$ python -c "from a import A; import pickle; print(len(pickle.dumps(A(42, 24))))"
20

If the namedtuple is not importable, the size of the result explodes, because cloudpickle includes the full class definition, docstrings and all, in every pickled object:

>>> from collections import namedtuple
>>> A = namedtuple("A", ["foo", "bar"])
>>> import cloudpickle
>>> len(cloudpickle.dumps(A(42, 24)))
3836
>>> import pickle
>>> len(pickle.dumps(A(42, 24)))
27

Note that this is well over an order of magnitude more than what PySpark's current patch produces:

>>> import pyspark
>>> A = namedtuple("A", ["foo", "bar"])
>>> len(pickle.dumps(A(42, 24)))
79
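
The 79 bytes come from the patched representation: instead of a full class definition, the hack embeds only the type name, the field names, and the values, and rebuilds the class on every unpickling. A rough stdlib-only sketch of the idea (the helper names are illustrative, not PySpark's actual internals):

```python
import pickle
from collections import namedtuple


def _rebuild(name, fields, values):
    # Recreate the namedtuple class from its structure on every unpickling.
    cls = namedtuple(name, fields)
    return cls(*values)


def _hacked_reduce(self):
    # Pickle the class *structure* (name + fields) together with the values,
    # instead of a reference to an importable class.
    return (_rebuild, (type(self).__name__, list(self._fields), tuple(self)))


A = namedtuple("A", ["foo", "bar"])
A.__reduce__ = _hacked_reduce

restored = pickle.loads(pickle.dumps(A(42, 24)))
print(restored.foo, restored.bar)  # 42 24
print(type(restored) is A)         # False: a freshly recreated class
```

The last line shows the downside this PR is about: every unpickled instance belongs to a brand-new class, which breaks isinstance checks and confuses third-party libraries.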

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

Can we add a Spark configuration to control this?

@HyukjinKwon
Member

I mean a flag to switch the hack on and off.

@HyukjinKwon
Member

Hm, I can pick up your commit and open a PR as well. Let me take a look when I have some time too.

@SparkQA

SparkQA commented Jan 3, 2019

Test build #100664 has finished for PR 23008 at commit 9a81879.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@superbobry
Contributor Author

Can we add a Spark configuration to control this?

Sure, do you mean an option to choose between cloudpickle and plain pickle? I think pyspark.serializer does exactly that, right?

@HyukjinKwon
Member

Correct, but there are other deltas; for instance, the normal pickler uses a C implementation, which is faster in general, whereas cloudpickle is likely to be slower.

I was thinking of just adding one flag, because people are worried about the behaviour change. If we can explicitly switch the hack itself on and off, I think I can sign off, since the switch preserves the previous behaviour 100% as is.

@superbobry
Contributor Author

Oh, sorry, I missed that you propose to keep the hack but make it opt-in. I suspect that serializability of REPL-defined namedtuples affects only a small fraction of users. Therefore, removing the hack is an acceptable behaviour change (cc @holdenk). We could clearly document this in the 3.X migration document and potentially enable "cloudpickle" by default when PySpark is running in interactive mode.

Keeping the hack and adding a flag on top does not fix the problematic behavior and does not make the failures any easier to diagnose.

@HyukjinKwon
Member

HyukjinKwon commented Jan 4, 2019

Ah, I meant we keep the switch for 3.0.

If not a lot of users complain about the behaviour change, we can remove the hack completely in the release after 3.0.
If they do complain, we can tell them to switch the hack on in 3.0.

I think this is the most conservative approach.

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jan 4, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 4, 2020
@superbobry
Contributor Author

@HyukjinKwon I think you might still want to merge this eventually. Closing the PR will only make the issue harder to discover.

@github-actions github-actions bot closed this Jan 5, 2020
@casassg

casassg commented Jun 24, 2020

This is affecting Beam <> PySpark compatibility: https://issues.apache.org/jira/browse/SPARK-32079. Wondering if this can be reopened.

@superbobry
Contributor Author

I suspect it might be too late now that 3.X is out, but perhaps @HyukjinKwon could comment?

@casassg

casassg commented Jun 25, 2020

I agree. But maybe 3.1 or something like that. It's a bit difficult to debug as well.

@superbobry
Contributor Author

It's a bit difficult to debug as well.

I know :) It's a shame the PR was not merged in time for 3.0.
