[WIP][SPARK-22674][PYTHON] Removed namedtuple hack and made cloudpickle the default serializer #29851
This is a followup of the discussion in apache#21157. See the PR and the linked JIRA ticket for context and motivation.
Test build #129027 has finished for PR 29851 at commit
Test build #129029 has finished for PR 29851 at commit
Hi, @HyukjinKwon - do you know what the next steps are for addressing this issue? Do you plan to work on it? If not, is there any chance you can help find an owner? Thanks!
I plan to remove this in the end. Investigating.
Thanks a lot. Some context: this monkeypatch makes it difficult for other libraries, like tensorflow-extended (tfx), to maintain interoperability with pyspark, and it requires rather inconvenient hacks to work around. We would much prefer to fix this in pyspark instead. Do we need help or input from other pyspark maintainers to move forward with the fix?
Is this fixed in Spark 3?
According to #23008 (comment), it's not fixed in 3.0.
@rcrowe-google @tvalentyn Edit: I didn't read carefully; I do see that it affects TF too. This was affecting PyTorch and was fixed on the PyTorch side yesterday (pytorch/pytorch#45870). Yes, it would be nice for it to be fixed on the PySpark side too. From what I understand, a potential fix in PySpark would require Python 3.8. @HyukjinKwon knows these details better.
Hello! I recently got bitten by the monkeypatching of namedtuple. Anything I can do to help move this along? :-)
Hey, yeah, we should remove this, since cloudpickle is now backed by the C pickle implementation (Python 3.8+), which is as fast as the regular pickle. I think we can switch to cloudpickle after removing the hack.
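For readers new to the thread, here is a minimal sketch of the underlying limitation, using only the standard library. It assumes (this is my reading of the discussion, not something stated in this PR) that the namedtuple hack exists because the standard pickle module stores classes by reference, so a namedtuple class that cannot be found by name at module level cannot be serialized. `can_pickle`, `make_local_point`, and `LocalPoint` are hypothetical names used only for illustration.

```python
import collections
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


def make_local_point():
    # Hypothetical example class. It is created inside a function, so it is
    # not reachable as a module-level attribute named "LocalPoint".
    LocalPoint = collections.namedtuple("LocalPoint", ["x", "y"])
    return LocalPoint(1, 2)


# pickle records a class as (module name, class name) and looks it up again
# by that name, so a plain tuple works but the local namedtuple does not.
print(can_pickle((1, 2)))              # True
print(can_pickle(make_local_point()))  # False
```

A serializer that pickles dynamically defined classes by value, as cloudpickle does, avoids this lookup, which is why switching the default serializer could make a global patch of `collections.namedtuple` unnecessary.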
What changes were proposed in this pull request?
TBD
Why are the changes needed?
TBD
Does this PR introduce any user-facing change?
TBD
How was this patch tested?
TBD