Skip to content
This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@mthrok
Copy link
Contributor

@mthrok mthrok commented Dec 16, 2020

This PR makes Regex, RegexTokenizer, Vocab, Vectors and SentencePiece
pickle-able on both PyBind11 and TorchScript.

closes #1085

Approach

  1. define _serialize_XXX and _deserialize_XXX next to these classes
    This is the replacement of _get_states_XXX and _set_states_XXX.
    I saw the names of the original functions were flipped, and used wrongly in
    __getstate__ and __setstate__ so I changed the function names to something
    more descriptive and less confusing.

  2. Use c10::intrusive_ptr as holder for custom class when using pybind11.
    This allows to use the same serialization/deserialization function for both
    PyBind11 and TorchScript.
    See https://pybind11.readthedocs.io/en/stable/advanced/smart_ptrs.html#smart-pointers
    for the detail of holder.

  3. For pickling TorchScript-bound SentencePiece, use byte Tensor as bytes container
    The serialized form of SentencePiece is byte string and returning std::string to
    Python realm causes decoding error as Python tries to decode it as UTF-8.
    PyBind11 can work around this with pybind11::bytes type, but TorchScript does not
    support byte string, this approach uses bytes Tensor as a container/intermediate format
    to pass byte string to Python.

Problem

Added I/O round trip tests

class PyBind11 TorchScript
Regex
RegexTokenizer
BasicEnglishNormalize
Vocab
Vectors
SentencePiece

@codecov
Copy link

codecov bot commented Dec 17, 2020

Codecov Report

Merging #1104 (d6249f8) into master (9562f80) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1104   +/-   ##
=======================================
  Coverage   77.54%   77.54%           
=======================================
  Files          45       45           
  Lines        3086     3086           
=======================================
  Hits         2393     2393           
  Misses        693      693           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9562f80...d6249f8. Read the comment docs.

@mthrok mthrok marked this pull request as ready for review December 17, 2020 03:31
@mthrok
Copy link
Contributor Author

mthrok commented Dec 17, 2020

I cannot find the usage of Regex class in Python realm. What should I do with that?

self.assertEqual(vectors_obj['not_in_it'], unk_tensor)

def test_vectors_load_and_save(self):
def test_vectors_update(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting the test for updating as I think it should be tested separately from serialization.

@mthrok
Copy link
Contributor Author

mthrok commented Dec 17, 2020

The issue with linux conda build has nothing to do with the changes in this PR

conda/conda-package-handling#71

@seemethere
Copy link
Member

The issue with linux conda build has nothing to do with the changes in this PR

conda/conda-package-handling#71

This should be resolved with #1106

This commit makes Regex, RegexTokenizer, Vocab, Vectors and SentencePiece
pickle-able on both PyBind11 and TorchScript.

The approach is

1. define `_serialize_XXX` and `_deserialize_XXX`
    This is the replacement of `_get_states_XXX` and `_set_states_XXX`.
    I saw the names of the original functions were flipped, and used wrongly in
    `__getstate__` and `__setstate__` so I changed the function names to something
    more descriptive.

2. Use `c10::intrusive_ptr` as holder for custom class when using pybind11.
    This allows to use the same serialization/deserialization function for both
    PyBind11 and TorchScript.
    See https://pybind11.readthedocs.io/en/stable/advanced/smart_ptrs.html#smart-pointers
    for the detail of holder.
@zhangguanheng66 zhangguanheng66 merged commit aa7176b into pytorch:master Dec 22, 2020
@mthrok mthrok deleted the sp-binding2 branch December 22, 2020 19:43
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants