Make Regex, RegexTokenizer, Vocab, Vectors, SentencePiece pickle-able #1104
Codecov Report
@@ Coverage Diff @@
## master #1104 +/- ##
=======================================
Coverage 77.54% 77.54%
=======================================
Files 45 45
Lines 3086 3086
=======================================
Hits 2393 2393
Misses 693 693
I cannot find the usage of
self.assertEqual(vectors_obj['not_in_it'], unk_tensor)
def test_vectors_load_and_save(self):
def test_vectors_update(self):
Splitting the test for updating as I think it should be tested separately from serialization.
The issue with the Linux conda build has nothing to do with the changes in this PR.
This should be resolved with #1106.
This commit makes Regex, RegexTokenizer, Vocab, Vectors and SentencePiece
pickle-able on both PyBind11 and TorchScript.
The approach is
1. Define `_serialize_XXX` and `_deserialize_XXX`.
These replace `_get_states_XXX` and `_set_states_XXX`.
The names of the original functions were flipped and used wrongly in
`__getstate__` and `__setstate__`, so I changed the function names to something
more descriptive.
2. Use `c10::intrusive_ptr` as the holder for custom classes when using pybind11.
This allows the same serialization/deserialization functions to be used for both
PyBind11 and TorchScript.
See https://pybind11.readthedocs.io/en/stable/advanced/smart_ptrs.html#smart-pointers
for details on holders.
This PR makes Regex, RegexTokenizer, Vocab, Vectors and SentencePiece
pickle-able on both PyBind11 and TorchScript.
closes #1085
Approach
1. Define `_serialize_XXX` and `_deserialize_XXX` next to these classes.
These replace `_get_states_XXX` and `_set_states_XXX`.
The names of the original functions were flipped and used wrongly in
`__getstate__` and `__setstate__`, so I changed the function names to something
more descriptive and less confusing.
2. Use `c10::intrusive_ptr` as the holder for custom classes when using pybind11.
This allows the same serialization/deserialization functions to be used for both
PyBind11 and TorchScript.
See https://pybind11.readthedocs.io/en/stable/advanced/smart_ptrs.html#smart-pointers
for details on holders.
3. For pickling TorchScript-bound `SentencePiece`, use a byte Tensor as the bytes container.
The serialized form of `SentencePiece` is a byte string, and returning `std::string` to the
Python realm causes a decoding error because Python tries to decode it as UTF-8.
PyBind11 can work around this with the `pybind11::bytes` type, but TorchScript does not
support byte strings, so this approach uses a byte Tensor as a container/intermediate format
to pass the byte string to Python.
Problem: TorchScript does not support `bytes`, thus SentencePiece bound via TorchScript is not pickle-able.
Added I/O round trip tests for Regex, RegexTokenizer, BasicEnglishNormalize, Vocab, Vectors and SentencePiece.
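The byte-container idea for `SentencePiece`, together with the kind of I/O round trip check mentioned above, can be sketched in plain Python. This is only an illustration under stated assumptions: `array("B")` stands in for a uint8 tensor, and `Model`, `bytes_to_container`, `container_to_bytes` and `is_utf8` are hypothetical names that do not appear in the PR.

```python
import pickle
from array import array


def is_utf8(data: bytes) -> bool:
    """Return True if the payload can be decoded as UTF-8 text."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def bytes_to_container(data: bytes) -> array:
    """Pack raw bytes into an unsigned-byte array (stand-in for a uint8 tensor)."""
    return array("B", data)


def container_to_bytes(container: array) -> bytes:
    """Recover the original byte string from the container."""
    return container.tobytes()


class Model:
    """Hypothetical object whose serialized form is an arbitrary binary blob."""

    def __init__(self, blob: bytes):
        self.blob = blob

    def __getstate__(self):
        # The state carries numeric bytes, never text, so no decoding is attempted.
        return (bytes_to_container(self.blob),)

    def __setstate__(self, state):
        self.blob = container_to_bytes(state[0])


def round_trip(obj):
    """Serialize and deserialize an object, as a round trip test would."""
    return pickle.loads(pickle.dumps(obj))
```

A payload such as `b"\x08\x96\x01\xff\xfe"` is not valid UTF-8, so treating it as a string fails, while `round_trip(Model(payload)).blob == payload` holds because the bytes travel as numbers.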