[Feature] Added capability to add special tokens in GPT2BPEEncoder and avoid splitting on them #1916
Conversation
torchtext/csrc/gpt2_bpe_tokenizer.h
Outdated
bool is_never_split_token);
int64_t GetBPEMergeRank_(std::string pair);
int64_t added_to_vocab_tokens_count;
// std::set<std::string> bpe_never_split_set_;
Remove commented-out code
Thanks, Joe! I removed the comment.
torchtext/transforms.py
Outdated
from copy import deepcopy
from functools import lru_cache
from typing import Any, List, Optional, Tuple, Union
from typing import Any, Dict, List, Optional, Tuple, Union
Use more general Mapping
done. Thank you!
torchtext/transforms.py
Outdated
| """ | ||
| return self.bpe.tokenize(text) | ||
|
|
||
| def add_special_tokens(self, special_tokens_dict: Dict[str, Union[str, List[str]]]) -> int: |
How did you decide on this implementation? As I remember from the post, the inspiration was primarily from HuggingFace's implementation: https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.SpecialTokensMixin.add_tokens
This implementation is very similar to HuggingFace's.
For HuggingFace:
- SpecialTokensMixin defines the special tokens and the structure enforced on the input: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L896
- PreTrainedTokenizer keeps a separate encoder map for special tokens that are added later: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils.py#L430 (it also creates a trie to save all those tokens that are not to be split: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils.py#L445)
- The added-tokens encoder map is checked first when encoding: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils.py#L586
- A regex-based approach is used to distinguish between non-special and special tokens: https://github.com/huggingface/transformers/blob/3f936df66287f557c6528912a9a68d7850913b9b/src/transformers/tokenization_utils.py#L511
Of course, HuggingFace's implementation is more generic and is extensible to all of their tokenizers through inheritance: GPT2Tokenizer inherits PreTrainedTokenizer, which inherits PreTrainedTokenizerBase, which implements two mixins, SpecialTokensMixin and PushToHubMixin. By comparison, torchtext's tokenizers all inherit nn.Module, each with its own (and fairly well-separated, imo) C++ implementation.
This pull request is a minimal implementation of their general Python approach, which is why I kept the Python API the same. A more general implementation is possible, since C++ supports multiple inheritance, but it would take a bit more time and require a broader discussion with your team.
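For context, a rough usage sketch of the API this PR adds. The standard keys are assumed to mirror HuggingFace's SPECIAL_TOKENS_ATTRIBUTES, and the file paths and token strings are placeholders, not values from this PR:

from torchtext.transforms import GPT2BPETokenizer

# Placeholder paths to the GPT-2 BPE assets.
tokenizer = GPT2BPETokenizer("encoder.json", "vocab.bpe", return_tokens=True)

# Standard tokens go under their named keys (assumed to follow HuggingFace's
# naming); anything else goes under "additional_special_tokens".
num_added = tokenizer.add_special_tokens(
    {
        "unk_token": "<|endoftext|>",
        "additional_special_tokens": ["<|user|>", "<|bot|>"],
    }
)
print(num_added)  # returns an int, per the signature in the diff above

# The registered tokens are kept intact instead of being split by the BPE regex.
print(tokenizer("<|user|> hello world"))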
Nayef211
left a comment
Thanks @reachsumit for taking the time to implement a highly requested feature to GPT2BPEEncoder. Let me know if my comments make sense. Happy to also have some offline discussions as needed before approving this PR! 😄
[](const c10::intrusive_ptr<GPT2BPEEncoder>& self,
   const std::unordered_map<std::string, std::string>& items,
   const std::vector<std::string>& additional) {
  c10::Dict<std::string, std::string> d;
  for (const auto& item : items)
    d.insert(item.first, item.second);
  return (self->AddSpecialTokens(d, additional));
})
I assume the reason you have this additional logic is because you're expecting a c10::Dict as inputs to AddSpecialTokens. Can we just get around this altogether by passing in std::unordered_map like we do in the constructor?
Ideally we want this file to be utilized only for pybind registration logic.
I originally tried using std::unordered_map, but I got the following compilation error (unsupported input type: std::unordered_map<Key, Value>. Please use Dict<Key, Value> instead.) when doing so.
/Users/sumitkumar/mambaforge/envs/bootcamp-rl/lib/python3.10/site-packages/torch/include/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:164:5: error: static_assert failed "You tried to register a kernel with an unsupported input type: std::unordered_map<Key, Value>. Please use Dict<Key, Value> instead."
static_assert(AllowDeprecatedTypes,
^ ~~~~~~~~~~~~~~~~~~~~
So I switched to c10::Dict instead. But it turned out that the Python dictionary doesn't directly convert to our custom c10::Dict map, and it gives me the following error if I try to do so.
TypeError: add_special_tokens(): incompatible function arguments. The following argument types are supported:
1. (self: torchtext._torchtext.GPT2BPEEncoder, arg0: c10::Dict<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long long>) -> int
Invoked with: <torchtext._torchtext.GPT2BPEEncoder object at 0x12428de70>, {}
Did you forget to `#include <pybind11/stl.h>`? Or <pybind11/complex.h>,
<pybind11/functional.h>, <pybind11/chrono.h>, etc. Some automatic
conversions are optional and require extra headers to be included
when compiling your pybind11 module.
So I chose to accept the Python dict as a C++ unordered_map, but convert it to a c10::Dict in the bindings to avoid the first error. This is similar to the logic for the constructor.
// - ELSE make token[-1] its own token and add to return list
// - ELSE IF prepend_space == True, prepend a space to the token and add to
// return list
// - ELSE, add token to return list
Can we add a few lines to the above comments that explain the additional logic needed to handle tokens contained in bpe_never_split_set_?
I added more comments in AddSpecialTokens to indicate the usage of this set.
  }
}

added_to_vocab_tokens_count += newly_added;
I might be missing something here but where exactly is this variable used?
I originally meant to use this variable to return the vocab size (original encoder size + added encoder size), similar to how the HuggingFace API provides it. But it appears that we don't really have any existing requirement for fetching the size, so I just removed it in the latest commit.
    index_matches.push_back(input.substr(it->position(), it->length()));
    last_idx = it->position() + it->length() + 1;
    if (isspace(input[last_idx])) {
      // rstrip
      last_idx++;
    }
  }
  if (last_idx < input.length() - 1)
    index_matches.push_back(
        input.substr(last_idx, input.length() - last_idx));
} else {
  index_matches.push_back(input);
}

for (std::string index_token : index_matches) {
  bool is_never_split_token =
      bpe_never_split_set_.find(index_token) != bpe_never_split_set_.end();
  if (is_never_split_token) {
    tokens.push_back(index_token);
    continue;
  }
  re2::StringPiece inp(index_token);
  while (kGPT2Regex.FindAndConsume(&inp, &token)) {
    if (is_whitespace(token)) {
      prepend_space = false;
      if (inp.empty()) { // token is last token
        tokens.push_back(token);
      } else {
        if (token.length() > 1) {
          tokens.push_back(token.substr(0, token.length() - 1));
        }
        if (token[token.length() - 1] == ' ') { // last char is space
          prepend_space = true;
        } else { // push last whitespace char as a token if it is not a space
          tokens.push_back(token.substr(token.length() - 1));
        }
      }
    } else if (prepend_space) {
      tokens.push_back(" " + token);
      prepend_space = false;
    } else {
      tokens.push_back(token);
    }
I'm having a bit of a hard time following all of the additional logic in this method. Having a more detailed explanation via code comments, plus adding a few lines to the commented pseudocode, would really help with readability for someone who is new to our codebase.
Sorry for the confusion. I agree that the logic is hard to understand just by looking at the code. I added a detailed explanation and an example of the steps in the latest commit. Let me know if that is sufficient or if any further details would help.
I think the explanation you added really helps with the code readability, thanks!
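For anyone reading this thread later, here is a rough Python sketch of the idea behind the new logic. It is a simplification, not the actual C++ implementation: the stand-in regex differs from kGPT2Regex, and the whitespace handling around special tokens (the rstrip/prepend_space logic in the snippet above) is handled differently in the C++ code.

import re

def pre_tokenize_with_special(text, never_split):
    # 1. Split the input into chunks so that every special token becomes its own chunk.
    pattern = "(" + "|".join(re.escape(t) for t in never_split) + ")"
    chunks = [c for c in re.split(pattern, text) if c]

    tokens = []
    for chunk in chunks:
        if chunk in never_split:
            # 2. Never-split chunks are emitted verbatim; the BPE regex never sees them.
            tokens.append(chunk)
        else:
            # 3. Everything else goes through the usual pre-tokenization pass
            #    (stand-in pattern here; the real one is kGPT2Regex).
            tokens.extend(re.findall(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+", chunk))
    return tokens

print(pre_tokenize_with_special("Hello <|endoftext|> world", {"<|endoftext|>"}))
# ['Hello', ' ', '<|endoftext|>', ' world']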
tokens.push_back(token.substr(0, token.length() - 1));
std::vector<std::string> index_matches;

/* Notes on handling Special Tokens:
This thread is also very helpful for understanding spacing in BPE tokenization: https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475
return self.bpe.add_special_tokens(
    {k: v for k, v in special_tokens_dict.items() if k != "additional_special_tokens"},
    special_tokens_dict.get("additional_special_tokens", []),
The reason for keeping standard special tokens separate from "additional" special tokens is to be able to easily provide access to those standard tokens later (like this: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L979).
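As a small illustration of why that separation is convenient (plain-Python sketch; the keys and token strings are placeholders, and the split mirrors the two arguments in the diff above):

special_tokens_dict = {
    "unk_token": "<|endoftext|>",
    "additional_special_tokens": ["<custom1>", "<custom2>"],
}

# Standard tokens remain addressable by role...
standard = {k: v for k, v in special_tokens_dict.items() if k != "additional_special_tokens"}
print(standard["unk_token"])   # -> <|endoftext|>

# ...while the additional ones are just an opaque list of never-split strings.
additional = special_tokens_dict.get("additional_special_tokens", [])
print(additional)              # -> ['<custom1>', '<custom2>']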
torchtext/transforms.py
Outdated
def __init__(self, encoder_json_path: str, vocab_bpe_path: str, return_tokens: bool = False) -> None:
    super().__init__()
    self._seperator = "\u0001"
    self.SPECIAL_TOKENS_ATTRIBUTES = [
These special tokens are fairly common across all tokenizers. Would it make sense to pull these out into commons or utils so that we can reuse them across multiple transforms? cc @Nayef211
I moved SPECIAL_TOKENS_ATTRIBUTES to utils in the latest commit.
@mthrok taught me about the YAGNI principle which I believe is quite applicable here. Let's not move these out into commons or utils until we see a concrete case where they can be reused. I also think it would make more sense for SPECIAL_TOKENS_ATTRIBUTES to be a class variable of GPT2BPETokenizer rather than an instance variable as these attributes would not be changing across instances. Lmk if this makes sense.
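For concreteness, a minimal sketch of that suggestion. The attribute list here is abbreviated and illustrative (the real list lives in transforms.py); only the constructor signature is taken from the diff above:

from torch.nn import Module

class GPT2BPETokenizer(Module):
    # Class variable: the set of recognized special-token attributes is the same
    # for every instance, so there is no need to rebuild it per instance.
    SPECIAL_TOKENS_ATTRIBUTES = [
        "bos_token",
        "eos_token",
        "unk_token",
        "additional_special_tokens",
    ]

    def __init__(self, encoder_json_path: str, vocab_bpe_path: str, return_tokens: bool = False) -> None:
        super().__init__()
        self._seperator = "\u0001"
        # self.SPECIAL_TOKENS_ATTRIBUTES still resolves to the class attribute,
        # so existing call sites keep working without a per-instance copy.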
Nayef211
left a comment
Thanks for addressing all the comments @reachsumit! The changes LGTM other than a few nits. I will merge this PR once those are resolved!
`add_special_tokens` API.
- form a regex pattern that helps in extracting special tokens from the
input text.
* Crate a vector that contains chunks of input text, such that each chunk is
nit: create
Force-pushed from 58b31ee to ad249fc
Thanks for your comments, @Nayef211! I addressed them in the latest commits and also rebased the change on the latest main branch. Let me know if there are any concerns.
Description
Add add_special_tokens feature to GPT2BPETokenizer
This change is targeted towards a requirement posted internally. It adds a new function add_special_tokens to GPT2BPETokenizer in order to enable the user to specify a dictionary of tokens that should not be changed during tokenization. Any newly specified token shall also be added to the vocabulary. The functionality is the same as HuggingFace's add_special_tokens feature.
Types of changes
- [x] New feature (non-breaking change which adds core functionality)
Changes made
- Updated the GPT2BPEEncoder class to support the add_special_tokens functionality. Also added the logic to ensure those special tokens are not split during tokenization.
- Added the add_special_tokens function to the Python interface (GPT2BPETokenizer class).
Testing
pre-commit