
Conversation

@reachsumit
Contributor

Description

Add add_special_tokens feature to GPT2BPETokenizer

This change is targeted towards a requirement posted internally. It adds a new function, add_special_tokens, to GPT2BPETokenizer to enable the user to specify a dictionary of tokens that should not be changed during tokenization. Any newly specified token is also added to the vocabulary.

The functionality is the same as HuggingFace's add_special_tokens feature.
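
For illustration, a minimal usage sketch (the file paths and token strings below are placeholders, and the standard keys mirror HuggingFace's special-token attributes):

# Illustrative sketch only; encoder/vocab paths below are placeholders.
from torchtext.transforms import GPT2BPETokenizer

tokenizer = GPT2BPETokenizer(
    encoder_json_path="path/to/gpt2_bpe_encoder.json",
    vocab_bpe_path="path/to/gpt2_bpe_vocab.bpe",
    return_tokens=True,
)

# Register tokens that must never be split; new tokens are also added to the vocabulary.
num_added = tokenizer.add_special_tokens(
    {
        "unk_token": "<|endoftext|>",
        "additional_special_tokens": ["<|user|>", "<|assistant|>"],
    }
)

# The registered tokens now come out as single, unsplit tokens.
print(tokenizer("hello <|user|> world"))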

Types of changes

[x] New feature (non-breaking change which adds core functionality)

Changes made

  • Made changes to the C++ code (GPT2BPEEncoder class) to support the add_special_tokens functionality. Also added logic to ensure those special tokens are not split during tokenization.
  • Updated torch bindings and py bindings as required.
  • Added an add_special_tokens function to the Python interface (GPT2BPETokenizer class).
  • Added unit tests to extensively test the tokenization process.

Testing

  • Verified tokenization output to be consistent with HuggingFace's tokenizer.
  • No issues identified by pre-commit checks.
  • No issues identified with any of the unit tests.

bool is_never_split_token);
int64_t GetBPEMergeRank_(std::string pair);
int64_t added_to_vocab_tokens_count;
// std::set<std::string> bpe_never_split_set_;
Member

Remove commented-out code

Contributor Author

Thanks, Joe! I removed the comment.

from copy import deepcopy
from functools import lru_cache
from typing import Any, List, Optional, Tuple, Union
from typing import Any, Dict, List, Optional, Tuple, Union
Member

Use more general Mapping

Contributor Author

Done. Thank you!

"""
return self.bpe.tokenize(text)

def add_special_tokens(self, special_tokens_dict: Dict[str, Union[str, List[str]]]) -> int:
Member

How did you decide on this implementation? As I remember from the post, the inspiration was primarily from HuggingFace's implementation: https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.SpecialTokensMixin.add_tokens

Contributor Author

This implementation is very similar to HuggingFace's.

For HuggingFace: the implementation is more generic and extensible to all of their tokenizers through inheritance. GPT2Tokenizer inherits from PreTrainedTokenizer, which inherits from PreTrainedTokenizerBase, which in turn implements two mixins: SpecialTokensMixin and PushToHubMixin. Comparatively, torchtext's tokenizers all inherit from nn.Module, each with its own (and fairly separate, imo) C++ implementation.

This pull request is a minimal implementation of their general Python approach, which is why I kept the Python API the same. A more general implementation is possible, as C++ supports multiple inheritance, but it would take a bit more time and require a broader discussion with your team.
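
Roughly, the two shapes being compared look like this (a simplified Python sketch; the real classes have many more members, and the torchtext constructor arguments are omitted):

# Simplified sketch of the HuggingFace hierarchy described above; bodies elided.
class SpecialTokensMixin: ...
class PushToHubMixin: ...
class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin): ...
class PreTrainedTokenizer(PreTrainedTokenizerBase): ...
class GPT2Tokenizer(PreTrainedTokenizer): ...

# torchtext, by contrast: each tokenizer is an nn.Module wrapping its own C++ encoder.
import torch.nn as nn

class GPT2BPETokenizer(nn.Module):  # stand-in sketch of torchtext's class
    def __init__(self) -> None:
        super().__init__()
        self.bpe = None  # holds the bound C++ GPT2BPEEncoder in the real class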

Contributor

@Nayef211 Nayef211 left a comment

Thanks @reachsumit for taking the time to implement a highly requested feature for GPT2BPEEncoder. Let me know if my comments make sense. Happy to also have some offline discussions as needed before approving this PR! 😄

Comment on lines +184 to +192
[](const c10::intrusive_ptr<GPT2BPEEncoder>& self,
const std::unordered_map<std::string, std::string>& items,
const std::vector<std::string>& additional) {
c10::Dict<std::string, std::string> d;
for (const auto& item : items)
d.insert(item.first, item.second);
return (self->AddSpecialTokens(d, additional));
})
Contributor

I assume the reason you have this additional logic is that you're expecting a c10::Dict as input to AddSpecialTokens. Can we get around this altogether by passing in std::unordered_map like we do in the constructor?

Ideally, we want this file to be used only for pybind registration logic.

Contributor Author

I originally tried using std::unordered_map, but I got the following compilation error ("unsupported input type: std::unordered_map<Key, Value>. Please use Dict<Key, Value> instead.") when doing so:

/Users/sumitkumar/mambaforge/envs/bootcamp-rl/lib/python3.10/site-packages/torch/include/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:164:5: error: static_assert failed "You tried to register a kernel with an unsupported input type: std::unordered_map<Key, Value>. Please use Dict<Key, Value> instead."
    static_assert(AllowDeprecatedTypes,
    ^             ~~~~~~~~~~~~~~~~~~~~

So I switched to c10::Dict instead. But it turned out that the Python dictionary doesn't directly convert to our custom c10::Dict map, and it gives the following error if I try to do so:

TypeError: add_special_tokens(): incompatible function arguments. The following argument types are supported:
    1. (self: torchtext._torchtext.GPT2BPEEncoder, arg0: c10::Dict<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long long>) -> int

Invoked with: <torchtext._torchtext.GPT2BPEEncoder object at 0x12428de70>, {}

Did you forget to `#include <pybind11/stl.h>`? Or <pybind11/complex.h>,
<pybind11/functional.h>, <pybind11/chrono.h>, etc. Some automatic
conversions are optional and require extra headers to be included
when compiling your pybind11 module.

So I chose to pass the Python dict as a C++ std::unordered_map, and convert it to c10::Dict in the bindings to avoid the first error. This is similar to the logic for the constructor.

// - ELSE make token[-1] its own token and add to return list
// - ELSE IF prepend_space == True, prepend a space to the token and add to
// return list
// - ELSE, add token to return list
Contributor

Can we add a few lines to the above comments that explain the additional logic needed to handle tokens contained in bpe_never_split_set_?

Contributor Author

I added more comments in AddSpecialTokens to indicate the usage of this set.

}
}

added_to_vocab_tokens_count += newly_added;
Contributor

I might be missing something here, but where exactly is this variable used?

Contributor Author

I originally meant to use this variable to return the vocab size (original encoder size + added encoder size), similar to how the HuggingFace API provides it. But it appears that we don't really have any existing requirement for fetching the size, so I just removed it in the latest commit.

Comment on lines 111 to 183
index_matches.push_back(input.substr(it->position(), it->length()));
last_idx = it->position() + it->length() + 1;
if (isspace(input[last_idx])) {
// rstrip
last_idx++;
}
}
if (last_idx < input.length() - 1)
index_matches.push_back(
input.substr(last_idx, input.length() - last_idx));
} else {
index_matches.push_back(input);
}

for (std::string index_token : index_matches) {
bool is_never_split_token =
bpe_never_split_set_.find(index_token) != bpe_never_split_set_.end();
if (is_never_split_token) {
tokens.push_back(index_token);
continue;
}
re2::StringPiece inp(index_token);
while (kGPT2Regex.FindAndConsume(&inp, &token)) {
if (is_whitespace(token)) {
prepend_space = false;
if (inp.empty()) { // token is last token
tokens.push_back(token);
} else {
if (token.length() > 1) {
tokens.push_back(token.substr(0, token.length() - 1));
}
if (token[token.length() - 1] == ' ') { // last char is space
prepend_space = true;
} else { // push last whitespace char as a token if it is not a space
tokens.push_back(token.substr(token.length() - 1));
}
}
} else if (prepend_space) {
tokens.push_back(" " + token);
prepend_space = false;
} else {
tokens.push_back(token);
}
Contributor

I'm having a bit of a hard time following all of the additional logic in this method. A more detailed explanation via code comments, plus a few added lines to the commented pseudocode, would really help with readability for someone who is new to our codebase.

Contributor Author

Sorry for the confusion. I agree that the logic is hard to understand just by looking at the code. I added a detailed explanation and an example of the steps in the latest commit. Let me know if that is sufficient or if any further details would help.
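
At a high level, the added flow is roughly the following (a simplified Python sketch for illustration only; the regex stand-in and helper names are not from the actual C++ code, and it glosses over the space-stripping/prepending details around special tokens):

import re
from typing import List, Set

def _rough_gpt2_split(chunk: str) -> List[str]:
    # Stand-in for the real GPT-2 regex pre-tokenization; a crude whitespace-aware
    # split is enough to illustrate the control flow.
    return re.findall(r"\s+|\S+", chunk)

def pre_tokenize(text: str, never_split: Set[str]) -> List[str]:
    # Registered special tokens pass through untouched; only the text between
    # them goes through normal splitting.
    if never_split:
        # Longest tokens first so overlapping special tokens match greedily.
        pattern = "|".join(re.escape(t) for t in sorted(never_split, key=len, reverse=True))
        chunks = [c for c in re.split(f"({pattern})", text) if c]
    else:
        chunks = [text]

    tokens: List[str] = []
    for chunk in chunks:
        if chunk in never_split:
            tokens.append(chunk)  # never split; also added to the vocabulary earlier
        else:
            tokens.extend(_rough_gpt2_split(chunk))
    return tokens

# pre_tokenize("hello <|user|> world", {"<|user|>"})
# -> ['hello', ' ', '<|user|>', ' ', 'world']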

Contributor

I think the explanation you added really helps with the code readability, thanks!

tokens.push_back(token.substr(0, token.length() - 1));
std::vector<std::string> index_matches;

/* Notes on handling Special Tokens:
Contributor Author

This thread is also very helpful for understanding spacing in BPE tokenization: https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475


return self.bpe.add_special_tokens(
{k: v for k, v in special_tokens_dict.items() if k != "additional_special_tokens"},
special_tokens_dict.get("additional_special_tokens", []),
Contributor Author

The reason for keeping the standard special tokens separate from the "additional" special tokens is to be able to easily provide access to those standard tokens later (like this: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L979).
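
For example (the token values here are just illustrative), a dictionary like the following gets forwarded to the C++ side as two arguments, per the snippet above:

special_tokens_dict = {
    "unk_token": "<|endoftext|>",
    "additional_special_tokens": ["<|user|>", "<|assistant|>"],
}

standard = {k: v for k, v in special_tokens_dict.items() if k != "additional_special_tokens"}
additional = special_tokens_dict.get("additional_special_tokens", [])
# standard   == {"unk_token": "<|endoftext|>"}
# additional == ["<|user|>", "<|assistant|>"]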

def __init__(self, encoder_json_path: str, vocab_bpe_path: str, return_tokens: bool = False) -> None:
super().__init__()
self._seperator = "\u0001"
self.SPECIAL_TOKENS_ATTRIBUTES = [
Member

These special tokens are fairly common across all tokenizers. Would it make sense to pull these out into commons or utils so that we can reuse them across multiple transforms? cc @Nayef211

Contributor Author

I moved SPECIAL_TOKENS_ATTRIBUTES to utils in the latest commit.

Contributor

@mthrok taught me about the YAGNI principle, which I believe is quite applicable here. Let's not move these out into commons or utils until we see a concrete case where they can be reused. I also think it would make more sense for SPECIAL_TOKENS_ATTRIBUTES to be a class variable of GPT2BPETokenizer rather than an instance variable, as these attributes would not change across instances. Lmk if this makes sense.
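
Roughly what I have in mind (the attribute names below are only illustrative, following HuggingFace's conventions; the rest of the constructor is elided):

import torch.nn as nn

class GPT2BPETokenizer(nn.Module):
    # Class variable: defined once and shared by all instances instead of being
    # rebuilt in every __init__. Attribute names here are illustrative.
    SPECIAL_TOKENS_ATTRIBUTES = [
        "bos_token",
        "eos_token",
        "unk_token",
        "pad_token",
        "mask_token",
        "additional_special_tokens",
    ]

    def __init__(self, encoder_json_path: str, vocab_bpe_path: str, return_tokens: bool = False) -> None:
        super().__init__()
        self._seperator = "\u0001"
        # ...rest of the existing constructor...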

Contributor

@Nayef211 Nayef211 left a comment

Thanks for addressing all the comments, @reachsumit! The changes LGTM other than a few nits. I will merge this PR once those are resolved!

`add_special_tokens` API.
- form a regex pattern that helps in extracting special tokens from the
input text.
* Crate a vector that contains chunks of input text, such that each chunk is
Contributor

nit: create

@reachsumit force-pushed the add_special_tokens_feature branch from 58b31ee to ad249fc on October 4, 2022 06:11
@reachsumit
Contributor Author

Thanks for your comments, @Nayef211! I addressed them in the latest commits and also rebased the change on the latest main branch. Let me know in case of any concerns.

@Nayef211 Nayef211 merged commit 3f9c349 into pytorch:main Oct 4, 2022