[Feature] Added capability to add special tokens in GPT2BPEEncoder and avoid splitting on them #1916
Conversation
torchtext/csrc/gpt2_bpe_tokenizer.h
Outdated
bool is_never_split_token);
int64_t GetBPEMergeRank_(std::string pair);
int64_t added_to_vocab_tokens_count;
// std::set<std::string> bpe_never_split_set_;
Remove commented-out code
Thanks, Joe! I removed the comment.
torchtext/transforms.py
Outdated
from copy import deepcopy
from functools import lru_cache
from typing import Any, List, Optional, Tuple, Union
from typing import Any, Dict, List, Optional, Tuple, Union
Use more general Mapping
done. Thank you!
torchtext/transforms.py
Outdated
| """ | ||
| return self.bpe.tokenize(text) | ||
|
|
||
| def add_special_tokens(self, special_tokens_dict: Dict[str, Union[str, List[str]]]) -> int: |
How did you decide on this implementation? As I remember from the post, the inspiration was primarily from HuggingFace's implementation: https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.SpecialTokensMixin.add_tokens
This implementation is very similar to HuggingFace's.
For HuggingFace:
- SpecialTokensMixin defines the special tokens and the structure enforced on the input: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L896
- PreTrainedTokenizer keeps a separate encoder map for special tokens that are added later: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils.py#L430 (it also creates a trie to save all those tokens that are not to be split: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils.py#L445)
- The added-tokens encoder map is checked first when encoding: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils.py#L586
- A regex-based approach is used to distinguish between non-special and special tokens: https://github.com/huggingface/transformers/blob/3f936df66287f557c6528912a9a68d7850913b9b/src/transformers/tokenization_utils.py#L511
Of course, HuggingFace's implementation is more generic and is extensible to all of their tokenizers through inheritance: GPT2Tokenizer inherits PreTrainedTokenizer, which inherits PreTrainedTokenizerBase, which implements two mixins, SpecialTokensMixin and PushToHubMixin. By comparison, torchtext's tokenizers all inherit nn.Module, each with its own (and fairly well-separated, imo) C++ implementation.
This pull request is a minimal implementation of their general Python approach, which is why I kept the Python API the same. A more general implementation is possible, since C++ supports multiple inheritance, but it would take a bit more time and require a broader discussion with your team.
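For context, a rough usage sketch of the API this PR adds. The standard keys are assumed to mirror HuggingFace's SPECIAL_TOKENS_ATTRIBUTES, and the file paths and token strings are placeholders, not values from this PR:

from torchtext.transforms import GPT2BPETokenizer

# Placeholder paths to the GPT-2 BPE assets.
tokenizer = GPT2BPETokenizer("encoder.json", "vocab.bpe", return_tokens=True)

# Standard tokens go under their named keys (assumed to follow HuggingFace's
# naming); anything else goes under "additional_special_tokens".
num_added = tokenizer.add_special_tokens(
    {
        "unk_token": "<|endoftext|>",
        "additional_special_tokens": ["<|user|>", "<|bot|>"],
    }
)
print(num_added)  # returns an int, per the signature in the diff above

# The registered tokens are kept intact instead of being split by the BPE regex.
print(tokenizer("<|user|> hello world"))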
Nayef211
left a comment
Thanks @reachsumit for taking the time to implement a highly requested feature to GPT2BPEEncoder. Let me know if my comments make sense. Happy to also have some offline discussions as needed before approving this PR! 😄
[](const c10::intrusive_ptr<GPT2BPEEncoder>& self,
   const std::unordered_map<std::string, std::string>& items,
   const std::vector<std::string>& additional) {
  c10::Dict<std::string, std::string> d;
  for (const auto& item : items)
    d.insert(item.first, item.second);
  return (self->AddSpecialTokens(d, additional));
})
I assume the reason you have this additional logic is because you're expecting a c10::Dict as inputs to AddSpecialTokens. Can we just get around this altogether by passing in std::unordered_map like we do in the constructor?
Ideally we want this file to be utilized only for pybind registration logic.
I originally tried using std::unordered_map, but I got the following compilation error (unsupported input type: std::unordered_map<Key, Value>. Please use Dict<Key, Value> instead.) when doing so.
/Users/sumitkumar/mambaforge/envs/bootcamp-rl/lib/python3.10/site-packages/torch/include/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:164:5: error: static_assert failed "You tried to register a kernel with an unsupported input type: std::unordered_map<Key, Value>. Please use Dict<Key, Value> instead."
static_assert(AllowDeprecatedTypes,
^ ~~~~~~~~~~~~~~~~~~~~
So I switched to c10::Dict instead. But it turned out that the Python dictionary doesn't directly convert to our custom c10::Dict map, and it gives me the following error if I try to do so.
TypeError: add_special_tokens(): incompatible function arguments. The following argument types are supported:
1. (self: torchtext._torchtext.GPT2BPEEncoder, arg0: c10::Dict<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long long>) -> int
Invoked with: <torchtext._torchtext.GPT2BPEEncoder object at 0x12428de70>, {}
Did you forget to `#include <pybind11/stl.h>`? Or <pybind11/complex.h>,
<pybind11/functional.h>, <pybind11/chrono.h>, etc. Some automatic
conversions are optional and require extra headers to be included
when compiling your pybind11 module.
So I chose to accept the Python dict as a C++ unordered_map, but convert it to a c10::Dict in the bindings to avoid the first error. This is similar to the logic for the constructor.
// - ELSE make token[-1] its own token and add to return list
// - ELSE IF prepend_space == True, prepend a space to the token and add to
// return list
// - ELSE, add token to return list
Can we add a few lines to the above comments that explain the additional logic needed to handle tokens contained in bpe_never_split_set_?
I added more comments in AddSpecialTokens to indicate the usage of this set.
  }
}

added_to_vocab_tokens_count += newly_added;
I might be missing something here but where exactly is this variable used?
I originally meant to use this variable to return the vocab size (original encoder size + added encoder size), similar to how the HuggingFace API provides it. But it appears that we don't really have any existing requirement for fetching the size, so I just removed it in the latest commit.
    index_matches.push_back(input.substr(it->position(), it->length()));
    last_idx = it->position() + it->length() + 1;
    if (isspace(input[last_idx])) {
      // rstrip
      last_idx++;
    }
  }
  if (last_idx < input.length() - 1)
    index_matches.push_back(
        input.substr(last_idx, input.length() - last_idx));
} else {
  index_matches.push_back(input);
}

for (std::string index_token : index_matches) {
  bool is_never_split_token =
      bpe_never_split_set_.find(index_token) != bpe_never_split_set_.end();
  if (is_never_split_token) {
    tokens.push_back(index_token);
    continue;
  }
  re2::StringPiece inp(index_token);
  while (kGPT2Regex.FindAndConsume(&inp, &token)) {
    if (is_whitespace(token)) {
      prepend_space = false;
      if (inp.empty()) { // token is last token
        tokens.push_back(token);
      } else {
        if (token.length() > 1) {
          tokens.push_back(token.substr(0, token.length() - 1));
        }
        if (token[token.length() - 1] == ' ') { // last char is space
          prepend_space = true;
        } else { // push last whitespace char as a token if it is not a space
          tokens.push_back(token.substr(token.length() - 1));
        }
      }
    } else if (prepend_space) {
      tokens.push_back(" " + token);
      prepend_space = false;
    } else {
      tokens.push_back(token);
    }
I'm having a bit of a hard time following all of the additional logic in this method. Having a more detailed explanation via code comments, plus adding a few lines to the commented pseudocode, would really help with readability for someone who is new to our codebase.
Sorry for the confusion. I agree that the logic is hard to understand just by looking at the code. I added a detailed explanation and an example of the steps in the latest commit. Let me know if that is sufficient or if any further details would help.
I think the explanation you added really helps with the code readability, thanks!
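For anyone reading this thread later, here is a rough Python sketch of the idea behind the new logic. It is a simplification, not the actual C++ implementation: the stand-in regex differs from kGPT2Regex, and the whitespace handling around special tokens (the rstrip/prepend_space logic in the snippet above) is handled differently in the C++ code.

import re

def pre_tokenize_with_special(text, never_split):
    # 1. Split the input into chunks so that every special token becomes its own chunk.
    pattern = "(" + "|".join(re.escape(t) for t in never_split) + ")"
    chunks = [c for c in re.split(pattern, text) if c]

    tokens = []
    for chunk in chunks:
        if chunk in never_split:
            # 2. Never-split chunks are emitted verbatim; the BPE regex never sees them.
            tokens.append(chunk)
        else:
            # 3. Everything else goes through the usual pre-tokenization pass
            #    (stand-in pattern here; the real one is kGPT2Regex).
            tokens.extend(re.findall(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+", chunk))
    return tokens

print(pre_tokenize_with_special("Hello <|endoftext|> world", {"<|endoftext|>"}))
# ['Hello', ' ', '<|endoftext|>', ' world']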
tokens.push_back(token.substr(0, token.length() - 1));
std::vector<std::string> index_matches;

/* Notes on handling Special Tokens:
This thread is also very helpful for understanding spacing in BPE tokenization: https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475
return self.bpe.add_special_tokens(
    {k: v for k, v in special_tokens_dict.items() if k != "additional_special_tokens"},
    special_tokens_dict.get("additional_special_tokens", []),
The reason for keeping standard special tokens separate from "additional" special tokens is to be able to easily provide access to those standard tokens later (like this: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L979).
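As a small illustration of why that separation is convenient (plain-Python sketch; the keys and token strings are placeholders, and the split mirrors the two arguments in the diff above):

special_tokens_dict = {
    "unk_token": "<|endoftext|>",
    "additional_special_tokens": ["<custom1>", "<custom2>"],
}

# Standard tokens remain addressable by role...
standard = {k: v for k, v in special_tokens_dict.items() if k != "additional_special_tokens"}
print(standard["unk_token"])   # -> <|endoftext|>

# ...while the additional ones are just an opaque list of never-split strings.
additional = special_tokens_dict.get("additional_special_tokens", [])
print(additional)              # -> ['<custom1>', '<custom2>']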
torchtext/transforms.py
Outdated
def __init__(self, encoder_json_path: str, vocab_bpe_path: str, return_tokens: bool = False) -> None:
    super().__init__()
    self._seperator = "\u0001"
    self.SPECIAL_TOKENS_ATTRIBUTES = [
These special tokens are fairly common across all tokenizers. Would it make sense to pull these out into commons or utils so that we can reuse them across multiple transforms? cc @Nayef211
I moved SPECIAL_TOKENS_ATTRIBUTES to utils in the latest commit.
@mthrok taught me about the YAGNI principle which I believe is quite applicable here. Let's not move these out into commons or utils until we see a concrete case where they can be reused. I also think it would make more sense for SPECIAL_TOKENS_ATTRIBUTES to be a class variable of GPT2BPETokenizer rather than an instance variable as these attributes would not be changing across instances. Lmk if this makes sense.
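For concreteness, a minimal sketch of that suggestion. The attribute list here is abbreviated and illustrative (the real list lives in transforms.py); only the constructor signature is taken from the diff above:

from torch.nn import Module

class GPT2BPETokenizer(Module):
    # Class variable: the set of recognized special-token attributes is the same
    # for every instance, so there is no need to rebuild it per instance.
    SPECIAL_TOKENS_ATTRIBUTES = [
        "bos_token",
        "eos_token",
        "unk_token",
        "additional_special_tokens",
    ]

    def __init__(self, encoder_json_path: str, vocab_bpe_path: str, return_tokens: bool = False) -> None:
        super().__init__()
        self._seperator = "\u0001"
        # self.SPECIAL_TOKENS_ATTRIBUTES still resolves to the class attribute,
        # so existing call sites keep working without a per-instance copy.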
Nayef211
left a comment
Thanks for addressing all the comments @reachsumit! The changes LGTM other than a few nits. I will merge this PR once those are resolved!
`add_special_tokens` API.
- form a regex pattern that helps in extracting special tokens from the
input text.
* Crate a vector that contains chunks of input text, such that each chunk is
nit: create
Force-pushed from 58b31ee to ad249fc
Thanks for your comments, @Nayef211! I addressed them in the latest commits and also rebased the change on the latest main branch. Let me know if there are any concerns.
Description
Add add_special_tokens feature to GPT2BPETokenizer
This change is targeted towards a requirement posted internally. It adds a new function add_special_tokens to GPT2BPETokenizer in order to enable the user to specify a dictionary of tokens that should not be changed during tokenization. Any newly specified token shall also be added to the vocabulary. The functionality is the same as HuggingFace's add_special_tokens feature.
Types of changes
- [x] New feature (non-breaking change which adds core functionality)
Changes made
- Updated the GPT2BPEEncoder class to support the add_special_tokens functionality. Also added the logic to ensure those special tokens are not split during tokenization.
- Added the add_special_tokens function to the Python interface (GPT2BPETokenizer class).
Testing
pre-commit