
Conversation

@abhinavarora (Contributor) commented Jan 27, 2022

This PR implements the CLIP tokenizer released by OpenAI:

  • Implements a C++ CLIPEncoder that extends the BPE encoder with the necessary modifications.
  • Adds a Python interface for CLIPTokenizer.
  • Adds unit tests for the CLIPTokenizer.
  • Adds an entry to the transforms docs.
  • Fixes a bug in the BPE_ method that is applicable only to CLIP and was identified during CLIP testing.

Testing

Test CLIP Tokenizer

pytest -k test_clip_ test/test_transforms.py

Test GPT2BPE Tokenizer

pytest -k test_gpt2_ test/test_transforms.py

Add case insensitive flag to CLIP pre tokenization regex

Add Python interface

Bring back gpt2

Add docstring

Update docs
@abhinavarora (Contributor, Author):

Test failures seem to be coming from dataset testing failures!

@ebsmothers (Contributor) left a comment

Thanks so much for implementing this! This looks great from my perspective.

auto pairs = get_pairs(tok_list, seperator_);
if (pairs.empty()) {
  return tok_list;
  return {concatenated};
Contributor

Is this the bug you mentioned previously when trying to use GPT2 BPE for CLIP tokenization?

Contributor Author

Yes, this is the bug. GPT-2 never really runs into it because pairs is not empty there; I came across it only when testing with the CLIP tokenizer.
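For illustration, here is a minimal Python sketch (not the torchtext C++ code) of what "pairs is empty" means: get_pairs collects adjacent symbol pairs of a token, so a token consisting of a single symbol, such as a one-character word once CLIP's "</w>" end-of-word marker is attached, yields no pairs at all and hits the branch being fixed.

```python
def get_pairs(symbols):
    # Collect the set of adjacent symbol pairs in a token,
    # e.g. ("l", "o") and ("o", "w</w>") for ("l", "o", "w</w>").
    return {(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)}

print(get_pairs(("l", "o", "w</w>")))  # {('l', 'o'), ('o', 'w</w>')}
print(get_pairs(("a</w>",)))           # set() -> the empty-pairs branch discussed above
```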

while (i < tok_list.size()) {
  auto j = list_str_index(tok_list, parts.first, i);
  if (j != -1) {
    for (int k = i; k < j; k++) new_token_list.push_back(tok_list[k]);
Contributor

Split into multiple lines?

Contributor Author

Thanks for sharing this. Currently this file is formatted with the Google C++ Style Guide. We are in the process of setting up clang-format for torchtext, so all formatting can be handled there.

Comment on lines +298 to +299
encoder_json = "clip_encoder.json"
bpe_vocab = "clip_vocab.bpe"
Contributor

Are these files different from GPT2 files? Even if that's the case, would it be ok to rely on those files for unit-testing?

Contributor Author

Yes, they are different. Since our unit test is more like an e2e test, it is better to rely on the prod files, which we don't expect to change at the moment.

Contributor

As a follow-up, let's also think about how to expose the prod files for tokenizers to users. If users just want the pre-trained text processing files from established models (XLM-R, RoBERTa, CLIP), we should think about how best to expose them. As of now we do not have a standard way of doing so.

Comment on lines +186 to +189
.def_property_readonly("bpe_encoder_", &CLIPEncoder::GetBPEEncoder)
.def_property_readonly("bpe_merge_ranks_", &CLIPEncoder::GetBPEMergeRanks)
.def_readonly("seperator_", &CLIPEncoder::seperator_)
.def_property_readonly("byte_encoder_", &CLIPEncoder::GetByteEncoder)
Contributor

I wonder if it is necessary to expose these properties?

Contributor Author

Yes, this is necessary because when we call the TorchBind class in __prepare_scriptable__, we access these properties and pass them to the TorchBind constructor.
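For context, here is a rough Python sketch of the pattern being described; the property names come from the binding above, but the exact TorchBind class path and constructor signature are assumptions, not the verbatim torchtext code.

```python
import torch
from copy import deepcopy

class CLIPTokenizer(torch.nn.Module):
    # __init__, forward, etc. omitted; self.bpe holds the pybind CLIPEncoder.

    def __prepare_scriptable__(self):
        # When scripting, rebuild the encoder as its TorchBind counterpart.
        # The read-only properties exposed above are what make this possible.
        if not self.is_jitable:
            tokenizer_copy = deepcopy(self)
            tokenizer_copy.bpe = torch.classes.torchtext.CLIPEncoder(
                self.bpe.bpe_encoder_,
                self.bpe.bpe_merge_ranks_,
                self.bpe.seperator_,
                self.bpe.byte_encoder_,
                False,  # disable caching in script mode (see discussion below)
            )
            return tokenizer_copy
        return self
```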

Contributor

ahh right, my bad. Completely missed that part. Thanks!

"""
if not self.is_jitable:
tokenizer_copy = deepcopy(self)
# Disable caching in script mode
Contributor

Why do we disable caching in script mode? Also, do we have some stats on how much performance we gain from caching?

Contributor Author

Disabling caching in script mode is the safe option. In the future, we can design an API that lets model developers choose whether they want caching in script mode or not.

In script mode the tokenizer is mostly used in production, where the cache can technically grow without bound. This can lead to OOM errors that are hard to debug.

Contributor

Hmm, I see. Sounds good. Technically we can always limit how much cache we keep, but I agree that would require some API redesign for user control, etc. We can park this topic until then.
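For reference, a minimal sketch of the kind of size-limited cache being floated here, written in Python purely for illustration (the real cache lives in the C++ encoder, and the class name and entry limit below are hypothetical):

```python
from collections import OrderedDict

class BoundedBPECache:
    """LRU-style cache that bounds memory by evicting the least recently used entry."""

    def __init__(self, max_entries: int = 10000):
        self.max_entries = max_entries
        self._cache = OrderedDict()  # token -> list of BPE sub-tokens

    def get(self, token):
        if token not in self._cache:
            return None
        self._cache.move_to_end(token)  # mark as most recently used
        return self._cache[token]

    def put(self, token, bpe_tokens):
        self._cache[token] = bpe_tokens
        self._cache.move_to_end(token)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict the least recently used entry
```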

@abhinavarora abhinavarora requested a review from parmeet January 31, 2022 04:06
@parmeet (Contributor) left a comment

LGTM, thanks for adding this feature :)

@abhinavarora abhinavarora merged commit 448a791 into pytorch:main Feb 1, 2022
@abhinavarora abhinavarora deleted the clip_tokenizer branch February 1, 2022 18:38
@ProGamerGov (Contributor) commented Feb 13, 2022

So, how soon until this makes its way into the release versions?

And do I need to create a .json file to use CLIP's vocab file from here: https://github.com/openai/CLIP/blob/main/clip/bpe_simple_vocab_16e6.txt.gz? If so, how do I create it?

@parmeet (Contributor) commented Feb 15, 2022

> So, how soon until this makes its way into the release versions?

This should be part of our upcoming release in March.

> And do I need to create a .json file to use CLIP's vocab file from here? https://github.com/openai/CLIP/blob/main/clip/bpe_simple_vocab_16e6.txt.gz, and if so how do I create it?

Yes, you would need to pass the encoder json file per the current interface exposed in transforms. You can create it as per the code here and save the dictionary as a json file. cc: @abhinavarora
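For illustration, a rough sketch of that approach, mirroring the vocabulary construction in OpenAI's clip/simple_tokenizer.py; the file names below and the assumption that the openai/CLIP package is importable are mine, so verify against that code before relying on it:

```python
import gzip
import json

from clip.simple_tokenizer import bytes_to_unicode  # from the openai/CLIP repo

# Rebuild the token -> id mapping the way OpenAI's SimpleTokenizer does,
# then dump it as the encoder json expected by the transform.
with gzip.open("bpe_simple_vocab_16e6.txt.gz", "rt", encoding="utf-8") as f:
    merges = f.read().split("\n")
merges = merges[1 : 49152 - 256 - 2 + 1]            # drop the header line, keep the merges CLIP uses
merges = [tuple(m.split()) for m in merges]

vocab = list(bytes_to_unicode().values())           # 256 base byte tokens
vocab = vocab + [v + "</w>" for v in vocab]         # end-of-word variants
vocab.extend("".join(m) for m in merges)            # merged tokens
vocab.extend(["<|startoftext|>", "<|endoftext|>"])  # special tokens

encoder = {token: i for i, token in enumerate(vocab)}

with open("clip_encoder.json", "w") as f:
    json.dump(encoder, f)
```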

@ProGamerGov (Contributor) commented Feb 15, 2022

@abhinavarora and @parmeet, seeing as the encoder file is just the output of the vocab file, I think it might make sense to make the encoder.json file optional: #1612

@abhinavarora (Contributor, Author):

> @abhinavarora and @parmeet, seeing as the encoder file is just the output of the vocab file, I think it might make sense to make the encoder.json file optional: #1612

@ProGamerGov I replied to you on #1612. Let's continue the discussion there :-) For now, you can use the json file and the txt file from HuggingFace, which can be found here.
