
Conversation

@abhinavarora (Contributor) commented Jan 27, 2022

This PR implements the CLIP tokenizer released by OpenAI:

  • Implements a C++ CLIPEncoder that extends the BPE encoder with the necessary modifications.
  • Adds a Python interface for CLIPTokenizer.
  • Adds unit tests for the CLIPTokenizer.
  • Adds an entry to the transforms docs.
  • Fixes a bug in the BPE_ method that is applicable only to CLIP and was identified during CLIP testing.

Testing

Test CLIP Tokenizer

pytest -k test_clip_ test/test_transforms.py

Test GPT2BPE Tokenizer

pytest -k test_gpt2_ test/test_transforms.py

Add case insensitive flag to CLIP pre tokenization regex

Add Python interface

Bring back gpt2

Add docstring

Update docs
@abhinavarora (Contributor, Author):

Test failures seem to be coming from dataset testing failures!

@ebsmothers (Contributor) left a comment

Thanks so much for implementing this! This looks great from my perspective.

auto pairs = get_pairs(tok_list, seperator_);
if (pairs.empty()) {
  return tok_list;
  return {concatenated};
Contributor

Is this the bug you mentioned previously when trying to use GPT2 BPE for CLIP tokenization?

Contributor Author

Yes, this is the bug. GPT-2 never really runs into it because pairs is not empty there; I came across it only when testing with the CLIP tokenizer.
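For illustration, here is a minimal Python sketch (not the torchtext C++ code) of what "pairs is empty" means: get_pairs collects adjacent symbol pairs of a token, so a token consisting of a single symbol, such as a one-character word once CLIP's "</w>" end-of-word marker is attached, yields no pairs at all and hits the branch being fixed.

```python
def get_pairs(symbols):
    # Collect the set of adjacent symbol pairs in a token,
    # e.g. ("l", "o") and ("o", "w</w>") for ("l", "o", "w</w>").
    return {(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)}

print(get_pairs(("l", "o", "w</w>")))  # {('l', 'o'), ('o', 'w</w>')}
print(get_pairs(("a</w>",)))           # set() -> the empty-pairs branch discussed above
```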

while (i < tok_list.size()) {
  auto j = list_str_index(tok_list, parts.first, i);
  if (j != -1) {
    for (int k = i; k < j; k++) new_token_list.push_back(tok_list[k]);
Contributor

Split into multiple lines?

Contributor Author

Thanks for sharing this. Currently this file is formatted with the Google C++ Style Guide. We are in the process of setting up clang-format for torchtext, so all formatting can be handled there.

Comment on lines +298 to +299
encoder_json = "clip_encoder.json"
bpe_vocab = "clip_vocab.bpe"
Contributor

Are these files different from GPT2 files? Even if that's the case, would it be ok to rely on those files for unit-testing?

Contributor Author

Yes, they are different. Since our unit test is more like an e2e test, it is better to rely on the prod files, which we don't expect to change at the moment.

Contributor

As a follow-up, let's also think about how to expose the prod files for tokenizers to users. If users just want the pre-trained text processing files from established models (XLM-R, RoBERTa, CLIP), we should think about how best to expose them. As of now we do not have a standard way of doing so.

Comment on lines +186 to +189
.def_property_readonly("bpe_encoder_", &CLIPEncoder::GetBPEEncoder)
.def_property_readonly("bpe_merge_ranks_", &CLIPEncoder::GetBPEMergeRanks)
.def_readonly("seperator_", &CLIPEncoder::seperator_)
.def_property_readonly("byte_encoder_", &CLIPEncoder::GetByteEncoder)
Contributor

I wonder if it is necessary to expose these properties?

Contributor Author

Yes, this is necessary because when we call the TorchBind class in __prepare_scriptable__, we access these properties and pass them to the TorchBind constructor.
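For context, here is a rough Python sketch of the pattern being described; the property names come from the binding above, but the exact TorchBind class path and constructor signature are assumptions, not the verbatim torchtext code.

```python
import torch
from copy import deepcopy

class CLIPTokenizer(torch.nn.Module):
    # __init__, forward, etc. omitted; self.bpe holds the pybind CLIPEncoder.

    def __prepare_scriptable__(self):
        # When scripting, rebuild the encoder as its TorchBind counterpart.
        # The read-only properties exposed above are what make this possible.
        if not self.is_jitable:
            tokenizer_copy = deepcopy(self)
            tokenizer_copy.bpe = torch.classes.torchtext.CLIPEncoder(
                self.bpe.bpe_encoder_,
                self.bpe.bpe_merge_ranks_,
                self.bpe.seperator_,
                self.bpe.byte_encoder_,
                False,  # disable caching in script mode (see discussion below)
            )
            return tokenizer_copy
        return self
```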

Contributor

ahh right, my bad. Completely missed that part. Thanks!

"""
if not self.is_jitable:
tokenizer_copy = deepcopy(self)
# Disable caching in script mode
Contributor

Why do we disable caching in script mode? Also, do we have some stats on how much performance we gain from caching?

Contributor Author

Disabling caching in script mode is the safe option. In the future, we can design an API that lets model developers choose whether they want caching in script mode or not.

In script mode the tokenizer is mostly used in production, where the cache can technically grow without bound. This can lead to OOM errors that are hard to debug.

Contributor

Hmm, I see. Sounds good. Technically we can always limit how much cache we keep, but I agree that would require some API redesign for user control, etc. We can park this topic until then.
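For reference, a minimal sketch of the kind of size-limited cache being floated here, written in Python purely for illustration (the real cache lives in the C++ encoder, and the class name and entry limit below are hypothetical):

```python
from collections import OrderedDict

class BoundedBPECache:
    """LRU-style cache that bounds memory by evicting the least recently used entry."""

    def __init__(self, max_entries: int = 10000):
        self.max_entries = max_entries
        self._cache = OrderedDict()  # token -> list of BPE sub-tokens

    def get(self, token):
        if token not in self._cache:
            return None
        self._cache.move_to_end(token)  # mark as most recently used
        return self._cache[token]

    def put(self, token, bpe_tokens):
        self._cache[token] = bpe_tokens
        self._cache.move_to_end(token)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict the least recently used entry
```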

@abhinavarora abhinavarora requested a review from parmeet January 31, 2022 04:06
@parmeet (Contributor) left a comment

LGTM, thanks for adding this feature :)

@abhinavarora abhinavarora merged commit 448a791 into pytorch:main Feb 1, 2022
@abhinavarora abhinavarora deleted the clip_tokenizer branch February 1, 2022 18:38
@ProGamerGov (Contributor) commented Feb 13, 2022

So, how soon until this makes its way into the release versions?

And do I need to create a .json file to use CLIP's vocab file from here: https://github.com/openai/CLIP/blob/main/clip/bpe_simple_vocab_16e6.txt.gz? If so, how do I create it?

@parmeet (Contributor) commented Feb 15, 2022

> So, how soon until this makes its way into the release versions?

This should be part of our upcoming release in March.

> And do I need to create a .json file to use CLIP's vocab file from here? https://github.com/openai/CLIP/blob/main/clip/bpe_simple_vocab_16e6.txt.gz, and if so how do I create it?

Yes, you would need to pass the encoder json file per the current interface exposed in transforms. You can create it as per the code here and save the dictionary as a json file. cc: @abhinavarora
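For illustration, a rough sketch of that approach, mirroring the vocabulary construction in OpenAI's clip/simple_tokenizer.py; the file names below and the assumption that the openai/CLIP package is importable are mine, so verify against that code before relying on it:

```python
import gzip
import json

from clip.simple_tokenizer import bytes_to_unicode  # from the openai/CLIP repo

# Rebuild the token -> id mapping the way OpenAI's SimpleTokenizer does,
# then dump it as the encoder json expected by the transform.
with gzip.open("bpe_simple_vocab_16e6.txt.gz", "rt", encoding="utf-8") as f:
    merges = f.read().split("\n")
merges = merges[1 : 49152 - 256 - 2 + 1]            # drop the header line, keep the merges CLIP uses
merges = [tuple(m.split()) for m in merges]

vocab = list(bytes_to_unicode().values())           # 256 base byte tokens
vocab = vocab + [v + "</w>" for v in vocab]         # end-of-word variants
vocab.extend("".join(m) for m in merges)            # merged tokens
vocab.extend(["<|startoftext|>", "<|endoftext|>"])  # special tokens

encoder = {token: i for i, token in enumerate(vocab)}

with open("clip_encoder.json", "w") as f:
    json.dump(encoder, f)
```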

@ProGamerGov (Contributor) commented Feb 15, 2022

@abhinavarora and @parmeet, seeing as the encoder file is just the output of the vocab file, I think it might make sense to make the encoder.json file optional: #1612

@abhinavarora (Contributor, Author):

> @abhinavarora and @parmeet, seeing as the encoder file is just the output of the vocab file, I think it might make sense to make the encoder.json file optional: #1612

@ProGamerGov I replied to you on #1612. Let's continue the discussion there :-) For now, you can use the json file and the txt file from HuggingFace, which can be found here.
