Implement ClipTokenizer that builds on top of GPT2BPETokenizer #1541
Conversation
Commits:
- Add case insensitive flag to CLIP pre tokenization regex
- Add Python interface
- Bring back gpt2
- Add docstring
- Update docs
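For readers wondering what the case-insensitive pre-tokenization flag in the first commit means in practice, here is a small illustrative sketch in Python. The pattern is modeled on OpenAI's published CLIP tokenizer regex; the PR's actual C++ pattern and flag handling may differ.

```python
# Illustrative only: a CLIP-style pre-tokenization pattern compiled with the
# case-insensitive flag, so contraction pieces ('s, 'll, ...) also match 'S, 'LL, ...
# Requires the third-party `regex` package for \p{L} / \p{N} character classes.
import regex as re

clip_pre_tokenize = re.compile(
    r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
    re.IGNORECASE,
)

print(clip_pre_tokenize.findall("A photo of a DOG, isn'T it?"))
# e.g. ['A', 'photo', 'of', 'a', 'DOG', ',', 'isn', "'T", 'it', '?']
```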
Test failures seem to be coming from dataset test failures!
ebsmothers left a comment:
Thanks so much for implementing this! This looks great from my perspective.
auto pairs = get_pairs(tok_list, seperator_);
if (pairs.empty()) {
-   return tok_list;
+   return {concatenated};
Is this the bug you mentioned previously when trying to use GPT2 BPE for CLIP tokenization?
Yes, this is the bug, although GPT-2 never really runs into it because pairs is not empty there. I came across it only when testing the CLIP tokenizer.
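To make the empty-pairs case concrete, here is a minimal Python sketch (not the C++ code from the PR), assuming CLIP-style BPE marks the end of a word by appending "</w>" to the last symbol: a one-character token has no pairs to merge, and an early return of the untouched input would silently drop that marker.

```python
# Minimal sketch, assuming CLIP-style BPE appends "</w>" to a word's last symbol
# before merging (not the actual torchtext C++ implementation).
def clip_bpe_start(token: str) -> list[str]:
    # Split into single characters and mark the end of the word.
    symbols = list(token[:-1]) + [token[-1] + "</w>"]
    pairs = {(a, b) for a, b in zip(symbols, symbols[1:])}
    if not pairs:
        # One-character token: nothing to merge. Returning the original,
        # unmarked characters here (the GPT-2-style early return) would lose
        # the "</w>" marker; returning the marked symbol is the fix.
        return symbols
    # ... merge loop elided ...
    return symbols

print(clip_bpe_start("a"))   # ['a</w>'] rather than ['a']
```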
while (i < tok_list.size()) {
  auto j = list_str_index(tok_list, parts.first, i);
  if (j != -1) {
    for (int k = i; k < j; k++) new_token_list.push_back(tok_list[k]);
Split into multiple lines?
Thanks for sharing this. Currently this file is formatted with the Google C++ Style. We are in the process of setting up clang-format for torchtext, so all formatting can be handled there.
encoder_json = "clip_encoder.json"
bpe_vocab = "clip_vocab.bpe"
Are these files different from the GPT2 files? Even if that's the case, would it be OK to rely on those files for unit testing?
Yes, they are different. Since our unit test is more like an e2e test, it is better to rely on the prod files, which we don't expect to change at the moment.
As a follow-up, let's also think about how to expose the prod files for tokenizers to users. If a user just wants to use the pre-trained text-processing files from established models (XLM-R, RoBERTa, CLIP), we should think about how best to expose them. As of now we do not have a standard way of doing so.
.def_property_readonly("bpe_encoder_", &CLIPEncoder::GetBPEEncoder)
.def_property_readonly("bpe_merge_ranks_", &CLIPEncoder::GetBPEMergeRanks)
.def_readonly("seperator_", &CLIPEncoder::seperator_)
.def_property_readonly("byte_encoder_", &CLIPEncoder::GetByteEncoder)
I wonder if it is necessary to expose these properties?
Yes, this is necessary because when we construct the TorchBind class in __prepare_scriptable__, we access these properties and pass them to its constructor.
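For readers unfamiliar with this pattern, here is a hedged sketch of what it looks like. The attribute names follow the pybind definitions quoted above, but the TorchBind registration name, constructor signature, and caching flag are assumptions, not the exact torchtext code.

```python
# Sketch only: rebuild the tokenizer around the TorchBind class inside
# __prepare_scriptable__ so torch.jit.script can handle it. The registration
# name and constructor arguments below are assumptions.
import torch

class CLIPTokenizerSketch(torch.nn.Module):
    def __init__(self, bpe):
        super().__init__()
        self.bpe = bpe  # pybind11-backed CLIPEncoder in eager mode

    def __prepare_scriptable__(self):
        # Read the pybind-exposed properties and hand them to the TorchBind
        # constructor (assumed to be registered as torch.classes.torchtext.CLIPEncoder).
        scriptable_bpe = torch.classes.torchtext.CLIPEncoder(
            self.bpe.bpe_encoder_,
            self.bpe.bpe_merge_ranks_,
            self.bpe.seperator_,
            self.bpe.byte_encoder_,
            False,  # assumed flag: disable caching in script mode
        )
        return CLIPTokenizerSketch(scriptable_bpe)
```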
ahh right, my bad. Completely missed that part. Thanks!
| """ | ||
| if not self.is_jitable: | ||
| tokenizer_copy = deepcopy(self) | ||
| # Disable caching in script mode |
Why do we disable caching in script mode? Also, do we have some stats on how much performance we gain with caching?
Disabling caching in script mode is the safe option. In the future, we can design an API that lets model developers choose whether they want caching in script mode or not.
In script mode the tokenizer is mostly used in production, where the cache can technically grow without bound. This can lead to OOM errors that would be hard to debug.
Hmm, I see. Sounds good. Technically we can always limit how much cache we keep, but I agree that would require some API redesign for user control, etc. We can park this topic until then.
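As a concrete version of the "limit the cache" idea (purely hypothetical, not part of this PR), a size-capped LRU cache would keep the speed-up of caching while bounding memory in long-running scripted deployments:

```python
# Hypothetical sketch of a bounded BPE cache (not part of the PR): evict the
# least-recently-used entry once the cache exceeds a fixed size.
from collections import OrderedDict

class BoundedCache:
    def __init__(self, max_entries: int = 10_000):
        self.max_entries = max_entries
        self._data: OrderedDict = OrderedDict()

    def get(self, key: str):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as recently used
        return self._data[key]

    def put(self, key: str, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # drop the least recently used entry
```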
parmeet left a comment:
LGTM, thanks for adding this feature :)
So, how soon until this makes its way into the release versions? And do I need to create a
This should be part of our upcoming release in March.
Yes, you would need to pass the
@abhinavarora and @parmeet, seeing as the encoder file is just the output of the vocab file, I think it might make sense to make the
@ProGamerGov I replied to you on #1612. Let's continue the discussion there :-) For now, you can use the json file and the txt file from HuggingFace, which can be found here.
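For reference, a usage sketch along those lines (the module path, argument names, and file paths are assumptions based on the torchtext API released later and may differ in your version):

```python
# Hedged usage sketch: wire the CLIP tokenizer up with the vocab/merges files
# mentioned above. Paths are placeholders; argument names are assumptions.
from torchtext.transforms import CLIPTokenizer

tokenizer = CLIPTokenizer(
    merges_path="path/to/merges.txt",        # BPE merges (txt) file
    encoder_json_path="path/to/vocab.json",  # encoder (json) file
)
print(tokenizer("a photo of a cat"))         # tokenized output
```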
This PR implements the CLIP tokenizer released by OpenAI:
- Adds `CLIPTokenizer`, which builds on top of `GPT2BPETokenizer`.
- Fixes a bug in the `BPE_` method that is applicable only to CLIP and was identified during CLIP testing.

Testing
- Test CLIP Tokenizer
- Test GPT2BPE Tokenizer
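An illustrative sketch of the kind of end-to-end check described above (the asset file names are taken from the diff earlier in the thread; the module path and everything else are assumptions, not copied from the actual test suite):

```python
# Illustrative test sketch only; not the actual torchtext test.
import unittest
from torchtext.transforms import CLIPTokenizer  # assumed module path

class TestCLIPTokenizer(unittest.TestCase):
    def test_case_insensitivity(self):
        tokenizer = CLIPTokenizer(
            merges_path="test/asset/clip_vocab.bpe",          # file name from the diff above
            encoder_json_path="test/asset/clip_encoder.json", # file name from the diff above
        )
        # CLIP lower-cases its input, so casing should not change the output.
        self.assertEqual(tokenizer("A Photo of a CAT"), tokenizer("a photo of a cat"))

if __name__ == "__main__":
    unittest.main()
```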