-
Notifications
You must be signed in to change notification settings - Fork 814
Modify CLIPTokenizer to either infer number of merges from encoder json or take it in constructor #1622
Modify CLIPTokenizer to either infer number of merges from encoder json or take it in constructor #1622
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -322,30 +322,55 @@ class CLIPTokenizer(Module): | |
| (a bit like sentencepiece) so a word will be encoded differently whether it | ||
| is at the beginning of the sentence (without space) or not. | ||
|
|
||
| :param encoder_json_path: Path to BPE encoder json file. | ||
| The below code snippet shows how to use the CLIP tokenizer with encoder and merges file | ||
| taken from the original paper implementation. | ||
|
|
||
| Example | ||
| >>> from torchtext.transforms import CLIPTokenizer | ||
| >>> MERGES_FILE = "http://download.pytorch.org/models/text/clip_merges.bpe" | ||
| >>> ENCODER_FILE = "http://download.pytorch.org/models/text/clip_encoder.json" | ||
| >>> tokenizer = CLIPTokenizer(merges_path=MERGES_FILE, encoder_json_path=ENCODER_FILE) | ||
| >>> tokenizer("the quick brown fox jumped over the lazy dog") | ||
|
|
||
| :param merges_path: Path to bpe merges file. | ||
| :type merges_path: str | ||
| :param encoder_json_path: Optional, path to BPE encoder json file. When specified, this is used | ||
| to infer num_merges. | ||
| :type encoder_json_path: str | ||
| :param vocab_bpe_path: Path to bpe vocab file. | ||
| :type vocab_bpe_path: str | ||
| :param num_merges: Optional, number of merges to read from the bpe merges file. | ||
| :type num_merges: int | ||
| """ | ||
|
|
||
| _seperator: torch.jit.Final[str] | ||
|
|
||
| def __init__( | ||
| self, | ||
| encoder_json_path: str, | ||
| vocab_bpe_path: str, | ||
| ): | ||
| def __init__(self, merges_path: str, encoder_json_path: Optional[str] = None, num_merges: Optional[int] = None): | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a breaking change. Is that OK?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's OK. This is still to be released in upcoming cycle. |
||
| super().__init__() | ||
| self._seperator = "\u0001" | ||
| # load bpe encoder | ||
| with open(get_asset_local_path(encoder_json_path), "r", encoding="utf-8") as f: | ||
| bpe_encoder = json.load(f) | ||
| # load bpe vocab | ||
| with open(get_asset_local_path(vocab_bpe_path), "r", encoding="utf-8") as f: | ||
| bpe_vocab = f.read() | ||
| bpe_merge_ranks = { | ||
| self._seperator.join(merge_pair.split()): i for i, merge_pair in enumerate(bpe_vocab.split("\n")[1:-1]) | ||
| } | ||
| # load bpe merges | ||
| with open(get_asset_local_path(merges_path), "r", encoding="utf-8") as f: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This may be a bit too much for this PR, but for sake of starting the conversation... is it possible to move the files from the constructor into a classmethod? The constructor is doing a lot of work and if people have their own merges, there's no direct way they can construct this without first writing them to a file.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, I am not sure if I follow it completely. So as per the current interface user would have to provide the file paths. Are you suggesting that the constructor should allow passing both file paths as well as direct
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, I'm suggesting that the constructor should not deal with files at all.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe this is a broader discussion, not specific to this PR. I believe there has been a discussion around this before. I agree with you, that in future we should deal with file-like objects. I remember there were some concerns around this./ Maybe @parmeet may remember. @erip Let's track this in a separate issue, so that we can standardize this across all our methods that depend on external files.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sounds good -- opened #1624 |
||
| bpe_merges = f.read().split("\n")[1:] | ||
|
|
||
| if encoder_json_path: | ||
| # load bpe encoder | ||
| with open(get_asset_local_path(encoder_json_path), "r", encoding="utf-8") as f: | ||
| bpe_encoder = json.load(f) | ||
| # 256 * 2 for each byte. For each byte we have ['a', 'a</w>'] | ||
| # Additional 2 tokens for bos and eos | ||
| num_merges = len(bpe_encoder) - (256 * 2 + 2) | ||
| bpe_merge_ranks = { | ||
| self._seperator.join(merge_pair.split()): i for i, merge_pair in enumerate(bpe_merges[:num_merges]) | ||
| } | ||
| else: | ||
| num_merges = num_merges or len(bpe_merges) | ||
| bpe_merge_ranks = { | ||
| self._seperator.join(merge_pair.split()): i for i, merge_pair in enumerate(bpe_merges[:num_merges]) | ||
| } | ||
| bpe_vocab = list(bytes_to_unicode().values()) | ||
| bpe_vocab = bpe_vocab + [v + "</w>" for v in bpe_vocab] | ||
| bpe_vocab.extend(["".join(merge_pair.split()) for merge_pair in bpe_merges[:num_merges]]) | ||
| bpe_vocab.extend(["<|startoftext|>", "<|endoftext|>"]) | ||
| bpe_encoder = {v: i for i, v in enumerate(bpe_vocab)} | ||
|
|
||
| # Caching is enabled in Eager mode | ||
| self.bpe = CLIPEncoderPyBind(bpe_encoder, bpe_merge_ranks, self._seperator, bytes_to_unicode(), True) | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abhinavarora As we discussed earlier for tokenizers doc-strings to provide information on standard out-of-the-box artifacts that we host, can we also update to include example and provide artifacts path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in the latest commit!