-
Notifications
You must be signed in to change notification settings - Fork 814
Modify CLIPTokenizer to either infer number of merges from encoder json or take it in constructor #1622
Conversation
…on or take it in constructor
Codecov Report
@@ Coverage Diff @@
## main #1622 +/- ##
==========================================
+ Coverage 85.25% 85.31% +0.05%
==========================================
Files 58 58
Lines 2483 2492 +9
==========================================
+ Hits 2117 2126 +9
Misses 366 366
Continue to review full report at Codecov.
|
ebsmothers
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great from my perspective. Thanks for the quick fix!
| :param encoder_json_path: Path to BPE encoder json file. | ||
| :type encoder_json_path: str | ||
| :param vocab_bpe_path: Path to bpe vocab file. | ||
| :type vocab_bpe_path: str | ||
| :param num_merges: Number of merges to read from the bpe merges file. | ||
| :type num_merges: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indicate in the docstring that these params are optional
| encoder_json_path: str, | ||
| vocab_bpe_path: str, | ||
| ): | ||
| def __init__(self, merges_path: str, encoder_json_path: Optional[str] = None, num_merges: Optional[int] = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking change. Is that OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK. This is still to be released in upcoming cycle.
| self._seperator.join(merge_pair.split()): i for i, merge_pair in enumerate(bpe_vocab.split("\n")[1:-1]) | ||
| } | ||
| # load bpe merges | ||
| with open(get_asset_local_path(merges_path), "r", encoding="utf-8") as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be a bit too much for this PR, but for sake of starting the conversation... is it possible to move the files from the constructor into a classmethod? The constructor is doing a lot of work and if people have their own merges, there's no direct way they can construct this without first writing them to a file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I am not sure if I follow it completely. So as per the current interface user would have to provide the file paths. Are you suggesting that the constructor should allow passing both file paths as well as direct merges object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I'm suggesting that the constructor should not deal with files at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is a broader discussion, not specific to this PR. I believe there has been a discussion around this before. I agree with you, that in future we should deal with file-like objects. I remember there were some concerns around this./ Maybe @parmeet may remember.
@erip Let's track this in a separate issue, so that we can standardize this across all our methods that depend on external files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good -- opened #1624
| This tokenizer has been trained to treat spaces like parts of the tokens | ||
| (a bit like sentencepiece) so a word will be encoded differently whether it | ||
| is at the beginning of the sentence (without space) or not. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abhinavarora As we discussed earlier for tokenizers doc-strings to provide information on standard out-of-the-box artifacts that we host, can we also update to include example and provide artifacts path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in the latest commit!
|
The CircleCI error seems unrelated to this PR. Will merge this PR and debug later. |
…on or take it in constructor (pytorch#1622)
Problem
@ebsmothers reported that current CLIPTokenizer runs into errors when original merges files is loaded from OpenAI. This happens because OpenAI codebase hardcodes the number of merges which torchtext did not do.
Similarly @ProGamerGov reported #1612 .
Solution
This PR addresses these issues by doing the following:
num_mergesparam that users can provide.Testing