This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@abhinavarora
Contributor

Problem

@ebsmothers reported that the current CLIPTokenizer runs into errors when the original merges file from OpenAI is loaded. This happens because the OpenAI codebase hardcodes the number of merges, which torchtext did not do.

Similarly, @ProGamerGov reported #1612.

Solution

This PR addresses these issues by doing the following:

  1. Enable initializing CLIPTokenizer with just the BPE merges file, similar to OpenAI.
  2. Add a num_merges parameter that users can provide.
  3. If the encoder JSON is provided, use it to infer the number of merges.
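
The resolution order in steps 2 and 3 can be sketched as follows. This is a hedged illustration of the logic, not the actual torchtext implementation; in particular, the vocabulary arithmetic (256 byte tokens plus 256 `</w>` end-of-word variants plus 2 special tokens) is an assumption based on the CLIP BPE scheme.

```python
import json
from typing import Optional


def resolve_num_merges(merges_path: str,
                       encoder_json_path: Optional[str] = None,
                       num_merges: Optional[int] = None) -> int:
    """Decide how many merge rules to read (illustrative sketch only)."""
    if num_merges is not None:
        return num_merges  # an explicit user-provided count wins
    if encoder_json_path is not None:
        with open(encoder_json_path, "r", encoding="utf-8") as f:
            encoder = json.load(f)
        # Assumption: every merge adds one vocab entry beyond the 256 byte
        # tokens, their 256 '</w>' variants, and 2 special tokens.
        return len(encoder) - 256 - 256 - 2
    # Otherwise fall back to reading every merge rule in the file
    # (the first line is a version header, the trailing split entry is empty).
    with open(merges_path, "r", encoding="utf-8") as f:
        return len(f.read().split("\n")[1:-1])
```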

Testing

  • Added new tests with an example that was failing earlier.
  • Added tests for different types of initialization
pytest test/test_transforms.py -k test_clip

@codecov

codecov bot commented Feb 18, 2022

Codecov Report

Merging #1622 (3862570) into main (16acc71) will increase coverage by 0.05%.
The diff coverage is 100.00%.

❗ Current head 3862570 differs from pull request most recent head c9cb1d2. Consider uploading reports for the commit c9cb1d2 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1622      +/-   ##
==========================================
+ Coverage   85.25%   85.31%   +0.05%     
==========================================
  Files          58       58              
  Lines        2483     2492       +9     
==========================================
+ Hits         2117     2126       +9     
  Misses        366      366              
Impacted Files            Coverage Δ
torchtext/transforms.py   96.19% <100.00%> (+0.19%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 16acc71...c9cb1d2. Read the comment docs.

Contributor

@ebsmothers ebsmothers left a comment

Looks great from my perspective. Thanks for the quick fix!

Comment on lines 327 to 330
:param encoder_json_path: Path to BPE encoder json file.
:type encoder_json_path: str
:param vocab_bpe_path: Path to bpe vocab file.
:type vocab_bpe_path: str
:param num_merges: Number of merges to read from the bpe merges file.
:type num_merges: int
Contributor

nit: indicate in the docstring that these params are optional

-        encoder_json_path: str,
-        vocab_bpe_path: str,
-    ):
+    def __init__(self, merges_path: str, encoder_json_path: Optional[str] = None, num_merges: Optional[int] = None):
Contributor

This is a breaking change. Is that OK?

Contributor

I think it's OK. This is still to be released in upcoming cycle.

self._seperator.join(merge_pair.split()): i for i, merge_pair in enumerate(bpe_vocab.split("\n")[1:-1])
}
# load bpe merges
with open(get_asset_local_path(merges_path), "r", encoding="utf-8") as f:
Contributor

This may be a bit too much for this PR, but for sake of starting the conversation... is it possible to move the files from the constructor into a classmethod? The constructor is doing a lot of work and if people have their own merges, there's no direct way they can construct this without first writing them to a file.

Contributor

@parmeet parmeet Feb 20, 2022

Sorry, I am not sure I follow completely. As per the current interface, the user would have to provide the file paths. Are you suggesting that the constructor should allow passing both file paths and a direct merges object?

Contributor

No, I'm suggesting that the constructor should not deal with files at all.
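
A minimal sketch of that pattern, with a hypothetical `ToyBPE` class standing in for CLIPTokenizer (this is not the torchtext API): the constructor takes an in-memory merge table, and a classmethod owns the file I/O.

```python
from typing import Dict, Optional


class ToyBPE:
    """Hypothetical stand-in for CLIPTokenizer, for illustration only."""

    def __init__(self, merge_ranks: Dict[str, int]):
        # Pure in-memory construction: callers with their own merges
        # never have to write them to disk first.
        self.merge_ranks = merge_ranks

    @classmethod
    def from_merges_file(cls, merges_path: str,
                         num_merges: Optional[int] = None) -> "ToyBPE":
        # All file handling lives here, not in __init__.
        with open(merges_path, "r", encoding="utf-8") as f:
            lines = f.read().split("\n")[1:-1]  # skip the version header
        if num_merges is not None:
            lines = lines[:num_merges]
        return cls({" ".join(pair.split()): i for i, pair in enumerate(lines)})
```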

Contributor Author

I believe this is a broader discussion, not specific to this PR. There has been discussion around this before, and I agree that in the future we should deal with file-like objects. I remember there were some concerns around this; maybe @parmeet remembers.

@erip Let's track this in a separate issue, so that we can standardize this across all our methods that depend on external files.

Contributor

Sounds good -- opened #1624

This tokenizer has been trained to treat spaces like parts of the tokens
(a bit like sentencepiece) so a word will be encoded differently whether it
is at the beginning of the sentence (without space) or not.
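
A toy illustration of this behavior (the regex below is a simplified stand-in for the real byte-level pre-tokenizer, not the actual CLIP pattern): splitting keeps each word's leading space attached, so the same word yields different pre-tokens depending on its position.

```python
import re


def pre_tokenize(text: str) -> list:
    # Simplified GPT-2/CLIP-style pre-tokenization: a word and its
    # leading space travel together as one pre-token.
    return re.findall(r" ?\w+|[^\w ]+", text)


print(pre_tokenize("world peace"))  # ['world', ' peace']
print(pre_tokenize("say world"))    # ['say', ' world'] -- 'world' vs ' world'
```

Because `'world'` and `' world'` are distinct pre-tokens, they generally map to different BPE token sequences.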
Contributor

@abhinavarora As we discussed earlier, tokenizer docstrings should provide information on the standard out-of-the-box artifacts that we host. Can you also update this one to include an example and provide the artifact paths?

Contributor Author

Added in the latest commit!

@abhinavarora
Contributor Author

The CircleCI error seems unrelated to this PR. Will merge this PR and debug later.

@abhinavarora abhinavarora merged commit 81212ba into pytorch:main Feb 22, 2022
abhinavarora added a commit to abhinavarora/text that referenced this pull request Feb 22, 2022
parmeet pushed a commit that referenced this pull request Feb 23, 2022


Successfully merging this pull request may close these issues.

Make the CLIPTokenizer's encoder_json_path variable optional, and use dict(zip(vocab, range(len(vocab)))) instead
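
The construction named in the issue title builds the token-to-index encoder directly from a vocabulary list; a one-line sketch (the sample tokens are placeholders, not the real CLIP vocabulary):

```python
vocab = ["<|startoftext|>", "hello</w>", "world</w>"]
encoder = dict(zip(vocab, range(len(vocab))))
print(encoder)  # {'<|startoftext|>': 0, 'hello</w>': 1, 'world</w>': 2}
```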

5 participants