This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@abhinavarora
Contributor

Problem

@ebsmothers reported that the current CLIPTokenizer runs into errors when the original merges file from OpenAI is loaded. This happens because the OpenAI codebase hardcodes the number of merges, which torchtext did not do.

Similarly, @ProGamerGov reported #1612.

Solution

This PR addresses these issues by doing the following:

  1. Enable initializing CLIPTokenizer with just the BPE merges file, similar to OpenAI.
  2. Add a num_merges parameter that users can provide.
  3. If the encoder JSON is provided, use it to infer the number of merges.
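
The resolution order in steps 2 and 3 can be sketched as follows. This is a hedged illustration of the logic, not the actual torchtext implementation; in particular, the vocabulary arithmetic (256 byte tokens plus 256 `</w>` end-of-word variants plus 2 special tokens) is an assumption based on the CLIP BPE scheme.

```python
import json
from typing import Optional


def resolve_num_merges(merges_path: str,
                       encoder_json_path: Optional[str] = None,
                       num_merges: Optional[int] = None) -> int:
    """Decide how many merge rules to read (illustrative sketch only)."""
    if num_merges is not None:
        return num_merges  # an explicit user-provided count wins
    if encoder_json_path is not None:
        with open(encoder_json_path, "r", encoding="utf-8") as f:
            encoder = json.load(f)
        # Assumption: every merge adds one vocab entry beyond the 256 byte
        # tokens, their 256 '</w>' variants, and 2 special tokens.
        return len(encoder) - 256 - 256 - 2
    # Otherwise fall back to reading every merge rule in the file
    # (the first line is a version header, the trailing split entry is empty).
    with open(merges_path, "r", encoding="utf-8") as f:
        return len(f.read().split("\n")[1:-1])
```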

Testing

  • Added new tests with an example that was failing earlier.
  • Added tests for different types of initialization
pytest test/test_transforms.py -k test_clip

@codecov

codecov bot commented Feb 18, 2022

Codecov Report

Merging #1622 (3862570) into main (16acc71) will increase coverage by 0.05%.
The diff coverage is 100.00%.

❗ Current head 3862570 differs from pull request most recent head c9cb1d2. Consider uploading reports for the commit c9cb1d2 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1622      +/-   ##
==========================================
+ Coverage   85.25%   85.31%   +0.05%     
==========================================
  Files          58       58              
  Lines        2483     2492       +9     
==========================================
+ Hits         2117     2126       +9     
  Misses        366      366              
Impacted Files            Coverage Δ
torchtext/transforms.py   96.19% <100.00%> (+0.19%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 16acc71...c9cb1d2. Read the comment docs.

Contributor

@ebsmothers ebsmothers left a comment

Looks great from my perspective. Thanks for the quick fix!

Comment on lines 327 to 330
:param encoder_json_path: Path to BPE encoder json file.
:type encoder_json_path: str
:param vocab_bpe_path: Path to bpe vocab file.
:type vocab_bpe_path: str
:param num_merges: Number of merges to read from the bpe merges file.
:type num_merges: int
Contributor

nit: indicate in the docstring that these params are optional

-        encoder_json_path: str,
-        vocab_bpe_path: str,
-    ):
+    def __init__(self, merges_path: str, encoder_json_path: Optional[str] = None, num_merges: Optional[int] = None):
Contributor

This is a breaking change. Is that OK?

Contributor

I think it's OK. This is still to be released in upcoming cycle.

self._seperator.join(merge_pair.split()): i for i, merge_pair in enumerate(bpe_vocab.split("\n")[1:-1])
}
# load bpe merges
with open(get_asset_local_path(merges_path), "r", encoding="utf-8") as f:
Contributor

This may be a bit too much for this PR, but for sake of starting the conversation... is it possible to move the files from the constructor into a classmethod? The constructor is doing a lot of work and if people have their own merges, there's no direct way they can construct this without first writing them to a file.

Contributor

@parmeet parmeet Feb 20, 2022

Sorry, I am not sure I follow completely. As per the current interface, the user would have to provide the file paths. Are you suggesting that the constructor should allow passing both file paths and a direct merges object?

Contributor

No, I'm suggesting that the constructor should not deal with files at all.
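
A minimal sketch of that pattern, with a hypothetical `ToyBPE` class standing in for CLIPTokenizer (this is not the torchtext API): the constructor takes an in-memory merge table, and a classmethod owns the file I/O.

```python
from typing import Dict, Optional


class ToyBPE:
    """Hypothetical stand-in for CLIPTokenizer, for illustration only."""

    def __init__(self, merge_ranks: Dict[str, int]):
        # Pure in-memory construction: callers with their own merges
        # never have to write them to disk first.
        self.merge_ranks = merge_ranks

    @classmethod
    def from_merges_file(cls, merges_path: str,
                         num_merges: Optional[int] = None) -> "ToyBPE":
        # All file handling lives here, not in __init__.
        with open(merges_path, "r", encoding="utf-8") as f:
            lines = f.read().split("\n")[1:-1]  # skip the version header
        if num_merges is not None:
            lines = lines[:num_merges]
        return cls({" ".join(pair.split()): i for i, pair in enumerate(lines)})
```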

Contributor Author

I believe this is a broader discussion, not specific to this PR. There has been discussion around this before, and I agree that in the future we should deal with file-like objects. I remember there were some concerns around this; maybe @parmeet remembers.

@erip Let's track this in a separate issue, so that we can standardize this across all our methods that depend on external files.

Contributor

Sounds good -- opened #1624

This tokenizer has been trained to treat spaces like parts of the tokens
(a bit like sentencepiece) so a word will be encoded differently whether it
is at the beginning of the sentence (without space) or not.
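
A toy illustration of this behavior (the regex below is a simplified stand-in for the real byte-level pre-tokenizer, not the actual CLIP pattern): splitting keeps each word's leading space attached, so the same word yields different pre-tokens depending on its position.

```python
import re


def pre_tokenize(text: str) -> list:
    # Simplified GPT-2/CLIP-style pre-tokenization: a word and its
    # leading space travel together as one pre-token.
    return re.findall(r" ?\w+|[^\w ]+", text)


print(pre_tokenize("world peace"))  # ['world', ' peace']
print(pre_tokenize("say world"))    # ['say', ' world'] -- 'world' vs ' world'
```

Because `'world'` and `' world'` are distinct pre-tokens, they generally map to different BPE token sequences.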
Contributor

@abhinavarora As we discussed earlier, tokenizer docstrings should provide information on the standard out-of-the-box artifacts that we host. Can you also update this one to include an example and provide the artifact paths?

Contributor Author

Added in the latest commit!

@abhinavarora
Contributor Author

The CircleCI error seems unrelated to this PR. Will merge this PR and debug later.

@abhinavarora abhinavarora merged commit 81212ba into pytorch:main Feb 22, 2022
abhinavarora added a commit to abhinavarora/text that referenced this pull request Feb 22, 2022
parmeet pushed a commit that referenced this pull request Feb 23, 2022


Successfully merging this pull request may close these issues.

Make the CLIPTokenizer's encoder_json_path variable optional, and use dict(zip(vocab, range(len(vocab)))) instead
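
The construction named in the issue title builds the token-to-index encoder directly from a vocabulary list; a one-line sketch (the sample tokens are placeholders, not the real CLIP vocabulary):

```python
vocab = ["<|startoftext|>", "hello</w>", "world</w>"]
encoder = dict(zip(vocab, range(len(vocab))))
print(encoder)  # {'<|startoftext|>': 0, 'hello</w>': 1, 'world</w>': 2}
```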

5 participants