
Make the CLIPTokenizer's encoder_json_path variable optional, and use dict(zip(vocab, range(len(vocab)))) instead #1612

@ProGamerGov

Description


🚀 Feature

https://github.com/pytorch/text/blob/main/torchtext/transforms.py#L312

In both CLIP and OpenCLIP, the encoder is simply the vocab run through dict(zip(vocab, range(len(vocab)))), so it doesn't make much sense to require an encoder.json file for this information. The encoder.json requirement should be optional, since the vocab file itself can be used to build the encoder, which makes the encoder file redundant.

https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py#L74
https://github.com/mlfoundations/open_clip/blob/main/src/clip/tokenizer.py#L78
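
For reference, here is a minimal sketch of how the encoder could be rebuilt from the merges/vocab file alone, following the linked OpenAI tokenizer. The names `build_encoder_from_vocab` and `bytes_to_unicode` are illustrative, not existing torchtext APIs, and the sketch assumes a gzip-compressed merges file like CLIP's `bpe_simple_vocab_16e6.txt.gz`:

```python
import gzip
from typing import Dict, List


def bytes_to_unicode() -> Dict[int, str]:
    # Maps every byte to a printable unicode character, as in the OpenAI
    # reference tokenizer; the resulting 256 symbols seed the vocab.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\xa1"), ord("\xac") + 1))
        + list(range(ord("\xae"), ord("\xff") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


def build_encoder_from_vocab(bpe_path: str) -> Dict[str, int]:
    # Rebuilds the encoder mapping directly from the BPE merges file,
    # mirroring the OpenAI/OpenCLIP tokenizers, so no encoder.json is needed.
    merges: List[str] = gzip.open(bpe_path).read().decode("utf-8").split("\n")
    # 49152 is CLIP's vocab size; drop the header line and the trailing entries,
    # exactly as the reference tokenizer does.
    merge_pairs = [tuple(m.split()) for m in merges[1 : 49152 - 256 - 2 + 1]]

    vocab = list(bytes_to_unicode().values())
    vocab = vocab + [v + "</w>" for v in vocab]
    for pair in merge_pairs:
        vocab.append("".join(pair))
    vocab.extend(["<|startoftext|>", "<|endoftext|>"])

    # The encoder is just the vocab enumerated in order.
    return dict(zip(vocab, range(len(vocab))))
```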

The current clip_encoder.json test asset is just the Python dict created by dict(zip(vocab, range(len(vocab)))); while it makes a useful test for specifying a custom encoder, it is otherwise redundant: https://github.com/pytorch/text/blob/main/test/asset/clip_encoder.json
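
As a quick sanity check (the vocab asset path below is hypothetical and reuses the sketch above), the committed JSON asset should equal the encoder rebuilt from the vocab file:

```python
import json

# Hypothetical check: the committed encoder asset should match the encoder
# derived from the merges/vocab file alone.
with open("test/asset/clip_encoder.json") as f:
    encoder_from_json = json.load(f)

assert encoder_from_json == build_encoder_from_vocab("test/asset/clip_vocab.bpe")
```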
