
Make the CLIPTokenizer's encoder_json_path variable optional, and use dict(zip(vocab, range(len(vocab)))) instead #1612

@ProGamerGov

Description


🚀 Feature

https://github.com/pytorch/text/blob/main/torchtext/transforms.py#L312

In both CLIP and OpenCLIP, the encoder is simply the vocab run through dict(zip(vocab, range(len(vocab)))), so it doesn't make much sense to require an encoder.json file for this information. The encoder.json requirement should be optional, since the vocab file itself can be used to build the encoder, which makes the encoder file redundant.

https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py#L74
https://github.com/mlfoundations/open_clip/blob/main/src/clip/tokenizer.py#L78
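
For reference, here is a minimal sketch of how the encoder could be rebuilt from the merges/vocab file alone, following the linked OpenAI tokenizer. The names `build_encoder_from_vocab` and `bytes_to_unicode` are illustrative, not existing torchtext APIs, and the sketch assumes a gzip-compressed merges file like CLIP's `bpe_simple_vocab_16e6.txt.gz`:

```python
import gzip
from typing import Dict, List


def bytes_to_unicode() -> Dict[int, str]:
    # Maps every byte to a printable unicode character, as in the OpenAI
    # reference tokenizer; the resulting 256 symbols seed the vocab.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\xa1"), ord("\xac") + 1))
        + list(range(ord("\xae"), ord("\xff") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


def build_encoder_from_vocab(bpe_path: str) -> Dict[str, int]:
    # Rebuilds the encoder mapping directly from the BPE merges file,
    # mirroring the OpenAI/OpenCLIP tokenizers, so no encoder.json is needed.
    merges: List[str] = gzip.open(bpe_path).read().decode("utf-8").split("\n")
    # 49152 is CLIP's vocab size; drop the header line and the trailing entries,
    # exactly as the reference tokenizer does.
    merge_pairs = [tuple(m.split()) for m in merges[1 : 49152 - 256 - 2 + 1]]

    vocab = list(bytes_to_unicode().values())
    vocab = vocab + [v + "</w>" for v in vocab]
    for pair in merge_pairs:
        vocab.append("".join(pair))
    vocab.extend(["<|startoftext|>", "<|endoftext|>"])

    # The encoder is just the vocab enumerated in order.
    return dict(zip(vocab, range(len(vocab))))
```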

The current clip_encoder.json test asset is just the Python dict created by dict(zip(vocab, range(len(vocab)))); while it makes a useful test for specifying a custom encoder, it is otherwise redundant: https://github.com/pytorch/text/blob/main/test/asset/clip_encoder.json
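
As a quick sanity check (the vocab asset path below is hypothetical and reuses the sketch above), the committed JSON asset should equal the encoder rebuilt from the vocab file:

```python
import json

# Hypothetical check: the committed encoder asset should match the encoder
# derived from the merges/vocab file alone.
with open("test/asset/clip_encoder.json") as f:
    encoder_from_json = json.load(f)

assert encoder_from_json == build_encoder_from_vocab("test/asset/clip_vocab.bpe")
```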
