
Description
🚀 Feature
Link to the docs
I believe it would be beneficial to be able to cap the number of words in the vocabulary via an argument such as max_words, e.g.:
vocab = build_vocab_from_iterator(yield_tokens_batch(file_path), specials=["<unk>"], max_words=50000)
Motivation
This allows a controllably sized nn.Embedding, with rare words mapped to <unk>. Without such a cap, build_vocab_from_iterator is impractical for larger datasets.
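In the meantime, a minimal workaround sketch along these lines is possible with the existing torchtext.vocab.vocab factory: count token frequencies, keep only the most frequent tokens, and map everything else to <unk> via a default index. The names build_capped_vocab, max_words, and yield_tokens_batch are assumptions taken from the example above, not part of the torchtext API.

```python
# Workaround sketch (not the proposed API): cap vocabulary size by hand.
from collections import Counter, OrderedDict
from torchtext.vocab import vocab

def build_capped_vocab(token_iterator, max_words=50000, unk_token="<unk>"):
    # Count token frequencies over the whole iterator.
    counter = Counter()
    for tokens in token_iterator:  # each item is assumed to be a list of tokens
        counter.update(tokens)
    # Keep only the max_words most frequent tokens.
    most_common = OrderedDict(counter.most_common(max_words))
    v = vocab(most_common)
    v.insert_token(unk_token, 0)        # reserve index 0 for <unk>
    v.set_default_index(v[unk_token])   # rare/unseen words fall back to <unk>
    return v

# Usage, assuming yield_tokens_batch yields lists of tokens:
# vocab_obj = build_capped_vocab(yield_tokens_batch(file_path), max_words=50000)
```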
Alternatives
Keras and Hugging Face tokenizers would be viable alternatives, but they do not integrate nicely with the torchtext ecosystem.