
Description
🚀 Feature
Link to the docs
I believe it would be beneficial to be able to cap the number of words in the vocabulary via an argument such as max_words, e.g.:
vocab = build_vocab_from_iterator(yield_tokens_batch(file_path), specials=["<unk>"], max_words=50000)
Motivation
This allows a controllably sized nn.Embedding, with rare words mapped to <unk>. Without such a cap, build_vocab_from_iterator is impractical for larger datasets.
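In the meantime, a minimal workaround sketch along these lines is possible with the existing torchtext.vocab.vocab factory: count token frequencies, keep only the most frequent tokens, and map everything else to <unk> via a default index. The names build_capped_vocab, max_words, and yield_tokens_batch are assumptions taken from the example above, not part of the torchtext API.

```python
# Workaround sketch (not the proposed API): cap vocabulary size by hand.
from collections import Counter, OrderedDict
from torchtext.vocab import vocab

def build_capped_vocab(token_iterator, max_words=50000, unk_token="<unk>"):
    # Count token frequencies over the whole iterator.
    counter = Counter()
    for tokens in token_iterator:  # each item is assumed to be a list of tokens
        counter.update(tokens)
    # Keep only the max_words most frequent tokens.
    most_common = OrderedDict(counter.most_common(max_words))
    v = vocab(most_common)
    v.insert_token(unk_token, 0)        # reserve index 0 for <unk>
    v.set_default_index(v[unk_token])   # rare/unseen words fall back to <unk>
    return v

# Usage, assuming yield_tokens_batch yields lists of tokens:
# vocab_obj = build_capped_vocab(yield_tokens_batch(file_path), max_words=50000)
```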
Alternatives
Keras and Hugging Face tokenizers would be viable alternatives, but they do not integrate nicely with the torchtext ecosystem.