This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Problem with StopIteration on dataset when creating vocabulary #1447

@00krishna

❓ Questions and Help

Description

Hey folks, I was hoping someone could tell me a better way to deal with this issue. I am getting a StopIteration error on the dataset, and I am not clear on how to get around it. Below is a minimal example that reproduces the error. I am using torchtext 0.10.0.

In the real code, I pull the AG_NEWS dataset into the train_iter variable, build a vocabulary from train_iter, and then try to process batches of that same dataset using a DataLoader with a collate function.

The problem seems to be that I iterate through train_iter once in order to build the vocabulary with the yield_tokens function. When I then call next(iter(train_iter)), the iterator has already reached its end. Is there a way to copy train_iter so that I can build the vocabulary from the copy? I can probably write some hacky code to work around this, but I wanted to see if there is a better or more appropriate way.

from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer

from typing import Optional, Tuple

import torchtext
import torch
from torchtext.vocab import Vocab, build_vocab_from_iterator
import numpy as np


def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

tokenizer = get_tokenizer('basic_english')
train_iter, test_iter = AG_NEWS()

vocab = build_vocab_from_iterator(yield_tokens(train_iter),
                                  specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

print(next(iter(train_iter)))

The error message generated is:

Exception has occurred: StopIteration
exception: no description

    print(next(iter(train_iter)))
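
For reference, here is a minimal sketch of the behavior and two possible workarounds, written in plain Python so it runs without downloading anything. The fake_ag_news function and its three samples are hypothetical stand-ins for AG_NEWS(split='train'), and a simple lower().split() replaces the basic_english tokenizer. The sketch assumes train_iter behaves like any one-shot Python iterator, which matches the error above.

```python
def fake_ag_news():
    """Hypothetical stand-in for AG_NEWS(split='train').

    Returns a one-shot iterator of (label, text) pairs.
    """
    samples = [
        (3, "Wall St. Bears Claw Back"),
        (4, "Scientists find new planet"),
        (2, "Team wins final"),
    ]
    return iter(samples)


def yield_tokens(data_iter):
    # Same shape as the function in the issue, with a simplified tokenizer.
    for _, text in data_iter:
        yield text.lower().split()


# Reproduce the problem: the first full pass consumes the iterator.
train_iter = fake_ag_news()
tokens_first_pass = list(yield_tokens(train_iter))
# Passing a default to next() avoids StopIteration; None means "exhausted".
exhausted = next(iter(train_iter), None)

# Workaround A: build a fresh iterator after the vocabulary pass.
train_iter = fake_ag_news()
first_sample_a = next(iter(train_iter))

# Workaround B: materialize the data once and reuse the list.
# A list can be iterated any number of times, at the cost of memory.
train_list = list(fake_ag_news())
vocab_tokens = list(yield_tokens(train_list))
first_sample_b = next(iter(train_list))

print(exhausted)
print(first_sample_a)
print(first_sample_b)
```

With the real dataset, workaround A would presumably mean calling AG_NEWS() again after build_vocab_from_iterator. Workaround B may correspond to wrapping the iterator with torchtext.data.functional.to_map_style_dataset. Both are assumptions worth checking against the installed torchtext version.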
