This repository was archived by the owner on Sep 10, 2025. It is now read-only.

IMDB with default root only loads half the data #2041

@JannisZeller

🐛 Bug

Description

If I load the IMDb data via torchtext.datasets.IMDB() without a root argument, the docs say the data should be downloaded to os.path.expanduser('~/.torchtext/cache'), which on my machine is 'C:\\Users\\USERNAME/.torchtext/cache'. However, I cannot find the data there, and (the bigger problem) only half of the data is loaded (the half labelled 1). If I explicitly set root='C:/Users/USERNAME/.torchtext/cache', everything works fine. This might be caused by a path problem, but the torchtext.datasets.IMDB source is a little cryptic for me to read...
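To illustrate the path difference described above, here is a minimal sketch using only the standard library. It assumes the two root strings quoted in this report and uses ntpath (the Windows flavour of os.path) so it behaves the same on any OS; the helper name normalize_windows_path is made up for illustration. It does not diagnose the bug; it only shows that both strings point to the same directory once the separators are normalized.

```python
import ntpath  # Windows path semantics, importable on every platform

# The expanded default root from the docs (mixed separators) and the
# explicitly passed root (forward slashes only), as quoted in this report.
default_root = "C:\\Users\\USERNAME/.torchtext/cache"
explicit_root = "C:/Users/USERNAME/.torchtext/cache"


def normalize_windows_path(path):
    """Hypothetical helper: collapse mixed separators into backslashes."""
    return ntpath.normpath(path)


print(normalize_windows_path(default_root))
# True: both roots name the same directory after normalization
print(normalize_windows_path(default_root) == normalize_windows_path(explicit_root))
```

If the mixed separators are indeed the trigger, passing an already normalized root might be a workaround, but that is an assumption and not something confirmed here.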

To Reproduce

Run:

from torchtext.datasets import IMDB
import numpy as np

# Default root: only the label-1 half is returned.
trn_rawpipe = IMDB(split="train")
targets = []
for x in trn_rawpipe:
    targets.append(x[0])

print(len(targets))
# >> 12500

print(np.unique(targets, return_counts=True))
# >> (array([1]), array([12500], dtype=int64))

# Explicit root: all 25000 samples with both labels are returned.
trn_rawpipe = IMDB("C:/Users/USERNAME/.torchtext/cache", split="train")
targets = []
for x in trn_rawpipe:
    targets.append(x[0])

print(len(targets))
# >> 25000

print(np.unique(targets, return_counts=True))
# >> (array([1, 2]), array([12500, 12500], dtype=int64))
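A more compact way to run the same comparison, sketched without numpy: the helper label_counts works on any iterable of (label, text) pairs, and imdb_train_label_counts is a hypothetical wrapper that only passes root when one is given. Both names are made up for illustration, and torchtext is imported lazily so the counting helper can be tried without it.

```python
from collections import Counter


def label_counts(samples):
    """Count the labels in an iterable of (label, text) pairs."""
    return Counter(label for label, _text in samples)


def imdb_train_label_counts(root=None):
    """Hypothetical wrapper: label counts of the IMDB train split.

    When root is None the library default is used; otherwise the given
    root is passed through, as in the second half of the reproduction.
    """
    # Imported here so label_counts stays usable without torchtext installed.
    from torchtext.datasets import IMDB

    kwargs = {"split": "train"}
    if root is not None:
        kwargs["root"] = root
    return label_counts(IMDB(**kwargs))
```

Calling imdb_train_label_counts() and imdb_train_label_counts("C:/Users/USERNAME/.torchtext/cache") and comparing the two results should show the discrepancy directly on an affected setup.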

Expected Behaviour

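It should not make any difference whether root is left at its default or set explicitly to the same directory: both calls should return all 25000 training samples with both labels.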

Environment

  • PyTorch Version: 1.13.1
  • OS (e.g., Linux): Windows 11 (10.0.22621 Build 22621)
  • How you installed PyTorch: conda
  • Python version: 3.10.8
  • CUDA/cuDNN version: 11.7
  • GPU models and configuration: RTX 2080 Super
  • Any other relevant information: I installed all packages following the instructions suitable for my setup. The other torch-related packages are the following:
# Name                    Version                   Build                     Channel
pytorch                   1.13.1                    py3.10_cuda11.7_cudnn8_0  pytorch
pytorch-cuda              11.7                      h67b0de4_1                pytorch
pytorch-mutex             1.0                       cuda                      pytorch
torch-scatter             2.1.0+pt113cu117          pypi_0                    pypi
torch-sparse              0.6.16+pt113cu117         pypi_0                    pypi
torchaudio                0.13.1                    pypi_0                    pypi
torchdata                 0.5.1                     pyh2db4395_0              conda-forge
torchsummary              1.5.1                     pypi_0                    pypi
torchtext                 0.14.1                    py310                     pytorch
torchvision               0.14.1                    pypi_0                    pypi

Running the loader with the default root in this notebook from the tutorial in Colab works, even though the package versions are mostly the same. I appreciate any help; keep up the good work!

Edit: Fixed a typo and unified both approaches.
