IMDB with default root only loads half the data #2041
🐛 Bug
Description
If I load the IMDb data via torchtext.datasets.IMDB() without a root argument, the docs say the data should be stored in os.path.expanduser('~/.torchtext/cache'), which on my machine is 'C:\\Users\\USERNAME/.torchtext/cache'. However, I cannot find the data there, and (the bigger problem) only half of the data is loaded: the half labelled 1. If I explicitly set root='C:/Users/USERNAME/.torchtext/cache', everything works fine. This might be caused by a path-handling problem, but the torchtext.datasets.IMDB source is a little cryptic for me to read...
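To make the path question concrete, here is a minimal sketch of how the documented default resolves and what the mixed-separator form normalizes to under Windows rules. The helper names default_cache_root and list_cache are hypothetical, not part of torchtext, and this only inspects the path; it does not confirm that separators are the actual cause.

```python
import ntpath
import os


def default_cache_root():
    # Documented default root: os.path.expanduser('~/.torchtext/cache').
    return os.path.expanduser("~/.torchtext/cache")


def list_cache(root):
    # Hypothetical helper: return the sorted directory entries under root,
    # or an empty list if the directory does not exist.
    root = os.path.normpath(root)
    if not os.path.isdir(root):
        return []
    return sorted(os.listdir(root))


# Mixed separators as reported above; ntpath applies Windows path rules
# on any platform, so this line behaves the same everywhere.
mixed = "C:\\Users\\USERNAME/.torchtext/cache"
print(ntpath.normpath(mixed))  # C:\Users\USERNAME\.torchtext\cache

# Show where the default root points on this machine and what it contains.
root = default_cache_root()
print(root, list_cache(root))
```

An empty listing for the default root would match the observation that the data cannot be found there.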
To Reproduce
Run:
from torchtext.datasets import IMDB
import numpy as np
trn_rawpipe = IMDB(split="train")
targets = []
for x in trn_rawpipe:
    targets.append(x[0])
print(len(targets))
>> 12500
print(np.unique(targets, return_counts=True))
>> (array([1]), array([12500], dtype=int64))
trn_rawpipe = IMDB("C:/Users/USERNAME/.torchtext/cache", split="train")
targets = []
for x in trn_rawpipe:
    targets.append(x[0])
print(len(targets))
>> 25000
print(np.unique(targets, return_counts=True))
>> (array([1, 2]), array([12500, 12500], dtype=int64))
Expected Behaviour
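Passing the default cache path explicitly should not make any difference: both calls should yield all 25000 training samples with both labels.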
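The check above can be written more compactly without NumPy. This is a minimal sketch that assumes each sample is a (label, text) pair with labels 1 and 2, as in the output above; label_counts and is_balanced are hypothetical helper names, and the stand-in list only illustrates the call.

```python
from collections import Counter


def label_counts(samples):
    # Count the first element (the label) of each (label, text) pair.
    return Counter(label for label, _text in samples)


def is_balanced(counts, expected_labels=(1, 2)):
    # True if every expected label is present and all counts are equal.
    values = [counts.get(label, 0) for label in expected_labels]
    return all(values) and len(set(values)) == 1


# Stand-in data; with torchtext installed, pass IMDB(split="train") instead.
fake = [(1, "bad"), (2, "good"), (1, "awful"), (2, "great")]
counts = label_counts(fake)
print(counts, is_balanced(counts))
```

With the default root, the behaviour reported here would make is_balanced return False, because label 2 is missing.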
Environment
- PyTorch Version: 1.13.1
- OS (e.g., Linux): Windows 11 (10.0.22621 Build 22621)
- How you installed PyTorch: conda
- Python version: 3.10.8
- CUDA/cuDNN version: 11.7
- GPU models and configuration: RTX 2080 Super
- Any other relevant information: I installed all packages following the instructions suitable for my setup. The other torch-related packages are the following:
# Name          Version             Build                       Channel
pytorch         1.13.1              py3.10_cuda11.7_cudnn8_0    pytorch
pytorch-cuda    11.7                h67b0de4_1                  pytorch
pytorch-mutex   1.0                 cuda                        pytorch
torch-scatter   2.1.0+pt113cu117    pypi_0                      pypi
torch-sparse    0.6.16+pt113cu117   pypi_0                      pypi
torchaudio      0.13.1              pypi_0                      pypi
torchdata       0.5.1               pyh2db4395_0                conda-forge
torchsummary    1.5.1               pypi_0                      pypi
torchtext       0.14.1              py310                       pytorch
torchvision     0.14.1              pypi_0                      pypi
Running the loader with the default root in this notebook from the tutorial on Colab works, even though the package versions are largely the same. I appreciate any help, keep up the good work!
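To compare the local and Colab environments exactly rather than by eye, the same version query could be run in both places. This is a sketch using the standard-library importlib.metadata; installed_versions is a hypothetical helper, and it assumes the conda packages ship pip-style metadata (the conda package pytorch is normally registered as the distribution torch).

```python
from importlib import metadata


def installed_versions(names):
    # Map each distribution name to its installed version, or None if absent.
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per package so the two environments can be diffed.
for name, version in installed_versions(["torch", "torchtext", "torchdata"]).items():
    print(f"{name}: {version}")
```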
Edit: Fixed a typo and unified both approaches.