This repository was archived by the owner on Sep 10, 2025. It is now read-only.

One of the three datasets returned by Multi30k seems to be bugged. #2001

@raaaaaymond

🐛 Bug

Describe the bug

The testing data returned by Multi30k doesn't match the expected SHA256 hash. The precise error is:

RuntimeError: The computed hash 0681be16a532912288a91ddd573594fbdd57c0fbb81486eff7c55247e35326c2 of C:\Users\raaaa/.cache\torch\text\datasets\Multi30k\mmt16_task1_test.tar.gz does not match the expectedhash 6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36. Delete the file manually and retry.
This exception is thrown by __iter__ of HashCheckerIterDataPipe(hash_dict={'C:\\Users\\raaaa/.cache\\torch\\text\\datasets\\Multi30k\\mmt16_task1_test.tar.gz': '6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36'}, hash_type='sha256', rewind=True, source_datapipe=MapperIterDataPipe)

I did what the message suggested: I deleted the file manually and ran the script again, but the same error occurs.
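To rule out a corrupted local download independently of the built-in check, the archive's digest can be recomputed with the standard library. This is a minimal sketch: `sha256_of_file` is a hypothetical helper (not part of torchtext or torchdata), and the cache path and expected digest are copied from the error message above, so the path will differ on other machines.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumption: default torchtext cache location, as shown in the error message.
    archive = (
        Path.home() / ".cache" / "torch" / "text" / "datasets"
        / "Multi30k" / "mmt16_task1_test.tar.gz"
    )
    # Expected digest copied from the error message.
    expected = "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36"
    if archive.exists():
        actual = sha256_of_file(archive)
        print("actual:  ", actual)
        print("expected:", expected)
        print("match:   ", actual == expected)
    else:
        print("file not found:", archive)
```

If a freshly downloaded file yields the same "computed" digest every time, that would suggest the served archive differs from what the library expects, rather than a one-off corrupted download.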

To Reproduce

Paste the following into a new Python file and run it.

import torchtext

def _main():
    train, val, test = torchtext.datasets.Multi30k(language_pair=("de", "en"))
    # The following works fine because `val` and `train` datasets are fine.
    # for thing in val:
    #     print(thing)
    #     break
    # Invoking the generator (which is `test`) in the following way triggers the error.
    for thing in test:
        print(thing)
        break


if __name__ == "__main__":
    _main()

You should see the error I pasted above.
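To check each split without commenting code in and out, the first item of every split can be probed separately. This is a sketch only: `probe_split` and `probe_multi30k` are hypothetical helpers, and it assumes the failure is raised during iteration, as in the error above.

```python
def probe_split(name, split):
    """Try to pull the first item from `split`; return (name, ok, detail)."""
    try:
        first = next(iter(split))
    except StopIteration:
        # The split is iterable but yields nothing.
        return name, True, "empty"
    except Exception as exc:
        # Report whatever was raised while iterating (e.g. the hash mismatch).
        return name, False, f"{type(exc).__name__}: {exc}"
    return name, True, repr(first)


def probe_multi30k():
    """Probe the three Multi30k splits, using the same call as the script above."""
    import torchtext  # imported here so probe_split stays usable without torchtext

    train, val, test = torchtext.datasets.Multi30k(language_pair=("de", "en"))
    for name, split in (("train", train), ("val", val), ("test", test)):
        print(probe_split(name, split))
```

Calling `probe_multi30k()` prints one tuple per split. Based on the report above, only the `test` entry would be expected to show a failure.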

Expected behavior

I expect no error: iterating over `test` should work the same way it does for `train` and `val`.

Environment

PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Pro
GCC version: (x86_64-posix-seh, Built by strawberryperl.com project) 8.3.0
Clang version: Could not collect
CMake version: version 3.20.2
Libc version: N/A

Python version: 3.10.1 (tags/v3.10.1:2cd268a, Dec 6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22000-SP0
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080
Nvidia driver version: 526.86
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy==0.950
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.4
[pip3] torch==1.13.0+cu117
[pip3] torchaudio==0.13.0+cu117
[pip3] torchdata==0.5.0
[pip3] torchtext==0.14.0
[pip3] torchvision==0.14.0+cu117
[conda] Could not collect