This repository was archived by the owner on Sep 10, 2025. It is now read-only.

torchtext.legacy.datasets.IWSLT is unusable due to outdated URL #1357

@chrisyeh96

🐛 Bug

The IWSLT dataset's URL was updated some time in late 2020, as mentioned in #1091. When torchtext v0.9.0 updated the IWSLT datasets to use the new dataset URL on Google Drive (see #1115), the corresponding torchtext.legacy.datasets.IWSLT dataset was not updated to the new URL.

Consequently, using torchtext.legacy.datasets.IWSLT causes torchtext to download an HTML page with a 404 message, instead of the actual dataset. This leads to the error: OSError: Not a gzipped file.
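The failure mode described above can be checked without torchtext: a gzip stream always begins with the two magic bytes `0x1f 0x8b`, whereas the downloaded HTML page begins with `<!`. The following is a minimal sketch, not part of torchtext; the helper names `looks_like_gzip` and `demo` and the file names are hypothetical, and the HTML content is a stand-in for the real error page.

```python
import gzip
import os
import tarfile
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def demo():
    """Write a stand-in HTML error page and a real gzip file, then inspect both."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name: an HTML page saved under an archive extension,
        # which is what a stale download URL can produce.
        html_path = os.path.join(tmp, "fake_dataset.tgz")
        with open(html_path, "wb") as fh:
            fh.write(b"<!DOCTYPE html><html><body>404</body></html>")

        # A genuine gzip file for comparison.
        gz_path = os.path.join(tmp, "real.gz")
        with gzip.open(gz_path, "wb") as fh:
            fh.write(b"hello")

        # Opening the HTML page as a gzipped tar archive fails with ReadError.
        try:
            tarfile.open(html_path, "r:gz")
            tar_error = None
        except tarfile.ReadError as exc:
            tar_error = str(exc)

        return looks_like_gzip(html_path), looks_like_gzip(gz_path), tar_error


if __name__ == "__main__":
    print(demo())
```

Running a check like `looks_like_gzip` on the cached download before extraction would distinguish a stale-URL failure from a genuinely corrupt archive.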

To Reproduce

Code

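from torchtext.legacy import data, datasets

# A single default Field is reused for both the source and target language.
f = data.Field()
# Triggers the dataset download, which fails as shown in the output below.
datasets.IWSLT.splits(exts=('.de', '.en'), fields=(f, f))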

Output

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
/usr/lib/python3.7/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1645         try:
-> 1646             t = cls.taropen(name, mode, fileobj, **kwargs)
   1647         except OSError:

------⬍ 12 frames------
OSError: Not a gzipped file (b'<!')

During handling of the above exception, another exception occurred:

ReadError                                 Traceback (most recent call last)
/usr/lib/python3.7/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1648             fileobj.close()
   1649             if mode == 'r':
-> 1650                 raise ReadError("not a gzip file")
   1651             raise
   1652         except:

ReadError: not a gzip file

Note that switching to an earlier version of torchtext (e.g., v0.9 or v0.8) doesn't help, because that does not resolve the underlying third-party URL issue.

Environment Info

  • torchtext v0.10.0
  • System: tested on Google Colab and local machine, details unimportant
