🐛 Bug
The IWSLT dataset's URL was updated some time in late 2020, as mentioned in #1091. When torchtext v0.9.0 updated the IWSLT datasets to use the new dataset URL on Google Drive (see #1115), the corresponding torchtext.legacy.datasets.IWSLT dataset was not updated to the new URL.
Consequently, using torchtext.legacy.datasets.IWSLT causes torchtext to download an HTML page with a 404 message, instead of the actual dataset. This leads to the error: OSError: Not a gzipped file.
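The `b'<!'` in the error message is the first two bytes of the downloaded file, which is consistent with an HTML page rather than a gzip archive. A minimal sketch for checking this by hand, using only the standard library, follows. The helper name `describe_download` and the example path are assumptions made for illustration and are not part of torchtext.

```python
GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def describe_download(path):
    """Classify a downloaded file by its leading bytes.

    Hypothetical helper for diagnosis only; not a torchtext API.
    Returns "gzip", "html", or "unknown".
    """
    with open(path, "rb") as fh:
        head = fh.read(16)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    # An HTML error page typically begins with "<!DOCTYPE" or "<html",
    # possibly after some leading whitespace.
    if head.lstrip().startswith(b"<"):
        return "html"
    return "unknown"


# Example call (the path is an assumption; use wherever the archive was saved):
# print(describe_download(".data/iwslt/de-en.tgz"))
```

If this reports "html" for the file that torchtext downloaded, the archive was never fetched and the failure is in the URL, not in the extraction step.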
To Reproduce
Code
from torchtext.legacy import data, datasets
f = data.Field()
datasets.IWSLT.splits(exts=('.de', '.en'), fields=(f, f))
Output
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
/usr/lib/python3.7/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
1645 try:
-> 1646 t = cls.taropen(name, mode, fileobj, **kwargs)
1647 except OSError:
------⬍ 12 frames------
OSError: Not a gzipped file (b'<!')
During handling of the above exception, another exception occurred:
ReadError Traceback (most recent call last)
/usr/lib/python3.7/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
1648 fileobj.close()
1649 if mode == 'r':
-> 1650 raise ReadError("not a gzip file")
1651 raise
1652 except:
ReadError: not a gzip file
Note that switching to an earlier version of torchtext (e.g., v0.9 or v0.8) doesn't help, because it does not resolve the underlying third-party URL issue.
Environment Info
- torchtext v0.10.0
- System: tested on Google Colab and local machine, details unimportant