This repository was archived by the owner on Sep 10, 2025. It is now read-only.
Can't download IWSLT dataset to Google Colab #1098
Closed
Description
I am experimenting with an implementation of the "Attention is All You Need" paper.
This is the implementation.
I am running it on Google Colab so that I can use the GPU.
However, the code that downloads the dataset through torchtext fails with an error.
This is the code I used:
from torchtext import data, datasets
import spacy

spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

BOS_WORD = '<s>'
EOS_WORD = '</s>'
BLANK_WORD = "<blank>"
SRC = data.Field(tokenize=tokenize_de, pad_token=BLANK_WORD)
TGT = data.Field(tokenize=tokenize_en, init_token=BOS_WORD,
                 eos_token=EOS_WORD, pad_token=BLANK_WORD)

MAX_LEN = 100
train, val, test = datasets.IWSLT.splits(
    exts=('.de', '.en'), fields=(SRC, TGT),
    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
        len(vars(x)['trg']) <= MAX_LEN)

MIN_FREQ = 2
SRC.build_vocab(train.src, min_freq=MIN_FREQ)
TGT.build_vocab(train.trg, min_freq=MIN_FREQ)
And this is the error I got:
OSError                                   Traceback (most recent call last)
/usr/lib/python3.6/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1644         try:
-> 1645             t = cls.taropen(name, mode, fileobj, **kwargs)
   1646         except OSError:

12 frames

OSError: Not a gzipped file (b'<!')

During handling of the above exception, another exception occurred:

ReadError                                 Traceback (most recent call last)
/usr/lib/python3.6/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1647             fileobj.close()
   1648             if mode == 'r':
-> 1649                 raise ReadError("not a gzip file")
   1650             raise
   1651         except:

ReadError: not a gzip file
Is this a known issue? Or am I missing something?
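One detail that may help narrow this down: the first OSError reports that the file starts with b'<!', which looks like the beginning of an HTML page rather than gzip data (gzip files start with the bytes 0x1f 0x8b). Below is a minimal sketch for checking what was actually saved to disk. The helper name describe_download and the example path are hypothetical; the real location depends on where torchtext stored the archive.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def describe_download(path):
    """Classify a downloaded file as 'gzip', 'html', or 'unknown'
    by inspecting its first bytes. Hypothetical helper, not part of torchtext."""
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    # An HTML error or redirect page typically begins with '<!DOCTYPE' or '<html'.
    if head.lstrip().lower().startswith((b"<!", b"<html")):
        return "html"
    return "unknown"


# Example usage (the path is an assumption; adjust it to the actual archive):
# print(describe_download(".data/iwslt/de-en/de-en.tgz"))
```

If this reports "html", the download step received a web page instead of the archive, so the problem would be with the download source and not with tarfile or the Colab environment.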