tweaks code to support new url for IWSLT dataset #1115

garyhlai · 2020-12-29T17:09:52Z

No description provided.

facebook-github-bot · 2020-12-29T17:09:56Z

Hi @ghlai9665!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

zhangguanheng66 · 2020-12-29T17:34:58Z

You can use "flake8 translation.py" for lint check

zhangguanheng66

Do we need to update the MD5 hash because of a new file?

zhangguanheng66 · 2020-12-29T17:36:44Z

torchtext/experimental/datasets/raw/translation.py

    elif isinstance(URLS[dataset_name], str):
-        dataset_tar = download_from_url(URLS[dataset_name], root=root, hash_value=MD5[dataset_name], hash_type='md5')
-        extracted_files.extend(extract_archive(dataset_tar))
+        dataset_tar = download_from_url(URLS[dataset_name])


Do we need to check the hash here?

Running
dataset_tar = download_from_url(URLS[dataset_name], root=root, hash_value=MD5[dataset_name], hash_type='md5')

gives me this error:

RuntimeError: The hash of .data/2016-01.tgz does not match. Delete the file manually and retry.

That's why I changed it to dataset_tar = download_from_url(URLS[dataset_name])

Why is checking the hash important and what problems could the current change introduce?

Thanks!

Oh, is the purpose of the MD5 hash to stop the download if the hash shows that the file we're downloading is not what we expected?

Just made the changes!
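
For context on the hash question above: the check makes the download fail loudly when the file on the server is not the one the code expects (corrupted, truncated, or replaced upstream). A minimal sketch of such a check, using only the standard library, might look like the following. The helper names `file_md5` and `verify_md5` are hypothetical and are not torchtext's implementation of `download_from_url`.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_md5):
    """Raise RuntimeError if the file's MD5 does not match `expected_md5`."""
    actual = file_md5(path)
    if actual != expected_md5.lower():
        raise RuntimeError(
            "The hash of {} does not match: expected {}, got {}.".format(
                path, expected_md5, actual))
    return path
```

If the upstream archive legitimately changes, as it did with the new IWSLT URL, the stored hash has to be updated; dropping the check would hide the mismatch instead.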

zhangguanheng66 · 2020-12-29T17:41:47Z

For the failed CI test, I think you need to install the de package for spaCy first. See the example here.

zhangguanheng66 · 2020-12-29T17:47:08Z

I think you can add de to the CircleCI setup environment files (link1 and link2). You can still use the customized tokenizers in the test.

garyhlai · 2020-12-29T17:56:49Z

Is switching back to the default tokenizer OK? It seems to work ok for the test.

zhangguanheng66 · 2020-12-29T18:01:04Z

Sure. It's fine for me.

garyhlai · 2020-12-29T18:09:46Z

I ran flake8 translation.py but it didn't give me any errors.

Running flake8 --version gave me

3.7.9 (mccabe: 0.6.1, pycodestyle: 2.5.0, pyflakes: 2.1.1) CPython 3.7.9 on Darwin

Working on the errors given by flake8 test_builtin_datasets.py though. I think the style errors are coming from there.

codecov · 2020-12-29T18:43:41Z

Codecov Report

Merging #1115 (75ac292) into master (adc489b) will increase coverage by 1.35%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1115      +/-   ##
==========================================
+ Coverage   77.54%   78.89%   +1.35%     
==========================================
  Files          45       45              
  Lines        3086     3090       +4     
==========================================
+ Hits         2393     2438      +45     
+ Misses        693      652      -41

Impacted Files	Coverage Δ
torchtext/experimental/datasets/raw/translation.py	`91.57% <100.00%> (+26.74%)`	⬆️
torchtext/experimental/datasets/translation.py	`83.33% <0.00%> (+1.38%)`	⬆️
torchtext/utils.py	`89.54% <0.00%> (+10.45%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

zhangguanheng66 · 2020-12-29T20:02:59Z

Yep, I was just giving an example. LGTM now.

zhangguanheng66

Thanks for the contribution. LGTM.

maroxtn · 2021-01-13T17:21:45Z

I still get the same error when I try to download the dataset?

zhangguanheng66 · 2021-01-13T18:45:30Z

The fix went to the nightly release branch. From the CI test added in the PR, it looks good now.

maroxtn · 2021-01-14T11:25:21Z

@zhangguanheng66 Sorry for the beginner question. I tried to install the nightly version with this command:

!pip install --pre torch torchtext -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html

Then I imported torchtext as I normally would, and I got the same error.
Could you tell me what I'm doing wrong?

zhangguanheng66 · 2021-01-14T14:18:16Z

Could you double check that you are using what you installed? For example

python -c "import torchtext; print(torchtext.__file__)"

and make sure it points to your nightly release. Otherwise, you have to uninstall the old package and then install again.

maroxtn · 2021-01-14T15:24:18Z

I did the following

$ pip uninstall torchtext -y
$ pip install --pre torch torchtext -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
$ python -c "import torchtext; print(torchtext.__file__)"
/opt/conda/lib/python3.7/site-packages/torchtext/__init__.py

Then I imported torchtext, and I still got the same error. Am I doing something wrong?

maroxtn · 2021-01-15T21:01:29Z

Can you confirm that I am doing everything right, @zhangguanheng66?

zhangguanheng66 · 2021-01-15T22:15:01Z

Could you delete all the torchtext packages in the /opt/conda/lib/python3.7/site-packages/ folder and reinstall the package? Make sure you don't have the old package. Can you also check the version?

python -c "import torchtext; print(torchtext.__version__)"
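
The two checks suggested in this thread (which file is imported, and which version it reports) can be combined into one small helper. This is only a sketch; `describe_install` is a hypothetical name, and it relies on the package exposing a `__version__` attribute, as torchtext does in the commands above.

```python
import importlib
import importlib.util


def describe_install(package_name):
    """Return (location, version) for an importable package.

    `location` is the file the interpreter would import; `version` is the
    package's `__version__` attribute, or None if it does not define one.
    """
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        raise ImportError(
            "{} is not installed in this environment".format(package_name))
    module = importlib.import_module(package_name)
    return spec.origin, getattr(module, "__version__", None)
```

Calling `describe_install("torchtext")` inside the notebook or interpreter that raises the error shows whether that environment really picks up the nightly build.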

maroxtn · 2021-01-15T22:35:57Z

@zhangguanheng66 I deleted the files with !rm -rf /opt/conda/lib/python3.7/site-packages/torchtext, uninstalled the package using pip, and then reinstalled the nightly version. This is the version I got:

 0.9.0.dev20210115

zhangguanheng66 · 2021-01-16T18:25:39Z

Yup. This is the correct version.

maroxtn · 2021-01-16T18:30:21Z

But it still yields the same error. Do you experience the same issue?

zhangguanheng66 · 2021-01-16T18:47:00Z

Can you send me a code snippet and copy/paste the error here?

maroxtn · 2021-01-16T19:09:43Z

This notebook reproduces the error: https://colab.research.google.com/drive/1kGqGEkWBFxY7dtU-0xeDzQiKj3zg1y88?usp=sharing

zhangguanheng66 · 2021-01-16T19:14:57Z

You should switch to the IWSLT dataset in the experimental folder. The one in the root folder is no longer maintained, and we will retire it as legacy code very soon.

maroxtn · 2021-01-16T19:24:21Z

Thanks for the response, but could you show me how to use the datasets in the experimental folder?

zhangguanheng66 · 2021-01-16T20:45:06Z

Take a look at this example link.

imflash217 · 2021-03-19T18:19:35Z

I am getting this error while executing the example. What am I doing wrong here?

---------------------------------------------------------------------------
ReadError                                 Traceback (most recent call last)
<ipython-input-8-639784b62f93> in <module>
----> 1 train_dataset, valid_dataset, test_dataset = IWSLT(tokenizer=(src_tokenizer,tgt_tokenizer))

~/anaconda3/envs/aogtr/lib/python3.8/site-packages/torchtext/experimental/datasets/translation.py in IWSLT(train_filenames, valid_filenames, test_filenames, tokenizer, root, vocab, data_select, removed_tokens)
    428     """
    429 
--> 430     return _setup_datasets("IWSLT",
    431                            train_filenames=train_filenames,
    432                            valid_filenames=valid_filenames,

~/anaconda3/envs/aogtr/lib/python3.8/site-packages/torchtext/experimental/datasets/translation.py in _setup_datasets(dataset_name, train_filenames, valid_filenames, test_filenames, data_select, root, vocab, tokenizer, removed_tokens)
     38             "tokenizer must be an instance of tuple with length two"
     39             "or None")
---> 40     train, val, test = DATASETS[dataset_name](train_filenames=train_filenames,
     41                                               valid_filenames=valid_filenames,
     42                                               test_filenames=test_filenames,

~/anaconda3/envs/aogtr/lib/python3.8/site-packages/torchtext/experimental/datasets/raw/translation.py in IWSLT(train_filenames, valid_filenames, test_filenames, root)
    451     URLS["IWSLT"] = URLS["IWSLT"].format(src_language, tgt_language, languages)
    452 
--> 453     return _setup_datasets(
    454         "IWSLT",
    455         train_filenames=train_filenames,

~/anaconda3/envs/aogtr/lib/python3.8/site-packages/torchtext/experimental/datasets/raw/translation.py in _setup_datasets(dataset_name, train_filenames, valid_filenames, test_filenames, root)
    137     elif isinstance(URLS[dataset_name], str):
    138         dataset_tar = download_from_url(URLS[dataset_name], root=root)
--> 139         extracted_files.extend(extract_archive(dataset_tar))
    140     else:
    141         raise ValueError(

~/anaconda3/envs/aogtr/lib/python3.8/site-packages/torchtext/utils.py in extract_archive(from_path, to_path, overwrite)
    189     if from_path.endswith(('.tar.gz', '.tgz')):
    190         logging.info('Opening tar file {}.'.format(from_path))
--> 191         with tarfile.open(from_path, 'r') as tar:
    192             files = []
    193             for file_ in tar:

~/anaconda3/envs/aogtr/lib/python3.8/tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1604                         fileobj.seek(saved_pos)
   1605                     continue
-> 1606             raise ReadError("file could not be opened successfully")
   1607 
   1608         elif ":" in mode:

ReadError: file could not be opened successfully

parmeet · 2021-03-19T20:29:31Z

Thank you for raising this issue. Unfortunately, the example code is broken (I will work on a PR to fix it). We have split IWSLT into two separate datasets, 'IWSLT2016' and 'IWSLT2017'. Could you try the following code snippet:

from torchtext.experimental.datasets import IWSLT2016
from torchtext.data.utils import get_tokenizer

# The spaCy tokenizer needs spaCy and the de_core_news_sm model to be installed.
src_tokenizer = get_tokenizer("spacy", language='de_core_news_sm')
tgt_tokenizer = get_tokenizer("basic_english")
train_dataset, valid_dataset, test_dataset = IWSLT2016(
    tokenizer=(src_tokenizer, tgt_tokenizer))
src_vocab, tgt_vocab = train_dataset.get_vocab()
src_data, tgt_data = train_dataset[10]
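
As a diagnostic aid for the `ReadError` in the traceback above: `tarfile` raises it when the downloaded file is not a readable archive. One possible cause, stated here as an assumption and not confirmed in this thread, is that a stale URL returned something other than the archive. A minimal sketch for inspecting the cached download follows; `diagnose_download` is a hypothetical helper, not part of torchtext.

```python
import tarfile


def diagnose_download(path, preview_bytes=200):
    """Report whether `path` is a readable tar archive; if not, show its first bytes."""
    if tarfile.is_tarfile(path):
        return "ok: {} is a readable tar archive".format(path)
    with open(path, "rb") as f:
        head = f.read(preview_bytes)
    # An HTML error page or an empty file here points to a failed download.
    return "not a tar archive; file starts with: {!r}".format(head)
```

If the file turns out not to be an archive, deleting it from the download folder forces a fresh download on the next run.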

imflash217 · 2021-03-21T21:37:35Z

Thanks @parmeet, the above code with IWSLT2017 etc. works fine.
Now I am stuck on another issue: torchtext 0.9 requires PyTorch 1.8.0, which breaks my other implementation.
How can I install torchtext==0.9 without upgrading my PyTorch version?
Thanks

parmeet · 2021-03-25T04:31:41Z

I am afraid that will most likely not work. The library is built against a specific version of PyTorch and may not be backward compatible with previous versions of PyTorch.

imflash217 · 2021-03-26T03:03:48Z

This is a bummer. 😰
I am stuck in a loop here. I can't upgrade my CUDA from 10.1 to 10.2+ because of many other dependencies. So my PyTorch cannot be upgraded to 1.8, and hence my torchtext stays below 0.9, where the IWSLT part does not work.

Any suggestions?

parmeet · 2021-03-26T03:47:38Z

The raw datasets in torchtext 0.9 are basically iterables. One workaround is to materialize them into lists and save them: create a separate conda environment, install the CPU-only PyTorch build along with torchtext 0.9 there, materialize the IWSLT2016/17 raw datasets into lists, and save them. Then you can simply load them in your preferred environment. Hope this helps!
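
A minimal sketch of that workaround, assuming the raw dataset yields plain Python objects (for example, tuples of strings) that `pickle` can serialize. `save_split` and `load_split` are hypothetical helper names; the torchtext call that produces the iterable is left out on purpose because it depends on the installed version.

```python
import pickle


def save_split(examples, path):
    """Materialize an iterable of examples into a list and pickle it to `path`."""
    data = list(examples)  # consumes the iterable once
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return len(data)


def load_split(path):
    """Load a list of examples previously written by `save_split`."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

In the torchtext 0.9 environment one would pass the raw training iterable to `save_split`; in the older environment, `load_split` then needs only the standard library. Only unpickle files you created yourself.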

imflash217 · 2021-03-28T20:04:04Z

Thanks @parmeet,
I tried to use the latest build of torch and torchtext in Google Colab, but I am now getting this error.
Any suggestions? Thanks.

parmeet · 2021-03-28T22:29:13Z

The code seems to be using legacy primitives, which we have retired in the latest release (they have been moved to the legacy folder for the time being; please have a look at the release notes: https://github.com/pytorch/text/releases/tag/v0.9.0-rc5).

Can you try to work with the example I shared above for IWSLT2016, together with the migration tutorial here: https://github.com/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb. I think Step 2 of the migration tutorial provides a guide for building the vocab.

garyhlai added 2 commits December 30, 2020 01:03

tweaks code to support new IWSLT url & adds IWSLT test

c418136

adds iwslt ci test

5867bdf

facebook-github-bot added the cla signed label Dec 29, 2020

zhangguanheng66 suggested changes Dec 29, 2020

switches to the default tokenizer in test instead of spacy

5ac9a0e

fixes style errors from flake8

9e7121e

updates md5 of iwslt and fixes download_from_url hash check

75ac292

zhangguanheng66 approved these changes Dec 29, 2020

zhangguanheng66 merged commit 8eee23c into pytorch:master Dec 29, 2020

This was referenced Dec 29, 2020

Cannot download IWSLT dataset #1091

Closed

Can't download IWSLT dataset to Google Colab #1098

Closed

chrisyeh96 mentioned this pull request Jul 15, 2021

torchtext.legacy.datasets.IWSLT is unusable due to outdated URL #1357

Closed
