This repository was archived by the owner on Sep 10, 2025. It is now read-only.

@garyhlai
Contributor

No description provided.

@facebook-github-bot
Contributor

Hi @ghlai9665!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@zhangguanheng66
Contributor

You can use "flake8 translation.py" for lint check

Contributor

@zhangguanheng66 zhangguanheng66 left a comment

Do we need to update the MD5 hash because of a new file?

elif isinstance(URLS[dataset_name], str):
    dataset_tar = download_from_url(URLS[dataset_name], root=root, hash_value=MD5[dataset_name], hash_type='md5')
    extracted_files.extend(extract_archive(dataset_tar))
    dataset_tar = download_from_url(URLS[dataset_name])
Contributor

Do we need to check the hash here?

Contributor Author

Running
dataset_tar = download_from_url(URLS[dataset_name], root=root, hash_value=MD5[dataset_name], hash_type='md5')

gives me this error:

RuntimeError: The hash of .data/2016-01.tgz does not match. Delete the file manually and retry.

That's why I changed it to dataset_tar = download_from_url(URLS[dataset_name])

Why is checking the hash important and what problems could the current change introduce?

Thanks!

Contributor Author

Oh, is the purpose of the MD5 hash to stop the download if the hash check shows that the file is not what we expected?
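
For context, the kind of check being discussed can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not torchtext's actual implementation: `md5_of_file` and `verify_download` are hypothetical helper names, and the error message only loosely mirrors the one quoted above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Hypothetical helper: raise if the file does not match the expected hash."""
    actual = md5_of_file(path)
    if actual != expected_md5:
        raise RuntimeError(
            "The hash of {} does not match: expected {}, got {}.".format(
                path, expected_md5, actual
            )
        )
    return path
```

A mismatch usually means that the file was truncated or corrupted in transit, or that the file behind the URL changed after the expected hash was recorded. Skipping the check lets such files through silently.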

Contributor Author

Just made the changes!

@zhangguanheng66
Contributor

zhangguanheng66 commented Dec 29, 2020

For the failed CI test, I think you need to install the de package for spaCy first. See the example here.

@zhangguanheng66
Contributor

zhangguanheng66 commented Dec 29, 2020

I think you can add de to the CircleCI environment setup files (link1 and link2). You can still use the customized tokenizers in the test.

@garyhlai
Contributor Author

For the failed CI test, I think you need to install the de package first for spacy. see example here.

Is switching back to the default tokenizer OK? It seems to work ok for the test.

@zhangguanheng66
Contributor

For the failed CI test, I think you need to install the de package first for spacy. see example here.

Is switching back to the default tokenizer OK? It seems to work ok for the test.

Sure, that's fine with me.

@garyhlai
Contributor Author

garyhlai commented Dec 29, 2020

You can use "flake8 translation.py" for lint check

I ran flake8 translation.py but it didn't give me any errors.

Running flake8 --version gave me

3.7.9 (mccabe: 0.6.1, pycodestyle: 2.5.0, pyflakes: 2.1.1) CPython 3.7.9 on Darwin

Working on the errors given by flake8 test_builtin_datasets.py though. I think the style errors are coming from there.

@codecov

codecov bot commented Dec 29, 2020

Codecov Report

Merging #1115 (75ac292) into master (adc489b) will increase coverage by 1.35%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1115      +/-   ##
==========================================
+ Coverage   77.54%   78.89%   +1.35%     
==========================================
  Files          45       45              
  Lines        3086     3090       +4     
==========================================
+ Hits         2393     2438      +45     
+ Misses        693      652      -41     
Impacted Files                                        Coverage Δ
torchtext/experimental/datasets/raw/translation.py    91.57% <100.00%> (+26.74%) ⬆️
torchtext/experimental/datasets/translation.py        83.33% <0.00%> (+1.38%) ⬆️
torchtext/utils.py                                    89.54% <0.00%> (+10.45%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@zhangguanheng66
Contributor

You can use "flake8 translation.py" for lint check

I ran flake8 translation.py but it didn't give me any errors.

Running flake8 --version gave me

3.7.9 (mccabe: 0.6.1, pycodestyle: 2.5.0, pyflakes: 2.1.1) CPython 3.7.9 on Darwin

Working on the errors given by flake8 test_builtin_datasets.py though. I think the style errors are coming from there.

Yep, I was just giving an example. LGTM now.

Contributor

@zhangguanheng66 zhangguanheng66 left a comment

Thanks for the contribution. LGTM.

@zhangguanheng66 zhangguanheng66 merged commit 8eee23c into pytorch:master Dec 29, 2020
@maroxtn

maroxtn commented Jan 13, 2021

I still get the same error when I try to download the dataset.

@zhangguanheng66
Contributor

zhangguanheng66 commented Jan 13, 2021

The fix went to the nightly release branch. From the CI test added in the PR, it looks good now.

@maroxtn

maroxtn commented Jan 14, 2021

@zhangguanheng66 Sorry for the beginner questions. I tried to install the nightly version with !pip install --pre torch torchtext -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html and then imported torchtext as I normally would, but I get the same error.
Could you tell me what I'm doing wrong?

@zhangguanheng66
Contributor

@zhangguanheng66 Sorry for my beginner questions, but I tried to install the nightly version with this !pip install --pre torch torchtext -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html , then I import torchtext as I would normally do, and get the same error.
Could you tell me what I'm doing wrong?

Could you double check that you are using what you installed? For example

python -c "import torchtext; print(torchtext.__file__)"

and make sure it's your nightly release. Otherwise, you have to uninstall and then install.
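
The same check can be done from inside a notebook or script. The sketch below is a generic helper, not part of torchtext: `describe_install` is a hypothetical name, and it only reads the standard `__version__` and `__file__` attributes of whatever module it imports.

```python
import importlib


def describe_install(module_name):
    """Import a module and report which version and which file were picked up."""
    module = importlib.import_module(module_name)
    return {
        "name": module_name,
        # Not every module defines __version__, so fall back to None.
        "version": getattr(module, "__version__", None),
        "file": getattr(module, "__file__", None),
    }
```

Calling `describe_install("torchtext")` in the environment that shows the error tells you whether the interpreter is really importing the nightly build or an older copy that is still on the path.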

@maroxtn

maroxtn commented Jan 14, 2021

I did the following

$ pip uninstall torchtext -y
$ pip install --pre torch torchtext -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
$ python -c "import torchtext; print(torchtext.__file__)"
/opt/conda/lib/python3.7/site-packages/torchtext/__init__.py

Then I imported torchtext, and I still got the same error. Am I doing something wrong?

@maroxtn

maroxtn commented Jan 15, 2021

Can you confirm that I am doing everything right, @zhangguanheng66?

@zhangguanheng66
Contributor

Could you delete all the torchtext packages in /opt/conda/lib/python3.7/site-packages/ folder and reinstall the package again? Make sure you don't have the old package. Can you also check the version?

python -c "import torchtext; print(torchtext.__version__)"

@maroxtn

maroxtn commented Jan 15, 2021

@zhangguanheng66 I deleted the files (!rm -rf /opt/conda/lib/python3.7/site-packages/torchtext), uninstalled the package using pip, and reinstalled the nightly version. This is the version I got:

 0.9.0.dev20210115

@zhangguanheng66
Contributor

@zhangguanheng66 I deleted the files !rm -rf /opt/conda/lib/python3.7/site-packages/torchtext then I uninstalled the package using pip, then reinstalled the nightly version and I got this as the version:

 0.9.0.dev20210115

Yup. This is the correct version.
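
A version string like the one above can also be checked programmatically. The sketch below assumes the `X.Y.Z.devYYYYMMDD` shape seen in this thread; `parse_nightly` is a hypothetical helper, and other nightly naming schemes would need a different pattern.

```python
import re
from datetime import date

# Matches strings such as "0.9.0.dev20210115" (optionally followed by a suffix).
_NIGHTLY = re.compile(r"^(\d+)\.(\d+)\.(\d+)\.dev(\d{4})(\d{2})(\d{2})")


def parse_nightly(version):
    """Return ((major, minor, patch), build_date) for a nightly version, else None."""
    match = _NIGHTLY.match(version.strip())
    if match is None:
        return None
    major, minor, patch, year, month, day = (int(g) for g in match.groups())
    return (major, minor, patch), date(year, month, day)
```

For example, comparing the parsed build date with the merge date of this PR (Dec 29, 2020) is a quick way to confirm that an installed nightly was built after the fix landed.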

@maroxtn

maroxtn commented Jan 16, 2021

But it still yields the same error. Do you experience the same issue?

@zhangguanheng66
Contributor

But it still yields the same error. Do you experience the same issue ?

Can you send me a code snippet and copy/paste the error here?

@maroxtn

maroxtn commented Jan 16, 2021

This notebook reproduces the error: https://colab.research.google.com/drive/1kGqGEkWBFxY7dtU-0xeDzQiKj3zg1y88?usp=sharing

@zhangguanheng66
Contributor

This notebook reproduces the error: https://colab.research.google.com/drive/1kGqGEkWBFxY7dtU-0xeDzQiKj3zg1y88?usp=sharing

You should switch to the IWSLT dataset in the experimental folder. The one in the root folder is not maintained and we will retire them very soon as legacy code.

@maroxtn

maroxtn commented Jan 16, 2021

Thanks for the response, but could you show me how to use the datasets in the experimental folder?

@zhangguanheng66
Contributor

Take a look at this example link.

@imflash217

imflash217 commented Mar 19, 2021

Take a look at this example link.

I am getting this error while executing the code as in the example:

What am I doing wrong here?

---------------------------------------------------------------------------
ReadError                                 Traceback (most recent call last)
<ipython-input-8-639784b62f93> in <module>
----> 1 train_dataset, valid_dataset, test_dataset = IWSLT(tokenizer=(src_tokenizer,tgt_tokenizer))

~/anaconda3/envs/aogtr/lib/python3.8/site-packages/torchtext/experimental/datasets/translation.py in IWSLT(train_filenames, valid_filenames, test_filenames, tokenizer, root, vocab, data_select, removed_tokens)
    428     """
    429 
--> 430     return _setup_datasets("IWSLT",
    431                            train_filenames=train_filenames,
    432                            valid_filenames=valid_filenames,

~/anaconda3/envs/aogtr/lib/python3.8/site-packages/torchtext/experimental/datasets/translation.py in _setup_datasets(dataset_name, train_filenames, valid_filenames, test_filenames, data_select, root, vocab, tokenizer, removed_tokens)
     38             "tokenizer must be an instance of tuple with length two"
     39             "or None")
---> 40     train, val, test = DATASETS[dataset_name](train_filenames=train_filenames,
     41                                               valid_filenames=valid_filenames,
     42                                               test_filenames=test_filenames,

~/anaconda3/envs/aogtr/lib/python3.8/site-packages/torchtext/experimental/datasets/raw/translation.py in IWSLT(train_filenames, valid_filenames, test_filenames, root)
    451     URLS["IWSLT"] = URLS["IWSLT"].format(src_language, tgt_language, languages)
    452 
--> 453     return _setup_datasets(
    454         "IWSLT",
    455         train_filenames=train_filenames,

~/anaconda3/envs/aogtr/lib/python3.8/site-packages/torchtext/experimental/datasets/raw/translation.py in _setup_datasets(dataset_name, train_filenames, valid_filenames, test_filenames, root)
    137     elif isinstance(URLS[dataset_name], str):
    138         dataset_tar = download_from_url(URLS[dataset_name], root=root)
--> 139         extracted_files.extend(extract_archive(dataset_tar))
    140     else:
    141         raise ValueError(

~/anaconda3/envs/aogtr/lib/python3.8/site-packages/torchtext/utils.py in extract_archive(from_path, to_path, overwrite)
    189     if from_path.endswith(('.tar.gz', '.tgz')):
    190         logging.info('Opening tar file {}.'.format(from_path))
--> 191         with tarfile.open(from_path, 'r') as tar:
    192             files = []
    193             for file_ in tar:

~/anaconda3/envs/aogtr/lib/python3.8/tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1604                         fileobj.seek(saved_pos)
   1605                     continue
-> 1606             raise ReadError("file could not be opened successfully")
   1607 
   1608         elif ":" in mode:

ReadError: file could not be opened successfully
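
One way to narrow down a `ReadError` like this is to check whether the downloaded file is a tar archive at all. The sketch below is a diagnostic aid under an assumption that is not confirmed in this thread, namely that the download produced something other than the expected archive (for example an HTML error page). `inspect_archive` is a hypothetical helper built on the standard-library `tarfile.is_tarfile`.

```python
import tarfile


def inspect_archive(path, preview_bytes=64):
    """Report whether `path` looks like a tar archive and show its first bytes."""
    with open(path, "rb") as f:
        head = f.read(preview_bytes)
    return {
        # True for plain and compressed tar files that tarfile can open.
        "is_tar": tarfile.is_tarfile(path),
        # A gzip file starts with b"\x1f\x8b"; an HTML page usually starts with b"<".
        "head": head,
    }
```

If `is_tar` is false, deleting the cached file and downloading it again is the usual next step.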

@parmeet
Contributor

parmeet commented Mar 19, 2021

Take a look at this example link.

I am getting this error while executing the code as in the example:

What am I doing wrong here?

ReadError: file could not be opened successfully

Thank you for raising this issue. Unfortunately, the example code is broken (I will work on a PR to fix it). We have split IWSLT into two separate datasets, 'IWSLT2016' and 'IWSLT2017'. Could you try the following code snippet:

from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets import IWSLT2016

# spaCy German model for the source side, built-in English tokenizer for the target side.
src_tokenizer = get_tokenizer("spacy", language='de_core_news_sm')
tgt_tokenizer = get_tokenizer("basic_english")

# The tokenizer argument is a tuple of length two: (source, target).
train_dataset, valid_dataset, test_dataset = IWSLT2016(tokenizer=(src_tokenizer, tgt_tokenizer))

src_vocab, tgt_vocab = train_dataset.get_vocab()
src_data, tgt_data = train_dataset[10]

@imflash217

Thanks @parmeet, the above code with IWSLT2017 etc. works fine.
Now I am stuck on another issue: torchtext 0.9 requires PyTorch 1.8.0, which breaks my other implementation.
How can I install torchtext==0.9 without upgrading my PyTorch version?
Thanks

@parmeet
Contributor

parmeet commented Mar 25, 2021

Thanks @parmeet , the above code with IWSLT2017 etc. works fine.
Now, I am stuck in another issue that torchtext 0.9 requires PyTorch 1.8.0 which breaks my other implementation.
How could I install torchtext==0.9 without upgrading my PyTorch version?
Thanks

I am afraid that will most likely not work. The library is built against a specific version of PyTorch and may not be backward compatible with previous versions of PyTorch.
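
To make the pairing explicit, a small guard can be run before importing either library. This is only a sketch: the table holds the single pairing stated in this thread (torchtext 0.9 with PyTorch 1.8), and `REQUIRED_TORCH`, `major_minor`, and `check_pairing` are hypothetical names. Other releases would have to be added from the official compatibility notes.

```python
# Pairing taken from this thread only; extend from the official release notes.
REQUIRED_TORCH = {"0.9": "1.8"}


def major_minor(version):
    """Reduce a version such as '1.7.1+cu101' to its 'major.minor' prefix."""
    base = version.split("+")[0]
    return ".".join(base.split(".")[:2])


def check_pairing(torchtext_version, torch_version):
    """Return True/False for a known pairing, or None if the release is not listed."""
    required = REQUIRED_TORCH.get(major_minor(torchtext_version))
    if required is None:
        return None
    return major_minor(torch_version) == required
```

With this table, `check_pairing("0.9.0", "1.7.1+cu101")` reports a mismatch, which is the situation described in the question above.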

@imflash217

imflash217 commented Mar 26, 2021

Thanks @parmeet , the above code with IWSLT2017 etc. works fine.
Now, I am stuck in another issue that torchtext 0.9 requires PyTorch 1.8.0 which breaks my other implementation.
How could I install torchtext==0.9 without upgrading my PyTorch version?
Thanks

I am afraid that will most likely not work. The library is built against a specific version of PyTorch and may not be backward compatible with previous versions of PyTorch.

This is a bummer. 😰
I am stuck in a loop here. I can't upgrade my CUDA from 10.1 to 10.2+ because of a lot of other dependencies. So my PyTorch cannot be upgraded to 1.8, and hence my torchtext stays below 0.9, where the IWSLT part is not working.

Any suggestions on this?

@parmeet
Contributor

parmeet commented Mar 26, 2021

The raw datasets in torchtext 0.9 are basically iterables. One workaround is to materialize them into a list and save it: create a separate conda env, install the CPU-only PyTorch version along with torchtext 0.9 there, materialize the IWSLT2016/17 raw dataset into a list, and save that list. Then you can simply load it in your preferred environment. Hope this helps!
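
The workaround described above can be sketched with the standard-library `pickle` module. The helper names `materialize_and_save` and `load_materialized` are hypothetical, and the sketch assumes that each item yielded by the raw dataset is a plain picklable Python object (for example a tuple of source and target strings). It works with any iterable, so it does not import torchtext itself.

```python
import pickle


def materialize_and_save(raw_iterable, path):
    """Consume an iterable dataset, store its items as a list, return the item count."""
    items = list(raw_iterable)  # pulls every example into memory
    with open(path, "wb") as f:
        pickle.dump(items, f)
    return len(items)


def load_materialized(path):
    """Load a list saved by materialize_and_save; needs no torchtext at load time."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

In the separate environment you would pass the raw training split to `materialize_and_save`, and in the main environment call `load_materialized` on the resulting file. Only load pickle files that you created yourself, since unpickling untrusted data can execute arbitrary code.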

@imflash217

Thanks @parmeet,
So I tried the latest builds of torch and torchtext in Google Colab,
but now I am getting this error.
Any suggestions on this? Thanks.

image

@parmeet
Contributor

parmeet commented Mar 28, 2021

The code seems to be using legacy primitives, which we have retired in the latest release (they have been moved to the legacy folder for the time being; please have a look at the Release Notes: https://github.com/pytorch/text/releases/tag/v0.9.0-rc5).

Can you try working with the example I shared above for IWSLT2016? The migration tutorial may also help: https://github.com/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb. I think Step 2 of that tutorial provides a migration guide for building the vocab.
