tweaks code to support new url for IWSLT dataset #1115
Conversation
Hi @ghlai9665! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

You can use `flake8 translation.py` for a lint check.
zhangguanheng66
left a comment
Do we need to update the MD5 hash because of a new file?
```diff
 elif isinstance(URLS[dataset_name], str):
-    dataset_tar = download_from_url(URLS[dataset_name], root=root, hash_value=MD5[dataset_name], hash_type='md5')
+    dataset_tar = download_from_url(URLS[dataset_name])
     extracted_files.extend(extract_archive(dataset_tar))
```
Do we need to check the hash here?
Running

```python
dataset_tar = download_from_url(URLS[dataset_name], root=root, hash_value=MD5[dataset_name], hash_type='md5')
```

gives me this error:

```
RuntimeError: The hash of .data/2016-01.tgz does not match. Delete the file manually and retry.
```

That's why I changed it to `dataset_tar = download_from_url(URLS[dataset_name])`.
Why is checking the hash important and what problems could the current change introduce?
Thanks!
Oh, is the purpose of the MD5 hash to stop the download if, by checking the hash, we find that the file we're downloading is not what we expected?
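The idea behind such a check can be sketched with just the standard library (the helper name below is an illustration, not torchtext's actual API):

```python
import hashlib

def verify_md5(path, expected_md5, chunk_size=1 << 20):
    """Return True if the file's MD5 digest matches the expected hex string.

    Reading in chunks keeps memory bounded for large archives like the
    IWSLT tarball.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5
```

If the server starts serving a different file at the same URL, the mismatch surfaces immediately at download time rather than as a confusing failure later during extraction or training.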
Just made the changes!
For the failed CI test, I think you need to install the `de` package for spaCy first; see the example here.
Is switching back to the default tokenizer OK? It seems to work OK for the test.

Sure. It's fine for me.
I ran it again and am now working on the errors it gave.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #1115      +/-   ##
==========================================
+ Coverage   77.54%   78.89%    +1.35%
==========================================
  Files          45       45
  Lines        3086     3090        +4
==========================================
+ Hits         2393     2438       +45
+ Misses        693       652      -41
```

Continue to review the full report at Codecov.
Yep, I was giving an example. LGTM now.
zhangguanheng66
left a comment
Thanks for the contribution. LGTM.
I still get the same error when I try to download the dataset?
The fix went to the nightly release branch. From the CI test added in the PR, it looks good now.
@zhangguanheng66 Sorry for my beginner questions, but I tried to install the nightly version with this:
Could you double-check that you are using what you installed? For example, check the installed version and make sure it's your nightly release. Otherwise, you have to uninstall and then reinstall.
I did the following, then imported torchtext, and I still got the same error. Am I doing something wrong?
Can you confirm that I am doing everything right, @zhangguanheng66?
Could you delete all the torchtext packages in
@zhangguanheng66 I deleted the files

Yup. This is the correct version.
But it still yields the same error. Do you experience the same issue?

Can you send me a code snippet and copy/paste the error here?
This notebook reproduces the error: https://colab.research.google.com/drive/1kGqGEkWBFxY7dtU-0xeDzQiKj3zg1y88?usp=sharing
You should switch to the datasets in the experimental folder.
Thanks for the response, but could you instruct me on how to use the datasets in the experimental folder?
Take a look at this example link.
I am getting this error while executing as in the example: What am I doing wrong here?
Thank you for raising this issue. Unfortunately, the example code is broken (I will work on a PR to fix it). We have split IWSLT into two separate datasets, 'IWSLT2016' and 'IWSLT2017'. Could you try with the following code snippet:
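(The original snippet was not preserved in this thread. As a hedged sketch of the post-split API, assuming torchtext >= 0.9 and that `language_pair` is the relevant keyword argument, loading IWSLT2016 might look like this; the helper name is illustrative:)

```python
def load_iwslt2016(language_pair=("de", "en")):
    """Return the (train, valid, test) raw iterators for IWSLT2016.

    Assumes torchtext >= 0.9, where IWSLT was split into IWSLT2016 and
    IWSLT2017. The first call downloads the archive, so the heavy import
    is deferred into the function body.
    """
    from torchtext.datasets import IWSLT2016
    return IWSLT2016(language_pair=language_pair)
```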
Thanks @parmeet, the above code with
I am afraid that will most likely not work. The library is built against a specific version of PyTorch and may not be backward compatible with previous versions of PyTorch.

This is a bummer. 😰 Any suggestions on this?
The raw datasets in torchtext 0.9 are basically Iterables. One workaround would be to materialize them into lists and save them: create a separate conda env where you install a CPU-only PyTorch along with torchtext 0.9, materialize the IWSLT2016/17 raw datasets into lists, and save them. Then you can simply load them in your preferred environment. Hope this helps!
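That materialize-and-save workaround can be sketched generically (helper names and file paths are illustrative; the commented lines assume a torchtext 0.9 environment):

```python
import pickle

def materialize_and_save(raw_iter, path):
    """Drain an iterable dataset into a list and pickle it to disk."""
    data = list(raw_iter)
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return data

def load_materialized(path):
    """Load a previously pickled dataset; works in any environment,
    no torchtext required."""
    with open(path, "rb") as f:
        return pickle.load(f)

# In the CPU-only env with torchtext 0.9 installed (assumption):
# from torchtext.datasets import IWSLT2016
# train_iter, valid_iter, test_iter = IWSLT2016()
# materialize_and_save(train_iter, "iwslt2016_train.pkl")
```

The saved pickle can then be loaded with `load_materialized` in the environment where the newer PyTorch lives.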
Thanks @parmeet,
The code seems to be using legacy primitives which we have retired in the latest release (they are in the legacy folder for the time being; please have a look at the Release Notes: https://github.com/pytorch/text/releases/tag/v0.9.0-rc5). Can you try to work with the example I shared above for IWSLT2016, together with the migration tutorial here: https://github.com/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb. I think Step 2 of the migration tutorial provides a guide for building the vocab.
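The gist of that vocab-building step can be shown without torchtext at all (a library-free sketch; the special tokens and ordering are illustrative, not the tutorial's exact code):

```python
from collections import Counter

def build_vocab(data_iter, tokenizer, specials=("<unk>", "<pad>")):
    """Count tokens across an iterator and map each token to an index,
    placing special tokens first and ordering the rest by frequency."""
    counter = Counter()
    for line in data_iter:
        counter.update(tokenizer(line))
    itos = list(specials) + [tok for tok, _ in counter.most_common()]
    return {tok: i for i, tok in enumerate(itos)}
```

In torchtext's newer API the equivalent role is played by `torchtext.vocab.build_vocab_from_iterator`, which the migration tutorial walks through.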

No description provided.