- 
                Notifications
    You must be signed in to change notification settings 
- Fork 736
[BC-Breaking] Remove download and change root directory of CommonVoice #1082
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| raise RuntimeError( | ||
| f"The hash of {filepath} does not match. Delete the file manually and retry." | ||
| ) | ||
| extract_archive(archive) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep archive extraction, e.g. here
| if not os.path.isdir(self._path): | ||
| if not os.path.isfile(archive): | ||
| checksum = _CHECKSUMS.get(url, None) | ||
| download_url(url, root, hash_value=checksum) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we can't download the dataset due to legal reason, we can still extract it.
If the user has not extracted the dataset yet, we can do so.
If we're about to extract and the user has provided a checksum (from the common voice website) we can also verify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turned out that dataset archives have different directory structure for different version.
- Version 4 - cv-corpus-4-2019-12-10
$ tar -tvf tmp/en.tar.gz
2019-12-09 17:47 .DS_Store
2019-12-10 08:09 clips/
2019-12-10 05:41 dev.tsv
2019-12-10 05:41 invalidated.tsv
2019-12-10 05:41 other.tsv
2019-12-10 05:41 test.tsv
2019-12-10 05:41 train.tsv
2019-12-10 05:41 validated.tsv
...
- Version 5.1 - cv-corpus-5.1-2020-06-22
$ tar -tvf en.tar.gz
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/
2020-07-11 20:40 cv-corpus-5.1-2020-06-22/en/clips/
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/dev.tsv
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/invalidated.tsv
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/other.tsv
2020-07-11 20:41 cv-corpus-5.1-2020-06-22/en/reported.tsv
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/test.tsv
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/train.tsv
...
Also, because of this (and the fact that folder_in_archive argument which is not used properly), the previous implementation (download + extract) does not work. #1088 (I suspect it ever worked.)
The thing is that the current implementation expects the files to be presented at
<ROOT_DIR>/CommonVoice/<VERSION>/<LANG_CODE>.
But the previous code behavior extracts the archive at either
<ROOT_DIR>/ or
<ROOT_DIR>/<VERSION>/<LANG_CODE>.
I was thinking to resort to the original behavior when handling archive, but since there was not working behavior in the previous implementation when it comes to download + extract, I again propose not to try to extract the archive and ask users to extract the archive and provide the directory where the data are found. (and deprecate folder_in_archive) argument in addition to download and url arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
offline discussion: change the root to expect the exact directory users extract the archive.
acbf081    to
    26e9de4      
    Compare
  
    26e9de4    to
    29aef9f      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. In fact, we could consider this simplified interface for other datasets too, cc #910.
| for name, val in deprecated: | ||
| if val is not None: | ||
| warnings.warn( | ||
| f"`{name}` argument is no longer used and deprecated. " | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a follow-up, can we improve the message here? What should someone do who is currently already using this dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought no longer used + Please remove it from the function call would suffice. What do you suggest?
|  | ||
| base_url = "https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com" | ||
| url = os.path.join(base_url, version, language + ext_archive) | ||
| "Please download the dataset and extract it manually.") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a follow-up, we can provide a bit more guidance (a link or such) on how to go about this.
Co-authored-by: holly1238 <[email protected]>
After #1080, this PR
rootargumentAPI-wise, it is not BC-braking, but users have to change the root directory they give to
rootargument.rootargument<root_dir><root_dir>/CommonVoice/<VERSION>/<LANGUAGE><root_dir><root_dir>If you were using,
root="/data", url="english", version="cv-corpus-4-2019-12-10", then you need to change therootto/data/CommonVoice/cv-corpus-4-2019-12-10/en, (and you can should theurlandversionargument)See #1077 for the agenda.
Closes #1083 #1077