Add SpeechCommands train/valid/test split #966
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,5 @@ | ||
| import os | ||
| from typing import Tuple | ||
| from typing import Tuple, Optional | ||
|
|
||
| import torchaudio | ||
| from torch.utils.data import Dataset | ||
|
|
@@ -22,6 +22,15 @@ | |
| } | ||
|
|
||
|
|
||
| def _load_list(root, *filenames): | ||
| output = [] | ||
| for filename in filenames: | ||
| filepath = os.path.join(root, filename) | ||
| with open(filepath) as fileobj: | ||
| output += [os.path.normpath(os.path.join(root, line.strip())) for line in fileobj] | ||
|
Contributor
Can you add a comment on why [...]? |
||
| return output | ||
|
|
||
|
|
||
| def load_speechcommands_item(filepath: str, path: str) -> Tuple[Tensor, int, str, str, int]: | ||
| relpath = os.path.relpath(filepath, path) | ||
| label, filename = os.path.split(relpath) | ||
|
|
@@ -48,13 +57,25 @@ class SPEECHCOMMANDS(Dataset): | |
| The top-level directory of the dataset. (default: ``"SpeechCommands"``) | ||
| download (bool, optional): | ||
| Whether to download the dataset if it is not found at root path. (default: ``False``). | ||
| subset (Optional[str]): | ||
| Select a subset of the dataset [None, "training", "validation", "testing"]. None means | ||
| the whole dataset. "validation" and "testing" are defined in "validation_list.txt" and | ||
| "testing_list.txt", respectively, and "training" is the rest. (default: ``None``) | ||
| """ | ||
|
|
||
| def __init__(self, | ||
| root: str, | ||
| url: str = URL, | ||
| folder_in_archive: str = FOLDER_IN_ARCHIVE, | ||
| download: bool = False) -> None: | ||
| download: bool = False, | ||
| subset: Optional[str] = None, | ||
| ) -> None: | ||
|
|
||
| assert subset is None or subset in ["training", "validation", "testing"], ( | ||
| "When `subset` not None, it must take a value from " | ||
| + "{'training', 'validation', 'testing'}." | ||
| ) | ||
|
|
||
| if url in [ | ||
| "speech_commands_v0.01", | ||
| "speech_commands_v0.02", | ||
|
|
@@ -79,9 +100,22 @@ def __init__(self, | |
| download_url(url, root, hash_value=checksum, hash_type="md5") | ||
| extract_archive(archive, self._path) | ||
|
|
||
| walker = walk_files(self._path, suffix=".wav", prefix=True) | ||
| walker = filter(lambda w: HASH_DIVIDER in w and EXCEPT_FOLDER not in w, walker) | ||
| self._walker = list(walker) | ||
| if subset == "validation": | ||
| self._walker = _load_list(self._path, "validation_list.txt") | ||
| elif subset == "testing": | ||
| self._walker = _load_list(self._path, "testing_list.txt") | ||
| elif subset == "training": | ||
|
Contributor
I see there is a behavior inconsistency between [...]. If certain valid files are removed from the dataset, then this dataset implementation will keep working for [...]. What's your take on that?
Contributor
Author
"testing", "validation", and "training" are defined by the dataset through its validation/testing list files. Anything outside of that is technically undefined by the dataset. Given how the three are defined by the dataset in those files, as a user, I'd expect changes to those files to propagate. I'd say it'd be fair to add a quick note in the docstring.
Contributor
Author
Added.
Contributor
I agree with that; however, IIRC, one of the goals of torchaudio's dataset implementation is to make it easy to modify the dataset, something along the lines of points 3 and 4 of #852 (comment). With this rule, we have to take the extra step of thinking through which kinds of modification are valid or invalid and what the expected behavior is, and put that in the implementation. I am trying to raise awareness of it.
Contributor
And of course, if we are just getting rid of the easy modification of the dataset, we do not need to think about any modification to the dataset, and we can leave it as UB.
Contributor
Author
Actually, SpeechCommands does explain how those two files are generated, and how to generalize their approach; see the README. |
||
| excludes = set(_load_list(self._path, "validation_list.txt", "testing_list.txt")) | ||
| walker = walk_files(self._path, suffix=".wav", prefix=True) | ||
| self._walker = [ | ||
| w for w in walker | ||
| if HASH_DIVIDER in w | ||
| and EXCEPT_FOLDER not in w | ||
|
Contributor
Contributor
Author
I'll avoid changing the logic here to avoid breaking any code, say if a user moved a file in the background_noise folder, etc. This can be looked at in a later PR. |
||
| and os.path.normpath(w) not in excludes | ||
| ] | ||
| else: | ||
| walker = walk_files(self._path, suffix=".wav", prefix=True) | ||
| self._walker = [w for w in walker if HASH_DIVIDER in w and EXCEPT_FOLDER not in w] | ||
|
|
||
| def __getitem__(self, n: int) -> Tuple[Tensor, int, str, str, int]: | ||
| """Load the n-th sample from the dataset. | ||
|
|
||
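The subset selection in this diff, and the inconsistency raised in the review thread, can be sketched outside torchaudio roughly as follows. This is a minimal standalone approximation, not the PR's code: `select_subset`, `walk_wavs`, and `build_toy_dataset` are hypothetical names, `os.walk` stands in for `walk_files`, and the values of `HASH_DIVIDER` and `EXCEPT_FOLDER` are assumed because the diff does not show them.

```python
import os
import tempfile

# Assumed values; the diff references these constants but does not show them.
HASH_DIVIDER = "_nohash_"
EXCEPT_FOLDER = "_background_noise_"


def load_list(root, *filenames):
    """Read list files under root; return each entry joined to root and normalized."""
    output = []
    for filename in filenames:
        with open(os.path.join(root, filename)) as fileobj:
            output += [
                os.path.normpath(os.path.join(root, line.strip()))
                for line in fileobj
                if line.strip()
            ]
    return output


def walk_wavs(root):
    """Stand-in for walk_files: all .wav paths under root, normalized and sorted."""
    return sorted(
        os.path.normpath(os.path.join(dirpath, name))
        for dirpath, _, names in os.walk(root)
        for name in names
        if name.endswith(".wav")
    )


def select_subset(root, subset=None):
    """Return the file list for a subset, mirroring the branches in the diff."""
    if subset not in (None, "training", "validation", "testing"):
        raise ValueError("subset must be None, 'training', 'validation', or 'testing'")
    # "validation" and "testing" trust the list files and never touch the disk.
    if subset == "validation":
        return load_list(root, "validation_list.txt")
    if subset == "testing":
        return load_list(root, "testing_list.txt")
    # "training" and None walk the disk; "training" also drops listed files.
    excludes = set()
    if subset == "training":
        excludes = set(load_list(root, "validation_list.txt", "testing_list.txt"))
    return [
        w for w in walk_wavs(root)
        if HASH_DIVIDER in w and EXCEPT_FOLDER not in w and w not in excludes
    ]


def build_toy_dataset(root):
    """Create empty .wav files and list files laid out like the real dataset."""
    files = [
        "yes/a_nohash_0.wav",
        "yes/b_nohash_0.wav",
        "no/c_nohash_0.wav",
        "no/d_nohash_0.wav",
        "_background_noise_/noise.wav",
    ]
    for rel in files:
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "wb").close()
    with open(os.path.join(root, "validation_list.txt"), "w") as fileobj:
        fileobj.write("yes/b_nohash_0.wav\n")
    with open(os.path.join(root, "testing_list.txt"), "w") as fileobj:
        fileobj.write("no/d_nohash_0.wav\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        build_toy_dataset(root)
        for name in (None, "training", "validation", "testing"):
            print(name, len(select_subset(root, name)))
```

In this sketch, deleting a file that appears in `validation_list.txt` leaves the "validation" list unchanged (it still names the missing file), while the walked subsets simply shrink. That is the asymmetry the review thread discusses.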
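The review thread notes that the SpeechCommands README explains how the two list files are generated. As a rough illustration of that kind of scheme, here is a hash-based assignment sketch. It is written from general knowledge of the approach rather than copied from the README: the function name `which_set`, the default percentages, and the exact bucketing are assumptions. The key idea is that the part of the file name before `_nohash_` is hashed, so all recordings with the same prefix land in the same subset and the assignment stays stable as files are added.

```python
import hashlib
import os
import re


def which_set(filename: str, validation_pct: int = 10, testing_pct: int = 10) -> str:
    """Assign a file to a subset from a stable hash of its name prefix.

    Hypothetical sketch: percentages and bucketing are illustrative assumptions.
    """
    base = os.path.basename(filename)
    # Ignore everything from "_nohash_" onward so related recordings stay together.
    prefix = re.sub(r"_nohash_.*$", "", base)
    digest = hashlib.sha1(prefix.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < validation_pct:
        return "validation"
    if bucket < validation_pct + testing_pct:
        return "testing"
    return "training"
```

Because the result depends only on the file name, rerunning the assignment after adding or removing files does not move existing files between subsets.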