
Improve tests for Dataset #821

@mthrok

Description


torchaudio has had only minimalistic tests for its dataset implementations (see the existing test module).

Recently we have improved our test utilities, and we can now generate synthetic data that emulates a subset of a dataset. See the YesNo and GTZAN examples.

We would like to do the same for the remaining datasets:

  • VCTK
  • LibriSpeech
  • LJSpeech
  • SpeechCommands
  • CMUArctic
  • CommonVoice

General Direction

  1. Check the dataset of interest and pick a subset of files (check their naming conventions, sampling rate, and number of channels).

  2. Following the approach of the existing test modules, create a new test module test/datasets/XXX_test.py and define your test class (a sketch of such a module appears after this list).

  3. Generate a pseudo dataset in the setUpClass method and create a list of the expected data.

  4. Traverse the directory with the Dataset implementation.

  5. Check that the files are traversed in the expected order and that the loaded data match the expected data.

  6. Check that the Dataset traversed the expected number of files.

  7. If the dataset has multiple operational modes, like subset in GTZAN, also add these as test methods.

  8. Once the new test is added, remove the original test and the associated assets:

    test/assets/ARCTIC/cmu_us_aew_arctic/etc/txt.done.data
    test/assets/ARCTIC/cmu_us_aew_arctic/wav/arctic_a0024.wav
    test/assets/CommonVoice/cv-corpus-4-2019-12-10/tt/clips/common_voice_tt_00000000.wav
    test/assets/CommonVoice/cv-corpus-4-2019-12-10/tt/train.tsv
    test/assets/LJSpeech-1.1/metadata.csv
    test/assets/LJSpeech-1.1/wavs/LJ001-0001.wav
    test/assets/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac
    test/assets/LibriSpeech/dev-clean/1272/128104/1272-128104.trans.txt
    test/assets/SpeechCommands/speech_commands_v0.02/go/0a9f9af7_nohash_0.wav
    test/assets/VCTK-Corpus/txt/p224/p224_002.txt
    test/assets/VCTK-Corpus/wav48/p224/p224_002.wav
    
  9. Once the PR is ready, add @mthrok as a reviewer.
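
A minimal sketch of such a test module, modeled on the YesNo example, is shown below. The helper names (TempDirMixin, TorchaudioTestCase, get_whitenoise, save_wav, normalize_wav), their import path, and the assertion tolerances are assumptions based on the existing YesNo and GTZAN tests; the labels and file layout are illustrative, not a prescription for any particular dataset.

```python
import os

from torchaudio.datasets import yesno

# Assumed to be the shared test utilities used by the YesNo / GTZAN tests.
from ..common_utils import (
    TempDirMixin,
    TorchaudioTestCase,
    get_whitenoise,
    save_wav,
    normalize_wav,
)


class TestYesNo(TempDirMixin, TorchaudioTestCase):
    backend = 'default'

    root_dir = None
    data = []
    # In the YesNo dataset the labels are encoded in the file names.
    labels = [
        [0, 0, 1, 1, 0, 1, 0, 1],
        [1, 0, 0, 1, 0, 1, 1, 0],
    ]

    @classmethod
    def setUpClass(cls):
        cls.root_dir = cls.get_base_temp_dir()
        base_dir = os.path.join(cls.root_dir, 'waves_yesno')
        os.makedirs(base_dir, exist_ok=True)
        for i, label in enumerate(cls.labels):
            filename = f'{"_".join(str(v) for v in label)}.wav'
            path = os.path.join(base_dir, filename)
            # Generate synthetic audio and save it as 16-bit wav,
            # matching the format of the reference dataset.
            data = get_whitenoise(
                sample_rate=8000, duration=6, n_channels=1, dtype='int16', seed=i)
            save_wav(path, data, 8000)
            # The Dataset yields normalized float32 data, so keep the
            # normalized waveform as the expected value.
            cls.data.append(normalize_wav(data))

    def test_yesno(self):
        dataset = yesno.YESNO(self.root_dir)
        n_items = 0
        for i, (waveform, sample_rate, label) in enumerate(dataset):
            # Files must be traversed in the expected order,
            # and the loaded data must match the synthetic data.
            self.assertEqual(waveform, self.data[i], atol=5e-5, rtol=1e-8)
            assert sample_rate == 8000
            assert label == self.labels[i]
            n_items += 1
        # The Dataset must yield exactly the expected number of files.
        assert n_items == len(self.labels)
```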

Note

  • It is highly recommended to use Anaconda.
  • Please use a nightly build of PyTorch. https://pytorch.org/
  • You can run the tests with pytest test/datasets/XXX_test.py.
  • PR example: Make GTZAN dataset sorted and use on-the-fly data in GTZAN test #819
  • For simplicity, please use the wav format when saving synthetic data (save_wav), even if the reference dataset uses another format (decoding formats like mp3 adds complexity to the test logic, which we are trying to avoid).
  • When saving wave data with save_wav, the dtype of the Tensor makes a difference. If the reference dataset uses the WAV format, use the same bit depth (e.g. int16). If the reference dataset uses a compressed format, like mp3 or flac, use float32 wav (see the snippet below).
  • Data loaded with a Dataset implementation is typically normalized (values in [-1.0, 1.0]) and of float32 type (which is why normalize_wav is used to generate the reference data in the examples above).
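
As an illustration of the last three points, here is a hedged snippet (again assuming the same get_whitenoise, save_wav, and normalize_wav utilities as in the sketch above; the file names are made up):

```python
# Inside a test module under test/datasets/.
from ..common_utils import get_whitenoise, save_wav, normalize_wav

sample_rate = 16000

# Reference dataset ships 16-bit WAV files -> save the synthetic data as int16.
data_int16 = get_whitenoise(
    sample_rate=sample_rate, duration=1, n_channels=1, dtype='int16', seed=0)
save_wav('sample_from_wav_dataset.wav', data_int16, sample_rate)

# Reference dataset ships mp3/flac -> still save wav, but with float32 samples.
data_float32 = get_whitenoise(
    sample_rate=sample_rate, duration=1, n_channels=1, dtype='float32', seed=1)
save_wav('sample_from_flac_dataset.wav', data_float32, sample_rate)

# The Dataset yields normalized float32 data, so normalize the reference too.
expected = normalize_wav(data_int16)
```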
