
Improve tests for Dataset #821

@mthrok

Description


torchaudio has had only minimalistic tests for its dataset implementations (see the existing test module).

Recently we have improved our test utilities, and we can now generate synthetic data that emulates a subset of a dataset. See the YesNo and GTZAN examples.

We would like to do the same for the remaining datasets:

  • VCTK
  • LibriSpeech
  • LJSpeech
  • SpeechCommands
  • CMUArctic
  • CommonVoice

General Direction

  1. Check the dataset of interest and pick a subset of files (check their naming conventions, sampling rate, and number of channels).

  2. Following the approach of the existing test modules, create a new test module test/datasets/XXX_test.py and define your test class (a sketch of such a module appears after this list).

  3. Generate a pseudo dataset in the setUpClass method and create a list of the expected data.

  4. Traverse the directory with the Dataset implementation.

  5. Check that the files are traversed in the expected order and that the loaded data match the expected data.

  6. Check that the Dataset traversed the expected number of files.

  7. If the dataset has multiple operational modes, like subset in GTZAN, also add these as test methods.

  8. Once the new test is added, remove the original test and the associated assets:

    test/assets/ARCTIC/cmu_us_aew_arctic/etc/txt.done.data
    test/assets/ARCTIC/cmu_us_aew_arctic/wav/arctic_a0024.wav
    test/assets/CommonVoice/cv-corpus-4-2019-12-10/tt/clips/common_voice_tt_00000000.wav
    test/assets/CommonVoice/cv-corpus-4-2019-12-10/tt/train.tsv
    test/assets/LJSpeech-1.1/metadata.csv
    test/assets/LJSpeech-1.1/wavs/LJ001-0001.wav
    test/assets/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac
    test/assets/LibriSpeech/dev-clean/1272/128104/1272-128104.trans.txt
    test/assets/SpeechCommands/speech_commands_v0.02/go/0a9f9af7_nohash_0.wav
    test/assets/VCTK-Corpus/txt/p224/p224_002.txt
    test/assets/VCTK-Corpus/wav48/p224/p224_002.wav
    
  9. Once the PR is ready, add @mthrok as a reviewer.
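
A minimal sketch of such a test module, modeled on the YesNo example, is shown below. The helper names (TempDirMixin, TorchaudioTestCase, get_whitenoise, save_wav, normalize_wav), their import path, and the assertion tolerances are assumptions based on the existing YesNo and GTZAN tests; the labels and file layout are illustrative, not a prescription for any particular dataset.

```python
import os

from torchaudio.datasets import yesno

# Assumed to be the shared test utilities used by the YesNo / GTZAN tests.
from ..common_utils import (
    TempDirMixin,
    TorchaudioTestCase,
    get_whitenoise,
    save_wav,
    normalize_wav,
)


class TestYesNo(TempDirMixin, TorchaudioTestCase):
    backend = 'default'

    root_dir = None
    data = []
    # In the YesNo dataset the labels are encoded in the file names.
    labels = [
        [0, 0, 1, 1, 0, 1, 0, 1],
        [1, 0, 0, 1, 0, 1, 1, 0],
    ]

    @classmethod
    def setUpClass(cls):
        cls.root_dir = cls.get_base_temp_dir()
        base_dir = os.path.join(cls.root_dir, 'waves_yesno')
        os.makedirs(base_dir, exist_ok=True)
        for i, label in enumerate(cls.labels):
            filename = f'{"_".join(str(v) for v in label)}.wav'
            path = os.path.join(base_dir, filename)
            # Generate synthetic audio and save it as 16-bit wav,
            # matching the format of the reference dataset.
            data = get_whitenoise(
                sample_rate=8000, duration=6, n_channels=1, dtype='int16', seed=i)
            save_wav(path, data, 8000)
            # The Dataset yields normalized float32 data, so keep the
            # normalized waveform as the expected value.
            cls.data.append(normalize_wav(data))

    def test_yesno(self):
        dataset = yesno.YESNO(self.root_dir)
        n_items = 0
        for i, (waveform, sample_rate, label) in enumerate(dataset):
            # Files must be traversed in the expected order,
            # and the loaded data must match the synthetic data.
            self.assertEqual(waveform, self.data[i], atol=5e-5, rtol=1e-8)
            assert sample_rate == 8000
            assert label == self.labels[i]
            n_items += 1
        # The Dataset must yield exactly the expected number of files.
        assert n_items == len(self.labels)
```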

Note

  • It is highly recommended to use Anaconda.
  • Please use a nightly build of PyTorch. https://pytorch.org/
  • You can run the tests with pytest test/datasets/XXX_test.py.
  • PR example: Make GTZAN dataset sorted and use on-the-fly data in GTZAN test #819
  • For simplicity, please use the wav format when saving synthetic data (save_wav), even if the reference dataset uses another format (decoding formats like mp3 adds complexity to the test logic, which we are trying to avoid).
  • When saving wave data with save_wav, the dtype of the Tensor makes a difference. If the reference dataset uses the WAV format, use the same bit depth (e.g. int16). If the reference dataset uses a compressed format, like mp3 or flac, use float32 wav (see the snippet below).
  • Data loaded with a Dataset implementation is typically normalized (values in [-1.0, 1.0]) and of float32 type (which is why normalize_wav is used to generate the reference data in the examples above).
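
As an illustration of the last three points, here is a hedged snippet (again assuming the same get_whitenoise, save_wav, and normalize_wav utilities as in the sketch above; the file names are made up):

```python
# Inside a test module under test/datasets/.
from ..common_utils import get_whitenoise, save_wav, normalize_wav

sample_rate = 16000

# Reference dataset ships 16-bit WAV files -> save the synthetic data as int16.
data_int16 = get_whitenoise(
    sample_rate=sample_rate, duration=1, n_channels=1, dtype='int16', seed=0)
save_wav('sample_from_wav_dataset.wav', data_int16, sample_rate)

# Reference dataset ships mp3/flac -> still save wav, but with float32 samples.
data_float32 = get_whitenoise(
    sample_rate=sample_rate, duration=1, n_channels=1, dtype='float32', seed=1)
save_wav('sample_from_flac_dataset.wav', data_float32, sample_rate)

# The Dataset yields normalized float32 data, so normalize the reference too.
expected = normalize_wav(data_int16)
```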
