
Low WER training pipeline in torchaudio with wav2letter #913

@vincentqb

Description

torchaudio is targeting speech recognition as a full audio application (internal). Along these lines, we implemented a wav2letter pipeline that obtains a low character error rate (CER). We want to expand on this and showcase a new pipeline that also achieves a low word error rate (WER). To that end, we are considering the following additions to torchaudio, listed from highest to lowest priority.

Token Decoder: Add a lexicon-constrained beam search algorithm based on fairseq (search class, sequence generator), since it is torchscriptable.
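To make the decoding step concrete, here is a minimal pure-Python sketch of beam search over per-frame CTC log-probabilities, followed by the standard CTC collapse (merge repeats, drop blanks). This is a simplified illustration only: a real lexicon-constrained decoder, like the fairseq one referenced above, would additionally merge hypotheses that collapse to the same prefix and prune against a lexicon trie and a language model. The function names here are made up for the example.

```python
import heapq

def beam_search_paths(log_probs, beam_size=4):
    # Keep the beam_size highest-scoring alignment paths, frame by frame.
    # log_probs: T x C list of per-frame log-probabilities.
    beams = [(0.0, ())]  # (cumulative log prob, path of token ids)
    for frame in log_probs:
        cand = [(lp + p, path + (c,))
                for lp, path in beams
                for c, p in enumerate(frame)]
        beams = heapq.nlargest(beam_size, cand)
    return beams  # sorted best-first

def collapse(path, blank=0):
    # CTC collapse rule: merge consecutive repeats, then drop blanks.
    out, prev = [], None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return out
```

For example, with token 0 as the blank, the best path through four frames might be `(0, 1, 1, 2)`, which collapses to the label sequence `[1, 2]`.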

Acoustic Model: Add a transformer-based acoustic model, e.g. speech-transformer (comparison).
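The core building block of any transformer acoustic model is scaled dot-product attention over the frame sequence. As a reference point, here is a dependency-free sketch of that operation on plain Python lists; an actual model would of course use batched tensors, multiple heads, and learned projections (e.g. via `torch.nn.TransformerEncoder`), none of which are shown here.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    # Scaled dot-product attention: each query attends over all keys,
    # producing a weighted average of the value vectors.
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```

When all keys are identical, every query attends uniformly and the output is simply the mean of the value vectors, which is a quick sanity check on the implementation.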

Language Model: Add KenLM to use a 4-gram language model trained on the LibriSpeech Language Model corpus, as done in the paper.
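The role of the language model in the decoder is shallow fusion: the decoder's hypothesis score becomes the acoustic score plus a weighted LM score. As an illustration of the scoring interface (not the KenLM API itself), here is a toy bigram model with add-one smoothing standing in for the 4-gram KenLM model; all names here are invented for the sketch.

```python
import math

class BigramLM:
    # Toy bigram LM with add-one smoothing; a stand-in for a KenLM 4-gram
    # model, which exposes a similar sentence-scoring interface.
    def __init__(self, corpus):
        self.uni, self.bi, self.vocab = {}, {}, set()
        for sent in corpus:
            toks = ['<s>'] + sent
            for a, b in zip(toks, toks[1:]):
                self.uni[a] = self.uni.get(a, 0) + 1
                self.bi[(a, b)] = self.bi.get((a, b), 0) + 1
                self.vocab.update((a, b))

    def logp(self, a, b):
        V = len(self.vocab) + 1  # +1 for unseen tokens
        return math.log((self.bi.get((a, b), 0) + 1) /
                        (self.uni.get(a, 0) + V))

    def score(self, sent):
        toks = ['<s>'] + sent
        return sum(self.logp(a, b) for a, b in zip(toks, toks[1:]))

def fused_score(am_score, lm, words, lm_weight=0.5):
    # Shallow fusion: acoustic score plus weighted LM score.
    return am_score + lm_weight * lm.score(words)
```

A hypothesis whose word order matches the training text should score higher under the LM than a scrambled one, which is exactly the signal the decoder uses to pick among acoustically similar beams.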

Training Loss: Add the RNN Transducer loss as a replacement for the CTC loss in the pipeline.
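For context on what the RNN Transducer loss would replace: CTC marginalizes over all alignments of the target to the frames via a forward dynamic program over a blank-extended target. Below is a minimal pure-Python sketch of that forward pass (assuming a non-empty target); in practice one would use `torch.nn.CTCLoss`, and the RNN-T loss generalizes this recursion to a 2-D lattice over frames and emitted labels.

```python
import math

def ctc_loss(log_probs, target, blank=0):
    # log_probs: T x C per-frame log-probabilities; target: non-empty label list.
    # Extended target interleaves blanks: [b, y1, b, y2, b, ...].
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S, NEG = len(ext), float('-inf')

    def logsumexp(*xs):
        xs = [x for x in xs if x != NEG]
        if not xs:
            return NEG
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    # Initialize: may start on the leading blank or the first label.
    alpha = [NEG] * S
    alpha[0] = log_probs[0][ext[0]]
    alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG] * S
        for s in range(S):
            cand = [alpha[s]]                     # stay
            if s > 0:
                cand.append(alpha[s - 1])         # advance one step
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cand.append(alpha[s - 2])         # skip a blank
            new[s] = logsumexp(*cand) + log_probs[t][ext[s]]
        alpha = new
    # Valid endings: final blank or final label.
    return -logsumexp(alpha[S - 1], alpha[S - 2])
```

With a single frame and a single-label target, the only valid alignment emits that label, so the loss reduces to the label's negative log-probability.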

Transformations: SpecAugment is already available in the wav2letter pipeline.
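For readers unfamiliar with SpecAugment: it regularizes training by zeroing a random band of mel-frequency bins and a random span of time frames in each spectrogram. A minimal sketch of those two masks on a plain time-by-frequency grid (torchaudio's actual implementations are `FrequencyMasking` and `TimeMasking`; the function name and mask-width defaults here are invented for the example):

```python
import random

def spec_augment(spec, freq_mask=2, time_mask=2, rng=None):
    # spec: list of frames, each a list of mel-bin values (time x freq).
    # Zeroes one random frequency band and one random time band.
    rng = rng or random.Random(0)
    T, F = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    f0 = rng.randrange(0, F - freq_mask + 1)
    for t in range(T):
        for f in range(f0, f0 + freq_mask):
            out[t][f] = 0.0  # frequency mask
    t0 = rng.randrange(0, T - time_mask + 1)
    for t in range(t0, t0 + time_mask):
        out[t] = [0.0] * F   # time mask
    return out
```

On a 4x4 grid of ones, a 2-bin frequency mask and a 2-frame time mask zero 12 cells in total (8 + 8, minus the 4-cell overlap), leaving 4 unmasked values.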

See also internal

cc @astaff @dongreenberg @cpuhrsch
