Description
torchaudio is targeting speech recognition as a full audio application (internal). Along these lines, we implemented the wav2letter pipeline to obtain a low character error rate (CER). We want to expand on this and showcase a new pipeline that also achieves a low word error rate (WER). To do so, we are considering the following additions to torchaudio, listed from highest to lowest priority.
Token Decoder: Add a lexicon-constrained beam search algorithm based on fairseq (search class, sequence generator), since it is torchscriptable.
- Links: fairseq, wav2letter, ParlAI, user repository, caffe2, pyspeech.
- Kaldi-related: Kaldi's Viterbi beam search (SimpleDecoder), internal, and OpenFst.
- Other domains: paper for vision
- Related algorithm: FAISS's top-k selection (internal, paper, and GitHub)
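Since the links above only point outward, here is a minimal pure-Python sketch of what a lexicon-constrained beam search does: hypotheses may only grow their current partial word along prefixes that exist in the lexicon. The prefix-set "trie", the scoring scheme, and the toy alphabet are illustrative assumptions, not fairseq's actual SequenceGenerator.

```python
def build_prefixes(lexicon):
    """Collect every character prefix of every lexicon word (a flat stand-in for a trie)."""
    prefixes = set()
    for word in lexicon:
        for i in range(1, len(word) + 1):
            prefixes.add(word[:i])
    return prefixes

def lexicon_beam_search(log_probs, alphabet, lexicon, beam_size=4):
    """Frame-synchronous beam search over per-frame token log-probabilities.
    A hypothesis is (score, finished_words, current_partial_word); extensions
    that leave the lexicon's prefix set are pruned immediately."""
    prefixes = build_prefixes(lexicon)
    words = set(lexicon)
    beams = [(0.0, (), "")]
    for frame in log_probs:
        candidates = []
        for score, done, partial in beams:
            for idx, lp in enumerate(frame):
                ch = alphabet[idx]
                if ch == " ":
                    # A space may only close a complete lexicon word.
                    if partial in words:
                        candidates.append((score + lp, done + (partial,), ""))
                else:
                    new_partial = partial + ch
                    if new_partial in prefixes:
                        candidates.append((score + lp, done, new_partial))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size] or beams
    # Keep only hypotheses that end on a word boundary.
    final = []
    for score, done, partial in beams:
        if partial in words:
            final.append((score, done + (partial,)))
        elif not partial:
            final.append((score, done))
    final.sort(key=lambda c: c[0], reverse=True)
    return " ".join(final[0][1]) if final else ""
```

Note how the lexicon constraint shrinks the search space per frame: only in-vocabulary prefixes survive, which is what makes a word-level language model easy to attach later.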
Acoustic Model: Add a transformer-based acoustic model, e.g. speech-transformer, comparison.
- wav2vec 2.0 (paper, fairseq's GitHub) combines a transformer with lexicon-constrained beam search
- multi-speaker, streaming (paper, post)
Language Model: Add KenLM support to use a 4-gram language model based on the LibriSpeech language model corpus, as done in the paper.
- wav2letter interfaces with both KenLM and ConvLM
- This could land in torchtext.
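To illustrate the scoring interface such a language model would expose, here is a toy add-one-smoothed n-gram model in pure Python. This is only an illustration: KenLM itself uses modified Kneser-Ney smoothing and ARPA/binary model files, and the class name and toy corpus here are made up.

```python
import math
from collections import Counter

class NGramLM:
    """Toy add-one-smoothed n-gram language model (illustrative stand-in
    for a KenLM-backed model; KenLM uses modified Kneser-Ney smoothing)."""
    def __init__(self, sentences, n=2):
        self.n = n
        self.ngrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for s in sentences:
            tokens = ["<s>"] * (n - 1) + s.split() + ["</s>"]
            self.vocab.update(tokens)
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def score(self, sentence):
        """Log10 probability of a full sentence (the convention KenLM's
        scorer uses); higher is more fluent."""
        tokens = ["<s>"] * (self.n - 1) + sentence.split() + ["</s>"]
        total, v = 0.0, len(self.vocab)
        for i in range(len(tokens) - self.n + 1):
            gram = tuple(tokens[i:i + self.n])
            p = (self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + v)
            total += math.log10(p)
        return total

lm = NGramLM(["the cat sat", "the cat ran"], n=2)
assert lm.score("the cat sat") > lm.score("sat the cat")
```

In decoding, this score is added (with a tunable weight) to the acoustic score each time a beam hypothesis closes a word, which is how the LM steers the search toward fluent transcripts.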
Training Loss: Add the RNN Transducer loss to replace the CTC loss in the pipeline.
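For context on what swapping CTC for the RNN Transducer loss entails, here is a toy NumPy version of the RNN-T forward algorithm over the (T, U+1) lattice: blank emissions advance time, label emissions advance the label position. This is a reference sketch of the standard RNN-T factorization, not the optimized loss kernel that would actually ship.

```python
import numpy as np

def rnnt_loss(log_probs, labels, blank=0):
    """Negative log-likelihood of the RNN Transducer.
    log_probs: shape (T, U+1, V) -- joint-network log-probabilities at each
    (time t, label position u). Emitting blank moves t -> t+1; emitting
    labels[u] moves u -> u+1. Computed by forward recursion in log space."""
    T, U1, V = log_probs.shape
    U = len(labels)
    assert U1 == U + 1
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting blank at (t-1, u)
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # arrive by emitting label u at (t, u-1)
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
    # Terminate with a final blank from the top-right lattice corner.
    return -(alpha[T - 1, U] + log_probs[T - 1, U, blank])
```

Unlike CTC, the joint network conditions each emission on the label history (the u axis), which is the property that usually buys the WER improvement and enables streaming decoding.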
Transformations: SpecAugment is already available in the wav2letter pipeline.
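For completeness, SpecAugment's masking step is small enough to sketch directly: zero out a random band of frequency bins and a random span of time frames on the spectrogram. This is a minimal NumPy illustration with made-up default mask widths; torchaudio also provides FrequencyMasking and TimeMasking transforms for this.

```python
import numpy as np

def spec_augment(spec, freq_mask=8, time_mask=10, rng=None):
    """Apply one random frequency mask and one random time mask to a
    (freq_bins, time_frames) spectrogram, returning a masked copy.
    Mask widths are drawn uniformly from [0, freq_mask] / [0, time_mask]."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape
    f = rng.integers(0, freq_mask + 1)          # band height
    f0 = rng.integers(0, n_freq - f + 1)        # band start
    spec[f0:f0 + f, :] = 0.0
    t = rng.integers(0, time_mask + 1)          # span width
    t0 = rng.integers(0, n_time - t + 1)        # span start
    spec[:, t0:t0 + t] = 0.0
    return spec
```

Because the masks hide contiguous regions rather than random points, the model is pushed to use broader temporal and spectral context, which is why SpecAugment helps WER with no extra data.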
See also internal