Description
torchaudio is targeting speech recognition as a full audio application (internal). Along these lines, we implemented the wav2letter pipeline to obtain a low character error rate (CER). We want to expand on this and showcase a new pipeline that also achieves a low word error rate (WER). To do so, we are considering the following additions to torchaudio, listed from highest to lowest priority.
Token Decoder: Add a lexicon-constrained beam search algorithm based on fairseq (search class, sequence generator), since it is torchscriptable.
- Links: fairseq, wav2letter, ParlAI, user repository, caffe2, pyspeech.
- Kaldi-related: Kaldi's Viterbi beam search (SimpleDecoder), internal, and OpenFst.
- Other domains: paper for vision
- Related algorithm: FAISS's top-k selection (internal, paper, and GitHub)
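Since the links above only point outward, here is a minimal pure-Python sketch of what a lexicon-constrained beam search does: hypotheses may only grow their current partial word along prefixes that exist in the lexicon. The prefix-set "trie", the scoring scheme, and the toy alphabet are illustrative assumptions, not fairseq's actual SequenceGenerator.

```python
def build_prefixes(lexicon):
    """Collect every character prefix of every lexicon word (a flat stand-in for a trie)."""
    prefixes = set()
    for word in lexicon:
        for i in range(1, len(word) + 1):
            prefixes.add(word[:i])
    return prefixes

def lexicon_beam_search(log_probs, alphabet, lexicon, beam_size=4):
    """Frame-synchronous beam search over per-frame token log-probabilities.
    A hypothesis is (score, finished_words, current_partial_word); extensions
    that leave the lexicon's prefix set are pruned immediately."""
    prefixes = build_prefixes(lexicon)
    words = set(lexicon)
    beams = [(0.0, (), "")]
    for frame in log_probs:
        candidates = []
        for score, done, partial in beams:
            for idx, lp in enumerate(frame):
                ch = alphabet[idx]
                if ch == " ":
                    # A space may only close a complete lexicon word.
                    if partial in words:
                        candidates.append((score + lp, done + (partial,), ""))
                else:
                    new_partial = partial + ch
                    if new_partial in prefixes:
                        candidates.append((score + lp, done, new_partial))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size] or beams
    # Keep only hypotheses that end on a word boundary.
    final = []
    for score, done, partial in beams:
        if partial in words:
            final.append((score, done + (partial,)))
        elif not partial:
            final.append((score, done))
    final.sort(key=lambda c: c[0], reverse=True)
    return " ".join(final[0][1]) if final else ""
```

Note how the lexicon constraint shrinks the search space per frame: only in-vocabulary prefixes survive, which is what makes a word-level language model easy to attach later.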
Acoustic Model: Add a transformer-based acoustic model, e.g. speech-transformer, comparison.
- wav2vec 2.0 (paper, fairseq's GitHub) combines a transformer with lexicon-constrained beam search
- multi-speaker, streaming (paper, post)
Language Model: Add KenLM support to use a 4-gram language model based on the LibriSpeech language model corpus, as done in the paper.
- wav2letter interfaces with both KenLM and ConvLM
- This could land in torchtext.
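To illustrate the scoring interface such a language model would expose, here is a toy add-one-smoothed n-gram model in pure Python. This is only an illustration: KenLM itself uses modified Kneser-Ney smoothing and ARPA/binary model files, and the class name and toy corpus here are made up.

```python
import math
from collections import Counter

class NGramLM:
    """Toy add-one-smoothed n-gram language model (illustrative stand-in
    for a KenLM-backed model; KenLM uses modified Kneser-Ney smoothing)."""
    def __init__(self, sentences, n=2):
        self.n = n
        self.ngrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for s in sentences:
            tokens = ["<s>"] * (n - 1) + s.split() + ["</s>"]
            self.vocab.update(tokens)
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def score(self, sentence):
        """Log10 probability of a full sentence (the convention KenLM's
        scorer uses); higher is more fluent."""
        tokens = ["<s>"] * (self.n - 1) + sentence.split() + ["</s>"]
        total, v = 0.0, len(self.vocab)
        for i in range(len(tokens) - self.n + 1):
            gram = tuple(tokens[i:i + self.n])
            p = (self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + v)
            total += math.log10(p)
        return total

lm = NGramLM(["the cat sat", "the cat ran"], n=2)
assert lm.score("the cat sat") > lm.score("sat the cat")
```

In decoding, this score is added (with a tunable weight) to the acoustic score each time a beam hypothesis closes a word, which is how the LM steers the search toward fluent transcripts.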
Training Loss: Add the RNN Transducer loss to replace the CTC loss in the pipeline.
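For context on what swapping CTC for the RNN Transducer loss entails, here is a toy NumPy version of the RNN-T forward algorithm over the (T, U+1) lattice: blank emissions advance time, label emissions advance the label position. This is a reference sketch of the standard RNN-T factorization, not the optimized loss kernel that would actually ship.

```python
import numpy as np

def rnnt_loss(log_probs, labels, blank=0):
    """Negative log-likelihood of the RNN Transducer.
    log_probs: shape (T, U+1, V) -- joint-network log-probabilities at each
    (time t, label position u). Emitting blank moves t -> t+1; emitting
    labels[u] moves u -> u+1. Computed by forward recursion in log space."""
    T, U1, V = log_probs.shape
    U = len(labels)
    assert U1 == U + 1
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting blank at (t-1, u)
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # arrive by emitting label u at (t, u-1)
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
    # Terminate with a final blank from the top-right lattice corner.
    return -(alpha[T - 1, U] + log_probs[T - 1, U, blank])
```

Unlike CTC, the joint network conditions each emission on the label history (the u axis), which is the property that usually buys the WER improvement and enables streaming decoding.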
Transformations: SpecAugment is already available in the wav2letter pipeline.
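For completeness, SpecAugment's masking step is small enough to sketch directly: zero out a random band of frequency bins and a random span of time frames on the spectrogram. This is a minimal NumPy illustration with made-up default mask widths; torchaudio also provides FrequencyMasking and TimeMasking transforms for this.

```python
import numpy as np

def spec_augment(spec, freq_mask=8, time_mask=10, rng=None):
    """Apply one random frequency mask and one random time mask to a
    (freq_bins, time_frames) spectrogram, returning a masked copy.
    Mask widths are drawn uniformly from [0, freq_mask] / [0, time_mask]."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape
    f = rng.integers(0, freq_mask + 1)          # band height
    f0 = rng.integers(0, n_freq - f + 1)        # band start
    spec[f0:f0 + f, :] = 0.0
    t = rng.integers(0, time_mask + 1)          # span width
    t0 = rng.integers(0, n_time - t + 1)        # span start
    spec[:, t0:t0 + t] = 0.0
    return spec
```

Because the masks hide contiguous regions rather than random points, the model is pushed to use broader temporal and spectral context, which is why SpecAugment helps WER with no extra data.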
See also internal