Roadmap ahead for torchaudio #1196

@vincentqb

Many exciting work items are planned for torchaudio.

  • Provide support for large scale training.
    • Support a large-scale training reference task using wav2vec on librivox, and offer a pre-trained version of the model.
    • Support the emergence of audio-specific transformer models by exploring which abstractions would be beneficial to provide.
  • Extend support for speech recognition.
    • Investigate adding beam search and a 4-gram language model, see here and here, to reduce the word error rate of the existing pipeline.
    • ✅ Support in-memory codec encoding and decoding, see here, to support codec based data augmentation.
    • ✅ Add the Kaldi pitch feature, see here, that is used in the audio community.
    • Implement a prototype of a WFST-based ASR model, using GTN or K2, see here.
    • Add RNN transducer loss, see here and follow-up, to train RNN transducer models efficiently.
  • Provide a high-performance data loading and media decoding experience.
    • Provide a fast audio I/O module, see here.
    • Provide audio streaming abstractions with examples, see here.
  • Improve our codebase.
    • ✅ Create libtorchaudio by building the C++ extension outside of Python, see here.
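
For readers less familiar with the metric named in the beam search item above, word error rate is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch in plain Python with no torchaudio dependency. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; they are not part of torchaudio or of the existing pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration only; not a torchaudio API.
    Tokenization is a plain whitespace split (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Calling `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words. A decoder change such as beam search with a language model would be judged by whether this ratio drops on a held-out set.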

The goal of torchaudio is to accelerate research through novel, production-ready building blocks. As such, we would love to hear feedback on the plan, so make sure to reach out to us, @mthrok and @vincentqb!

cc internal
