Background
Fast audio loading is critical to audio applications, and even more so for music data, which differs from speech data in the following ways.
- Higher sampling rates, such as 44.1 kHz and 48 kHz. (8 kHz and 16 kHz are typical choices in speech.)
- Long duration, such as 5 minutes. (An utterance in speech is typically only a few seconds.)
- Training with random seeks is the de facto standard for tasks like source separation, VAD, diarization, and speaker/language identification.
- Audio data is more often in formats other than WAV (mp3, aac, flac, ogg, etc.).
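To put rough numbers on the points above, the following sketch compares frame counts for a music track and a speech utterance, and shows how small a randomly seeked training segment is relative to the full track. The 3-second utterance and the 2-second training segment are illustrative assumptions, not figures from this proposal.

```python
def num_frames(duration_sec: float, sample_rate: int) -> int:
    """Number of frames (samples per channel) in a clip of the given length."""
    return int(round(duration_sec * sample_rate))


# Figures from the list above: a 5-minute track at 44.1 kHz.
track_frames = num_frames(5 * 60, 44_100)

# Assumption for illustration: a 3-second utterance at 16 kHz.
utterance_frames = num_frames(3.0, 16_000)

# Assumption for illustration: training on a random 2-second segment.
segment_frames = num_frames(2.0, 44_100)

# Fraction of the track that one training segment actually needs.
fraction_needed = segment_frames / track_frames

print(track_frames, utterance_frames, segment_frames, fraction_needed)
```

Under these assumptions, a loader that decodes the whole track to extract one segment does about 150 times more decoding work than necessary, which is why seeking support matters for music data.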
Proposal
Add a new I/O scheme to torchaudio that uses libraries offering faster decoding, wide codec coverage, and portability across OSs (Linux / macOS / Windows).
Currently torchaudio binds libsox. Libsox is not supported on Windows. There are a variety of decoding libraries that we can take advantage of. These include
- minimp3 (CC0-1.0 License)
  Fast mp3 decoding library.
- minimp4 (CC0-1.0 License)
  Similar to minimp3, an MP4 decoding library by the same author.
- libsndfile (LGPL-2.1 License)
  Fast for wav format.
  Also handles flac, ogg/vorbis.
- SpeeXDSP (License)
  Resampling.
- (Optionally) ffmpeg (libavcodec) (LGPL v2.1+, MIT/X11/BSD etc)
  Covers a much wider range of codecs, with higher decode/encode quality, but not as fast.
  Can handle AAC format (in addition to what is already listed above) and a lot more.
Unlike the existing torchaudio backends, which implement the same generic interfaces, the new I/O will provide one unified Python interface on all supported platforms (Linux / macOS / Windows) and delegate library selection to the underlying C++ implementation.
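As a rough illustration of what "delegating library selection" could look like, here is a minimal Python sketch that maps a format name to one of the libraries listed above. The function name `select_decoder`, the mapping table, and the extension-based detection are all hypothetical; the proposal places this logic in C++ and does not specify how the choice is made.

```python
import os
from typing import Optional

# Hypothetical mapping from format name to the libraries listed above.
# The actual selection would live in the C++ layer, not in Python.
_DECODER_BY_FORMAT = {
    "mp3": "minimp3",
    "mp4": "minimp4",
    "wav": "libsndfile",
    "flac": "libsndfile",
    "ogg": "libsndfile",
}

# Optional fallback with the widest codec coverage (e.g. AAC).
_FALLBACK_DECODER = "ffmpeg"


def select_decoder(path: str, format: Optional[str] = None) -> str:
    """Pick a decoder name from an explicit format or the file extension."""
    if format is None:
        # Naive detection by extension; a real implementation may sniff headers.
        format = os.path.splitext(path)[1].lstrip(".")
    return _DECODER_BY_FORMAT.get(format.lower(), _FALLBACK_DECODER)


print(select_decoder("song.mp3"))
print(select_decoder("clip.aac"))
print(select_decoder("data.bin", format="flac"))
```

Keeping this choice behind a single entry point is what lets the Python interface stay identical across platforms, regardless of which libraries are available in a given build.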
Benchmarks for some of these libraries are available at https://github.com/faroit/python_audio_loading_benchmark . (Thanks @faroit !)
Non-Goals
In-memory decoding
In-memory decoding support would be nice to have, but we do not currently know whether it is possible to pass memory objects from Python to C++ via TorchScript. For simplicity, this feature is excluded from the scope of this proposal. For a Python-only solution, see #800 for the gist.
Streaming decoding
Streaming decoding support will be critical for real-time applications. However, it is difficult to design real-time decoding as a stand-alone module, because the design of the downstream processes (preprocessing, feeding data to the NN, and using the result) is tightly coupled to the upstream I/O mechanism. Streaming decoding is therefore excluded from this proposal.
Effects (filterings)
ffmpeg supports filters, as libsox does. We could make them available too, but this is outside the scope of fast audio loading.
Interface
Python frontend to the C++ interface. No (significant) logic should happen here.
# in torchaudio.io module
# we can call this from `torchaudio.load` too.
from typing import Optional, Tuple

from torch import Tensor


def load_audio_from_file(
    path: str,
    *,
    offset: Optional[float] = None,
    duration: Optional[float] = None,
    sample_rate: Optional[float] = None,
    normalize: bool = True,
    channels_first: bool = True,
    offset_unit: str = "second",
    format: Optional[str] = None,
) -> Tuple[Tensor, float]:  # returned as a namedtuple (waveform, sample_rate)
    """Load audio from file

    Args:
        path (str or pathlib.Path):
            Path to the audio file.
        offset (float, optional):
            Offset of reading, in the unit provided as `offset_unit`.
            Defaults to the beginning of the audio file.
        duration (float, optional):
            Duration of reading, in the unit provided as `offset_unit`.
            Defaults to the rest of the audio file.
        sample_rate (float, optional):
            When provided, the audio is resampled.
        normalize (bool, optional):
            When `True`, this function always returns `float32`, and
            sample values are normalized to `[-1.0, 1.0]`.
            If the input file is integer WAV, giving `False` will change the
            resulting Tensor type to an integer type.
            This argument has no effect for formats other than
            integer WAV type.
        channels_first (bool, optional):
            When `True`, the returned Tensor has dimension
            `[channel, time]`, otherwise `[time, channel]`.
        offset_unit (str, optional):
            The unit of `offset` and `duration`.
            `"second"` or `"frame"` (defaults to `"second"`).
        format (str, optional):
            Override the format detection.

    Returns:
        namedtuple `(waveform, sample_rate)`:
            `waveform` is a Tensor type, and `sample_rate` is float.
    """

Example Usage (Python)
import torchaudio

# Load the entire file
waveform, sample_rate = torchaudio.io.load_audio_from_file(
    "foobar.wav",
)

# Load the segment at 1.0 - 3.0 seconds
waveform, sample_rate = torchaudio.io.load_audio_from_file(
    "foobar.wav",
    offset=1.0,
    duration=2.0,
)

# Load the entire file, resample it to 8000
waveform, sample_rate = torchaudio.io.load_audio_from_file(
    "foobar.wav",
    sample_rate=8000,
)

# Load the segment at 1.0 - 3.0 seconds, resample it to 8000
waveform, sample_rate = torchaudio.io.load_audio_from_file(
    "foobar.wav",
    offset=1.0,
    duration=2.0,
    sample_rate=8000,
)

FAQ
Will the proposed API replace the current torchaudio.load?
No, this proposal does not remove torchaudio.load or ask users to migrate to the new API. Instead, torchaudio.load will make use of the proposed API. (The details of how are TBD.)
When we consider supporting other types of I/O, such as memory objects, file-like objects, or streaming objects, we will design those APIs separately and plug them into torchaudio.load.
This way, we decouple the concerns and requirements, yet are able to extend the functionality.
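Since the delegation details are TBD, the following is only one possible shape, sketched under assumptions. It supposes that `load` keeps a frame-based signature (`frame_offset`, `num_frames`, as in the existing sox_io backend) and forwards to the proposed function with `offset_unit="frame"`. The stand-in `_load_audio_from_file_stub` is hypothetical and merely records its arguments so the sketch runs without any decoding library.

```python
from typing import Any, Dict, Optional


def _load_audio_from_file_stub(
    path: str,
    *,
    offset: Optional[float] = None,
    duration: Optional[float] = None,
    offset_unit: str = "second",
) -> Dict[str, Any]:
    """Hypothetical stand-in for the proposed API: records what it was asked."""
    return {
        "path": path,
        "offset": offset,
        "duration": duration,
        "offset_unit": offset_unit,
    }


def load(filepath: str, frame_offset: int = 0, num_frames: int = -1) -> Dict[str, Any]:
    """Sketch of a frame-based `load` that delegates to the proposed function."""
    return _load_audio_from_file_stub(
        filepath,
        offset=float(frame_offset),
        # In this sketch a negative frame count means "read to the end".
        duration=None if num_frames < 0 else float(num_frames),
        offset_unit="frame",
    )


print(load("foobar.wav"))
print(load("foobar.wav", frame_offset=44100, num_frames=88200))
```

In this arrangement the public entry point stays stable for existing users, while other input types (memory objects, file-like objects, streams) could be routed to their own separately designed functions.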