Hi,
I'm trying to run some experiments with the Kaldi-compliant MFCCs, but I have run into a few possible issues:
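1- When I run the following code:
import soundfile as sf
import torch
from torchaudio.compliance.kaldi import mfcc  # assumed import path for the Kaldi-compliant mfcc
file = '/home/mirco/datasets/TIMIT/test/dr5/fnlp0/si1308.wav'
signal, fs = sf.read(file)
signal = torch.from_numpy(signal).unsqueeze(0).float()
fea = mfcc(signal)
print(fea)
The MFCCs are different every time I run the script: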
run 1
tensor([[ 29.2496, -32.6150, -7.1791, ..., -6.2034, -5.8100, 3.5894],
[ 28.1680, -35.9921, -8.5621, ..., -13.5980, -4.2804, -8.8075],
[ 29.2831, -31.8580, -8.8565, ..., -5.8166, -4.2538, 6.4913],
...,
[ 27.5078, -36.1139, -12.1319, ..., -11.6493, 0.2557, -4.9566],
[ 28.9667, -33.5803, -6.6644, ..., -6.1208, 2.7111, 2.7867],
[ 28.6988, -33.6590, -12.0312, ..., -3.0909, -0.0643, -4.1769]])
run 2
tensor([[ 27.8255, -33.2356, -8.8006, ..., -13.2640, 1.0311, 4.8004],
[ 29.5605, -34.0147, -10.3465, ..., -4.0096, -1.5156, -3.2499],
[ 29.3978, -31.3415, -6.4141, ..., 10.6100, 2.3651, 6.1324],
...,
[ 29.4321, -33.0013, -11.9812, ..., -2.9076, 6.3498, 1.8854],
[ 28.2726, -34.0620, -9.5291, ..., -5.4033, 6.0385, -0.1867],
[ 29.5408, -33.7757, -9.1063, ..., 5.8796, -6.1365, -3.2730]])
This is due to dithering, which adds random noise to the signal. The problem is easily solved by setting a manual seed (torch.manual_seed) before executing the code. Users should be made aware of this to avoid confusion; it would be great to provide an example in the documentation.
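The effect of seeding can be illustrated with a minimal sketch. The `add_dither` helper below is hypothetical: it only stands in for the dithering step by adding Gaussian noise, and it is not the torchaudio implementation. The all-zero waveform is a placeholder for a loaded signal.

```python
import torch


def add_dither(signal: torch.Tensor, dither: float = 1.0) -> torch.Tensor:
    """Hypothetical stand-in for dithering: add Gaussian noise scaled by `dither`."""
    return signal + dither * torch.randn_like(signal)


# Placeholder waveform of shape (channels, samples); a real signal would be loaded from disk.
signal = torch.zeros(1, 16000)

# Re-seeding before each call makes the random noise, and hence the result, identical.
torch.manual_seed(0)
first = add_dither(signal)
torch.manual_seed(0)
second = add_dither(signal)

# Without re-seeding, the generator has advanced and the noise differs.
unseeded = add_dither(signal)

print(torch.equal(first, second))
print(torch.equal(first, unseeded))
```

The same pattern should apply to any function that draws its noise from PyTorch's global random number generator: seed once right before the call whose output needs to be reproducible.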
2- Even if I disable dithering in both the Kaldi and the torchaudio MFCCs, the two outputs are very different (the options are exactly the same in both cases):
torchaudio
tensor([[-64.7435, -23.0893, 1.5796, ..., -4.9001, -1.5039, -2.7683],
[-61.5527, -17.7455, -5.9670, ..., 4.4663, 2.5523, -0.9595],
[-58.9998, -21.4523, -10.7197, ..., 10.2993, 9.5475, -0.3667],
...,
[-65.0258, -23.9535, 3.2329, ..., 4.1740, 9.9711, -1.2087],
[-65.4491, -23.4586, 3.0314, ..., 3.4530, -0.2666, -3.1916],
[-65.9383, -23.1859, 3.6318, ..., 2.9440, 3.5066, -2.4232]])
kaldi
fnlp0_si1308 [
33.93769 -26.93453 -4.314013 -9.108547 -2.538414 -7.403401 -7.393436 -19.1162 2.36114 -3.599539 -8.258158 -3.048464 -2.534939
37.16539 -20.76378 -10.65134 -14.69143 -5.084549 -13.17811 -19.8767 -11.37231 0.9925694 3.125628 1.008414 0.7657758 -1.482104
....
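To put a number on the mismatch, here is a small sketch that compares the first three coefficients of the first frame, copied from the two outputs above (the remaining torchaudio values are elided in the printout, so only these are used). The tolerance passed to `torch.allclose` is an arbitrary choice for illustration, and the sketch assumes both rows correspond to the same frame.

```python
import torch

# First three coefficients of the first frame, copied from the printed outputs.
torchaudio_row = torch.tensor([-64.7435, -23.0893, 1.5796])
kaldi_row = torch.tensor([33.93769, -26.93453, -4.314013])

# Element-wise absolute difference between the two feature vectors.
abs_diff = (torchaudio_row - kaldi_row).abs()

print("absolute differences:", abs_diff)
print("largest difference:", abs_diff.max().item())

# A loose tolerance check; atol=1e-2 is an illustrative value, not a recommended threshold.
print("close enough:", torch.allclose(torchaudio_row, kaldi_row, atol=1e-2))
```

The first coefficient differs far more than the other two, so the mismatch is not uniform across coefficients.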
3- When I move the tensor to the GPU (with .to('cuda')), I get the following error:
log_energy = torch.max(strided_input.pow(2).sum(1), epsilon).log() # size (m)
RuntimeError: Expected object of backend CUDA but got backend CPU for argument #2 'other'
How can I compute the MFCCs on the GPU?
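The error message points at argument #2 of `torch.max`, that is `epsilon`, which suggests that `epsilon` is created on the CPU while `strided_input` lives on the GPU. Below is a minimal sketch of a device-aware version of that one line. The function name `compute_log_energy` and the choice of float32 machine epsilon are assumptions for illustration, not the torchaudio code.

```python
import torch


def compute_log_energy(strided_input: torch.Tensor) -> torch.Tensor:
    """Log energy per frame, with epsilon created on the input's device (illustrative sketch)."""
    # Assumed floor value: float32 machine epsilon, placed on the same device and dtype as the input.
    epsilon = torch.tensor(
        torch.finfo(torch.float32).eps,
        dtype=strided_input.dtype,
        device=strided_input.device,
    )
    # Both operands of torch.max now share a device, so the backend mismatch cannot occur.
    return torch.max(strided_input.pow(2).sum(1), epsilon).log()  # size (m)


# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
frames = torch.zeros(4, 400, device=device)  # placeholder frames of shape (m, window_size)
log_energy = compute_log_energy(frames)

print(log_energy.shape, log_energy.device)
```

Deriving the device from the input tensor, rather than hard-coding it, keeps the function usable on both CPU and GPU without extra arguments.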
4- On the positive side, the execution time (even on a single CPU) is comparable to Kaldi's (around 15 seconds for the entire TIMIT dataset). I would expect a further speed-up on the GPU.
5- I tried feeding the MFCC coefficients into a standard speech recognizer, and the performance with the original Kaldi coefficients is still much better. Do you have the same experience?
Thank you and thanks for developing this very useful toolkit!