Some issues with Kaldi MFCCs features #263

@ghost


Hi,
I'm running some experiments with the Kaldi-compliant MFCCs, and I ran into a few possible issues:

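1- When I run the following code:

 import soundfile as sf
 import torch
 from torchaudio.compliance.kaldi import mfcc  # assumed import; the original snippet omits it

 file = '/home/mirco/datasets/TIMIT/test/dr5/fnlp0/si1308.wav'
 signal, fs = sf.read(file)  # samples as a NumPy array, plus the sampling rate
 signal = torch.from_numpy(signal).unsqueeze(0).float()  # shape (1, num_samples)
 fea = mfcc(signal)  # Kaldi-compliant MFCCs with default options
 print(fea)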

The MFCCs are different every time I run the script:

run 1

tensor([[ 29.2496, -32.6150,  -7.1791,  ...,  -6.2034,  -5.8100,   3.5894],
       [ 28.1680, -35.9921,  -8.5621,  ..., -13.5980,  -4.2804,  -8.8075],
       [ 29.2831, -31.8580,  -8.8565,  ...,  -5.8166,  -4.2538,   6.4913],
       ...,
       [ 27.5078, -36.1139, -12.1319,  ..., -11.6493,   0.2557,  -4.9566],
       [ 28.9667, -33.5803,  -6.6644,  ...,  -6.1208,   2.7111,   2.7867],
       [ 28.6988, -33.6590, -12.0312,  ...,  -3.0909,  -0.0643,  -4.1769]])

run 2

tensor([[ 27.8255, -33.2356,  -8.8006,  ..., -13.2640,   1.0311,   4.8004],
       [ 29.5605, -34.0147, -10.3465,  ...,  -4.0096,  -1.5156,  -3.2499],
       [ 29.3978, -31.3415,  -6.4141,  ...,  10.6100,   2.3651,   6.1324],
       ...,
       [ 29.4321, -33.0013, -11.9812,  ...,  -2.9076,   6.3498,   1.8854],
       [ 28.2726, -34.0620,  -9.5291,  ...,  -5.4033,   6.0385,  -0.1867],
       [ 29.5408, -33.7757,  -9.1063,  ...,   5.8796,  -6.1365,  -3.2730]])

This is due to dithering, which adds a small amount of random noise to the signal. The problem is easily solved by calling torch.manual_seed before executing the code, but users should be aware of it to avoid confusion. It would be helpful to provide an example in the documentation.
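
A minimal sketch of why seeding helps, using only torch. The function `add_dither` is a hypothetical stand-in for the dithering step (Gaussian noise scaled by a `dither` factor), not the actual torchaudio code; the same pattern would apply to calling torch.manual_seed right before mfcc(signal).

```python
import torch


def add_dither(signal: torch.Tensor, dither: float = 1.0) -> torch.Tensor:
    """Hypothetical stand-in for dithering: add Gaussian noise scaled by `dither`."""
    return signal + dither * torch.randn_like(signal)


# Dummy one-second signal at 16 kHz (an assumption, not the TIMIT file).
signal = torch.zeros(1, 16000)

torch.manual_seed(1234)
first = add_dither(signal)

torch.manual_seed(1234)  # reseed: the same noise is drawn again
second = add_dither(signal)

third = add_dither(signal)  # no reseed: different noise

# With dither set to 0 no noise is added, so no seed is needed.
no_dither = add_dither(signal, dither=0.0)

print(torch.equal(first, second))      # seeded runs match
print(torch.equal(first, third))       # unseeded run differs
print(torch.equal(no_dither, signal))  # dithering disabled
```

Seeding keeps the dithering but makes it repeatable; setting the dither factor to zero removes the randomness altogether, which is the better choice when comparing against another toolkit.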

2- Even if I disable dithering in both Kaldi and torchaudio, the two sets of MFCCs are very different (the options are exactly the same in both cases):

torchaudio

tensor([[-64.7435, -23.0893,   1.5796,  ...,  -4.9001,  -1.5039,  -2.7683],
       [-61.5527, -17.7455,  -5.9670,  ...,   4.4663,   2.5523,  -0.9595],
       [-58.9998, -21.4523, -10.7197,  ...,  10.2993,   9.5475,  -0.3667],
       ...,
       [-65.0258, -23.9535,   3.2329,  ...,   4.1740,   9.9711,  -1.2087],
       [-65.4491, -23.4586,   3.0314,  ...,   3.4530,  -0.2666,  -3.1916],
       [-65.9383, -23.1859,   3.6318,  ...,   2.9440,   3.5066,  -2.4232]])

kaldi

fnlp0_si1308  [
 33.93769 -26.93453 -4.314013 -9.108547 -2.538414 -7.403401 -7.393436 -19.1162 2.36114 -3.599539 -8.258158 -3.048464 -2.534939
 37.16539 -20.76378 -10.65134 -14.69143 -5.084549 -13.17811 -19.8767 -11.37231 0.9925694 3.125628 1.008414 0.7657758 -1.482104
....
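
To make the comparison quantitative rather than visual, the Kaldi text output can be loaded into a tensor and compared with the torchaudio result. The sketch below is only an illustration: `parse_kaldi_text_matrix` and `max_abs_diff` are hypothetical helpers, and the input holds just the two Kaldi frames quoted above, closed with a `]` that the truncated output does not show.

```python
import torch


def parse_kaldi_text_matrix(text: str):
    """Parse one 'utt_id [ rows ... ]' entry of Kaldi text-format features."""
    utt_id, _, body = text.partition("[")
    rows = [
        [float(value) for value in line.split()]
        for line in body.replace("]", "").strip().splitlines()
        if line.strip()
    ]
    return utt_id.strip(), torch.tensor(rows)


def max_abs_diff(a: torch.Tensor, b: torch.Tensor) -> float:
    """Largest element-wise absolute difference over the frames both tensors share."""
    n = min(a.shape[0], b.shape[0])
    return (a[:n] - b[:n]).abs().max().item()


# The two frames printed by Kaldi above; the remaining frames are elided there.
kaldi_text = """fnlp0_si1308  [
 33.93769 -26.93453 -4.314013 -9.108547 -2.538414 -7.403401 -7.393436 -19.1162 2.36114 -3.599539 -8.258158 -3.048464 -2.534939
 37.16539 -20.76378 -10.65134 -14.69143 -5.084549 -13.17811 -19.8767 -11.37231 0.9925694 3.125628 1.008414 0.7657758 -1.482104 ]"""

utt_id, kaldi_fea = parse_kaldi_text_matrix(kaldi_text)
print(utt_id, tuple(kaldi_fea.shape))

# Sanity check: a matrix compared with itself differs by exactly zero.
print(max_abs_diff(kaldi_fea, kaldi_fea))
```

With the full matrices from both toolkits, `max_abs_diff(torchaudio_fea, kaldi_fea)` gives a single number to track while aligning the options.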

3- When I move the tensor to the GPU (with .to('cuda')), I get the following error:

 log_energy = torch.max(strided_input.pow(2).sum(1), epsilon).log() # size (m)
 RuntimeError: Expected object of backend CUDA but got backend CPU for argument #2 'other'

How can I compute the MFCCs using cuda?
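
The error message suggests that `epsilon` lives on the CPU while `strided_input` is on the GPU. A minimal sketch of the usual remedy, assuming that is indeed the cause: create the constant on the same device and dtype as the input. The function `log_energy` below is a hypothetical reconstruction built around the line in the traceback, not the library's actual code.

```python
import torch


def log_energy(strided_input: torch.Tensor) -> torch.Tensor:
    """Per-frame log energy, floored at machine epsilon to avoid log(0)."""
    # Build epsilon on the input's device and dtype so torch.max sees matching tensors.
    epsilon = torch.tensor(
        torch.finfo(strided_input.dtype).eps,
        device=strided_input.device,
        dtype=strided_input.dtype,
    )
    return torch.max(strided_input.pow(2).sum(1), epsilon).log()  # size (m)


# Falls back to the CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
frames = torch.randn(4, 400, device=device)  # 4 dummy frames of 400 samples
energy = log_energy(frames)
print(energy.shape, energy.device)

# All-zero frames are floored at epsilon instead of producing -inf.
silent = log_energy(torch.zeros(2, 400, device=device))
print(silent)
```

Until something like this is in place in the library, a workaround is to compute the features on the CPU and move the result to the GPU afterwards with .to('cuda').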

4- On the positive side, the execution time (even on a single CPU) is comparable to Kaldi's (around 15 seconds for the entire TIMIT dataset). I would expect a further speed-up on the GPU.

5- I tried feeding the torchaudio MFCCs into a standard speech recognizer, and the performance with the original Kaldi features is still much better. Have you observed the same?

Thank you and thanks for developing this very useful toolkit!
