Some issues with Kaldi MFCCs features

Hi,
I'm trying to run some experiments with the Kaldi-compliant MFCCs, but I have run into some possible issues:

1- When I run the following code:


```
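import soundfile as sf
import torch
from torchaudio.compliance.kaldi import mfcc  # assumed import; not shown in the original snippet

file = '/home/mirco/datasets/TIMIT/test/dr5/fnlp0/si1308.wav'
signal, fs = sf.read(file)  # waveform as a NumPy array, plus its sample rate
signal = torch.from_numpy(signal).unsqueeze(0).float()  # shape (1, num_samples)
fea = mfcc(signal)
print(fea)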
```

The MFCCs are different every time I run the script:

run 1
```
tensor([[ 29.2496, -32.6150,  -7.1791,  ...,  -6.2034,  -5.8100,   3.5894],
       [ 28.1680, -35.9921,  -8.5621,  ..., -13.5980,  -4.2804,  -8.8075],
       [ 29.2831, -31.8580,  -8.8565,  ...,  -5.8166,  -4.2538,   6.4913],
       ...,
       [ 27.5078, -36.1139, -12.1319,  ..., -11.6493,   0.2557,  -4.9566],
       [ 28.9667, -33.5803,  -6.6644,  ...,  -6.1208,   2.7111,   2.7867],
       [ 28.6988, -33.6590, -12.0312,  ...,  -3.0909,  -0.0643,  -4.1769]])
```
run 2
```
tensor([[ 27.8255, -33.2356,  -8.8006,  ..., -13.2640,   1.0311,   4.8004],
       [ 29.5605, -34.0147, -10.3465,  ...,  -4.0096,  -1.5156,  -3.2499],
       [ 29.3978, -31.3415,  -6.4141,  ...,  10.6100,   2.3651,   6.1324],
       ...,
       [ 29.4321, -33.0013, -11.9812,  ...,  -2.9076,   6.3498,   1.8854],
       [ 28.2726, -34.0620,  -9.5291,  ...,  -5.4033,   6.0385,  -0.1867],
       [ 29.5408, -33.7757,  -9.1063,  ...,   5.8796,  -6.1365,  -3.2730]])
```

This is due to dithering, which adds random noise to the signal. The problem is easily solved by setting a seed with `manual_seed` before executing the code, but users should be made aware of this behavior. It would be helpful to provide an example in the documentation.
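A minimal sketch of the seeding workaround follows. It does not call torchaudio: the hypothetical `add_dither` helper only stands in for the dithering step, which is assumed here to be Gaussian noise drawn from PyTorch's global random number generator, so that the effect of `torch.manual_seed` can be seen in isolation.

```python
import torch


def add_dither(signal: torch.Tensor, dither: float = 1.0) -> torch.Tensor:
    # Hypothetical stand-in for the dithering step: adds Gaussian noise
    # drawn from PyTorch's global random number generator.
    return signal + dither * torch.randn_like(signal)


signal = torch.zeros(1, 16000)  # placeholder for a real waveform

# Without a fixed seed, two calls draw different noise.
unseeded_a = add_dither(signal)
unseeded_b = add_dither(signal)

# Re-seeding before each call makes the noise, and hence the output, repeatable.
torch.manual_seed(0)
seeded_a = add_dither(signal)
torch.manual_seed(0)
seeded_b = add_dither(signal)

print(torch.equal(unseeded_a, unseeded_b))  # False
print(torch.equal(seeded_a, seeded_b))      # True
```

Applied to the snippet from this issue, that would mean calling `torch.manual_seed(...)` once before `mfcc(signal)`.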

2- Even when I disable dithering in both Kaldi and torchaudio, the two sets of MFCCs are very different (the options are exactly the same in both cases):

torchaudio

```
tensor([[-64.7435, -23.0893,   1.5796,  ...,  -4.9001,  -1.5039,  -2.7683],
       [-61.5527, -17.7455,  -5.9670,  ...,   4.4663,   2.5523,  -0.9595],
       [-58.9998, -21.4523, -10.7197,  ...,  10.2993,   9.5475,  -0.3667],
       ...,
       [-65.0258, -23.9535,   3.2329,  ...,   4.1740,   9.9711,  -1.2087],
       [-65.4491, -23.4586,   3.0314,  ...,   3.4530,  -0.2666,  -3.1916],
       [-65.9383, -23.1859,   3.6318,  ...,   2.9440,   3.5066,  -2.4232]])
```

Kaldi
```
fnlp0_si1308  [
 33.93769 -26.93453 -4.314013 -9.108547 -2.538414 -7.403401 -7.393436 -19.1162 2.36114 -3.599539 -8.258158 -3.048464 -2.534939
 37.16539 -20.76378 -10.65134 -14.69143 -5.084549 -13.17811 -19.8767 -11.37231 0.9925694 3.125628 1.008414 0.7657758 -1.482104
....
```
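To put a number on the mismatch, the two outputs can be compared element by element. The sketch below is only an illustration: `parse_kaldi_text_matrix` and `max_abs_diff` are hypothetical helpers, the parser assumes the `utt_id [ rows ]` text layout shown above, and the data is limited to the first three coefficients of the first two frames copied from the printed outputs (the remaining values are elided in the torchaudio printout).

```python
def parse_kaldi_text_matrix(text):
    """Parse one 'utt_id [ row ... ]' block of Kaldi text-format features."""
    utt_id, _, body = text.partition("[")
    body = body.split("]")[0]
    rows = [[float(v) for v in line.split()]
            for line in body.splitlines() if line.strip()]
    return utt_id.strip(), rows


def max_abs_diff(a, b):
    """Largest absolute difference over the overlapping rows and columns."""
    return max(abs(x - y)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


# First three coefficients of the first two frames, copied from the outputs above.
kaldi_text = """fnlp0_si1308  [
 33.93769 -26.93453 -4.314013
 37.16539 -20.76378 -10.65134 ]"""

torchaudio_rows = [[-64.7435, -23.0893, 1.5796],
                   [-61.5527, -17.7455, -5.9670]]

utt_id, kaldi_rows = parse_kaldi_text_matrix(kaldi_text)
print(utt_id, max_abs_diff(torchaudio_rows, kaldi_rows))
```

In this truncated excerpt the difference is dominated by the first coefficient, which is off by roughly 98.7 in both frames, while the other two coefficients differ by about 3 to 6.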

3- When I move the tensor to the GPU (with `.to('cuda')`), I get the following error:

`log_energy = torch.max(strided_input.pow(2).sum(1), epsilon).log()  # size (m)`

`RuntimeError: Expected object of backend CUDA but got backend CPU for argument #2 'other'`

How can I compute the MFCCs on the GPU using CUDA?
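The error message suggests that `epsilon` lives on the CPU while `strided_input` is on the GPU. One possible way to avoid such a mismatch is to create the constant on the same device and dtype as the input. The sketch below is a hypothetical standalone rewrite of the line from the traceback, not torchaudio's actual code, and using machine epsilon as the floor value is an assumption.

```python
import torch


def log_energy(strided_input: torch.Tensor) -> torch.Tensor:
    # Build epsilon on the same device and dtype as the input so that
    # torch.max never mixes CPU and CUDA tensors.
    epsilon = torch.tensor(torch.finfo(strided_input.dtype).eps,
                           dtype=strided_input.dtype,
                           device=strided_input.device)
    return torch.max(strided_input.pow(2).sum(1), epsilon).log()  # size (m)


# Falls back to the CPU when no GPU is available, so the sketch runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"
frames = torch.randn(4, 400, device=device)  # 4 placeholder frames of 400 samples
energy = log_energy(frames)
print(energy.shape, energy.device)
```

Independently of any fix in the library, computing the features on the CPU and moving the result to the GPU afterwards with `.to('cuda')` sidesteps the error.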

4- On the positive side, the execution time (even on a single CPU) is comparable to Kaldi's (around 15 seconds for the entire TIMIT dataset). I would expect a further speed-up on the GPU.

5- I tried feeding the MFCC coefficients into a standard speech recognizer, and the performance with the original Kaldi coefficients is still much better. Have you observed the same?

Thank you and thanks for developing this very useful toolkit!

Some issues with Kaldi MFCCs features #263

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Some issues with Kaldi MFCCs features #263

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions