I am trying to process a WAV file whose data section contains only zeroes. The file is 1.2 seconds long (attached).
whisper.cpp produces a hallucinated transcript and reports the wrong segment duration.
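For anyone who wants to reproduce this without the attachment, here is a minimal sketch that writes a comparable file. The sample count (19200) and the 16 kHz rate follow from the "19200 samples, 1.2 sec" line in the logs below; mono and 16-bit PCM are my assumptions about the attached file, and the function names are only illustrative.

```python
import wave

SAMPLE_RATE = 16000   # 19200 samples / 1.2 s, as reported in the logs
NUM_SAMPLES = 19200
SAMPLE_WIDTH = 2      # assumption: 16-bit PCM
CHANNELS = 1          # assumption: mono


def write_zero_wav(path):
    """Write a WAV file whose data section contains only zero bytes."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        # bytes(n) yields n zero bytes, i.e. digital silence
        wav.writeframes(bytes(NUM_SAMPLES * SAMPLE_WIDTH * CHANNELS))
    return path


def duration_seconds(path):
    """Return the duration implied by the WAV header (frames / rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Calling `write_zero_wav("samples/zeroes.wav")` should give a file that can be passed to the commands below in place of the attachment.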
$ ./main -m ./models/ggml-large-v3.bin -l ru --threads 8 -mc 0 samples/zeroes.wav
whisper_init_from_file_with_params_no_state: loading model from './models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load: CUDA0 total size = 3094.36 MB
whisper_model_load: model size = 3094.36 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size = 220.20 MB
whisper_init_state: kv cross size = 245.76 MB
whisper_init_state: compute buffer (conv) = 36.26 MB
whisper_init_state: compute buffer (encode) = 926.66 MB
whisper_init_state: compute buffer (cross) = 9.38 MB
whisper_init_state: compute buffer (decode) = 209.26 MB
system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |
main: processing 'samples/zeroes.wav' (19200 samples, 1.2 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = ru, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:29.980] Продолжение следует...
whisper_print_timings: load time = 685.11 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 4.86 ms
whisper_print_timings: sample time = 24.48 ms / 79 runs ( 0.31 ms per run)
whisper_print_timings: encode time = 120.78 ms / 1 runs ( 120.78 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: batchd time = 323.14 ms / 77 runs ( 4.20 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 1164.00 ms
$ ./main -m ./models/ggml-large-v2.bin -l ru --threads 8 -mc 0 samples/zeroes.wav
whisper_init_from_file_with_params_no_state: loading model from './models/ggml-large-v2.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load: CUDA0 total size = 3093.99 MB
whisper_model_load: model size = 3093.99 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size = 220.20 MB
whisper_init_state: kv cross size = 245.76 MB
whisper_init_state: compute buffer (conv) = 34.82 MB
whisper_init_state: compute buffer (encode) = 926.66 MB
whisper_init_state: compute buffer (cross) = 9.38 MB
whisper_init_state: compute buffer (decode) = 209.26 MB
system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |
main: processing 'samples/zeroes.wav' (19200 samples, 1.2 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = ru, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:04.000] Редактор субтитров А.Семкин Корректор А.Егорова
whisper_print_timings: load time = 2376.23 ms
whisper_print_timings: fallbacks = 1 p / 0 h
whisper_print_timings: mel time = 5.14 ms
whisper_print_timings: sample time = 50.08 ms / 152 runs ( 0.33 ms per run)
whisper_print_timings: encode time = 238.64 ms / 1 runs ( 238.64 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: batchd time = 821.07 ms / 148 runs ( 5.55 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 3498.43 ms
$ ./main -m ./models/ggml-large-v3.bin -l ru --threads 8 -mc 0 samples/zeroes.wav -ng
whisper_init_from_file_with_params_no_state: loading model from './models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_model_load: CPU total size = 3094.36 MB
whisper_model_load: model size = 3094.36 MB
whisper_init_state: kv self size = 220.20 MB
whisper_init_state: kv cross size = 245.76 MB
whisper_init_state: compute buffer (conv) = 36.26 MB
whisper_init_state: compute buffer (encode) = 926.66 MB
whisper_init_state: compute buffer (cross) = 9.38 MB
whisper_init_state: compute buffer (decode) = 209.26 MB
system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |
main: processing 'samples/zeroes.wav' (19200 samples, 1.2 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = ru, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:29.980] Субтитры создавал DimaTorzok
whisper_print_timings: load time = 957.60 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 6.50 ms
whisper_print_timings: sample time = 24.92 ms / 75 runs ( 0.33 ms per run)
whisper_print_timings: encode time = 4063.61 ms / 1 runs ( 4063.61 ms per run)
whisper_print_timings: decode time = 565.81 ms / 10 runs ( 56.58 ms per run)
whisper_print_timings: batchd time = 1186.10 ms / 63 runs ( 18.83 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 6809.96 ms
I checked this on the latest master branch:
$ git describe --tags
v1.5.4-183-gb602819
I think this is a bug.
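To quantify the wrong duration, here is a small sketch that compares the end timestamp of a printed segment with the real audio length. It assumes the `[hh:mm:ss.mmm --> hh:mm:ss.mmm]` line format seen in the output above and a 16 kHz sample rate; the helper names are hypothetical and not part of whisper.cpp.

```python
import re

# Matches the "[hh:mm:ss.mmm --> hh:mm:ss.mmm]" prefix of a transcript line.
TS_LINE = re.compile(
    r"\[(\d+):(\d+):(\d+)\.(\d+) --> (\d+):(\d+):(\d+)\.(\d+)\]"
)


def to_seconds(h, m, s, ms):
    """Convert timestamp fields (milliseconds given as three digits) to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def segment_end_seconds(line):
    """Return the end time of the segment printed on this line."""
    match = TS_LINE.search(line)
    if match is None:
        raise ValueError("no timestamp range found")
    return to_seconds(*match.groups()[4:])


def overshoot(line, n_samples, sample_rate=16000):
    """Seconds by which the segment end exceeds the actual audio length."""
    return segment_end_seconds(line) - n_samples / sample_rate
```

Applied to the large-v3 output lines above with `n_samples=19200`, this gives an overshoot of about 28.78 seconds. For the large-v2 run it gives about 2.8 seconds.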