Eval bug: data corruption on CUDA experts offload #16945

@pwilkin

Description

Name and Version

pwilkin@SYN-PC-11:/devel/models$ llama-cli --version
load_backend: loaded BLAS backend from /devel/tools/llama.cpp/build/bin/libggml-blas.so
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (OpenBLAS)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from /devel/tools/llama.cpp/build/bin/libggml-cuda.so
register_backend: registered backend CUDA (2 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 5070 Ti)
register_device: registered device CUDA1 (NVIDIA GeForce RTX 5070 Ti)
ggml_backend_load_best: /devel/tools/llama.cpp/build/bin/libggml-cpu-alderlake.so score: 128
ggml_backend_load_best: /devel/tools/llama.cpp/build/bin/libggml-cpu-icelake.so score: 0
ggml_backend_load_best: /devel/tools/llama.cpp/build/bin/libggml-cpu-haswell.so score: 64
ggml_backend_load_best: /devel/tools/llama.cpp/build/bin/libggml-cpu-skylakex.so score: 0
ggml_backend_load_best: /devel/tools/llama.cpp/build/bin/libggml-cpu-sapphirerapids.so score: 0
ggml_backend_load_best: /devel/tools/llama.cpp/build/bin/libggml-cpu-sandybridge.so score: 21
ggml_backend_load_best: /devel/tools/llama.cpp/build/bin/libggml-cpu-sse42.so score: 5
ggml_backend_load_best: /devel/tools/llama.cpp/build/bin/libggml-cpu-x64.so score: 1
load_backend: loaded CPU backend from /devel/tools/llama.cpp/build/bin/libggml-cpu-alderlake.so
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Core(TM) i7-14700KF)
version: 6921 (eca77bf)
built with cc (Ubuntu 15.2.0-4ubuntu4) 15.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

2x 5070 Ti

Models

Minimax-M2

Problem description & steps to reproduce

This command handles complex queries without problems:

llama-server -m MiniMaxAI_MiniMax-M2-IQ3_M/MiniMaxAI_MiniMax-M2-IQ3_M-00001-of-00003.gguf -ngl 99 --cpu-moe --host 0.0.0.0 -c 50000 -fa on --alias syndatis --threads 24 --chat-template-file /devel/tools/llama.cpp/models/templates/unsloth-MiniMax-M2.jinja --jinja

This one starts generating corrupted output with any prompt of non-trivial size (say, 500 tokens):

llama-server -m MiniMaxAI_MiniMax-M2-IQ3_M/MiniMaxAI_MiniMax-M2-IQ3_M-00001-of-00003.gguf -ngl 99 -ot "\.([0-9]|[0-5][0-9]|5[0-3])\.ffn_.*_exps=CPU,blk.5[4-8].*=CUDA0,blk.(6[0-2]|5[8-9]).*=CUDA1" --host 0.0.0.0 -c 50000 -fa on --alias syndatis --threads 24 --chat-template-file /devel/tools/llama.cpp/models/templates/unsloth-MiniMax-M2.jinja --jinja
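As a side note for anyone reading the failing command: the three `-ot` patterns overlap. Since `[0-5][0-9]` already covers 00-59, the CPU pattern also matches the expert tensors of blocks 54-59, and block 58 is named by both CUDA patterns. Whether that interacts with the corruption presumably depends on how llama.cpp resolves multiple matching overrides (if the first match wins, the experts of blocks 54-59 would land on CPU rather than the intended GPUs). Here is a minimal Python sketch that enumerates the overlaps; the block count of 63 (blk.0 through blk.62) and the tensor name shape `blk.N.ffn_gate_exps.weight` are assumptions for illustration, not taken from the model:

```python
import re

# The three -ot overrides from the failing command, in command-line order.
overrides = [
    (r"\.([0-9]|[0-5][0-9]|5[0-3])\.ffn_.*_exps", "CPU"),
    (r"blk.5[4-8].*", "CUDA0"),
    (r"blk.(6[0-2]|5[8-9]).*", "CUDA1"),
]

# Assumed block count and tensor naming, for illustration only.
for i in range(63):
    name = f"blk.{i}.ffn_gate_exps.weight"  # representative expert tensor
    hits = [dev for pat, dev in overrides if re.search(pat, name)]
    if len(hits) != 1:
        print(f"{name} -> {hits}")
```

Under these assumptions the sketch flags blk.54 through blk.57 as matched by both the CPU and CUDA0 patterns, blk.58 by all three, and blk.59 by both the CPU and CUDA1 patterns.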

Possibly related to #16935

First Bad Commit

No response

Relevant log output

N/A


Labels

Nvidia GPU, bug
