
[soft max] capping the num tasks to 4 is limiting the prompt eval perf #5103

@snadampal

Description


Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

System: AWS Graviton3, c7g.16xl instance with Ubuntu 22.04
llama.cpp version: latest, commit: 6f9939d

The following commit caps the number of soft max tasks to 4. I would like to understand: why 4?

commit adf3de4f69ff7e44131222f05f9c7447ac0be3cb (HEAD, tag: b1605)
Author: Georgi Gerganov <[email protected]>
Date:   Sun Dec 3 15:56:22 2023 +0200

    ggml : fix soft max out-of-bounds access (#4307)

    ggml-ci
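For background on what n_tasks controls: ggml splits the rows of the soft max input across n_tasks workers, so a cap of 4 means at most 4 threads touch this op regardless of -t. The partitioning pattern typically looks like the following (a hedged sketch with illustrative names, not the upstream ggml_compute_forward_soft_max):

    #include <math.h>
    #include <stddef.h>

    /* Sketch of per-task row partitioning in the style of ggml compute
       kernels: ith is this task's index, nth is the total task count.
       Illustrative code only, not the exact upstream source. */
    static void soft_max_rows(float *x, int n_rows, int n_cols, int ith, int nth) {
        const int dr  = (n_rows + nth - 1) / nth;                /* rows per task, rounded up */
        const int ir0 = dr * ith;                                /* first row for this task   */
        const int ir1 = ir0 + dr < n_rows ? ir0 + dr : n_rows;   /* one past the last row     */

        for (int i = ir0; i < ir1; i++) {
            float *row = x + (size_t) i * n_cols;

            /* standard numerically stable soft max over one row */
            float max = row[0];
            for (int j = 1; j < n_cols; j++) { if (row[j] > max) max = row[j]; }
            float sum = 0.0f;
            for (int j = 0; j < n_cols; j++) { row[j] = expf(row[j] - max); sum += row[j]; }
            for (int j = 0; j < n_cols; j++) { row[j] /= sum; }
        }
    }

With nth capped at 4, tasks 0 through 3 cover all rows while the remaining threads idle during this op; with nth = n_threads, the rows spread across all available workers.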

Without the cap of 4, using just the source row count or n_threads for n_tasks, the prompt eval performance improves by 4% for DOT kernels and 9% for MMLA kernels (PR):

    n_tasks = MIN(n_threads, ggml_nrows(node->src[0]));
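To make the difference concrete, here is a minimal standalone comparison of the two formulas, assuming the capped form is MIN(MIN(4, n_threads), nrows) as the issue title suggests; the row count is illustrative:

    #include <stdio.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    int main(void) {
        const int n_threads = 64;   /* matches -t 64 in the reproducer below */
        const int n_rows    = 1024; /* illustrative soft max input row count */

        /* capped formula, as described in this issue */
        const int n_tasks_capped   = MIN(MIN(4, n_threads), n_rows);

        /* proposed formula from the PR */
        const int n_tasks_proposed = MIN(n_threads, n_rows);

        printf("capped: %d tasks, proposed: %d tasks\n",
               n_tasks_capped, n_tasks_proposed);
        /* prints: capped: 4 tasks, proposed: 64 tasks */
        return 0;
    }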

Reproducer:

    ./main -m /llama.cpp/models/open_llama_13b/ggml-model-q8_0.gguf -c 1015 -n 256 -t 64 --file <input_file.txt>
