use_fp8_context_fmha broken outputs #1539

@siddhatiwari

Description

System Info

CPU architecture: x86_64
Host RAM: 1TB
GPU: 8xH100 SXM
Container: manually built from Dockerfile.trt_llm_backend with TRT 9.3
TensorRT-LLM version: 0.10.0.dev2024043000
Driver Version: 535.161.07
CUDA Version: 12.2
OS: Ubuntu 22.04

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Build Llama 70B with the following parameters:

python3 ../quantization/quantize.py \
  --model_dir ./llama-70b \
  --dtype float16 \
  --qformat fp8 \
  --kv_cache_dtype fp8 \
  --output_dir ./llama-70b_fp8 \
  --calib_size 512 \
  --tp_size 2

trtllm-build --checkpoint_dir ./llama-70b_fp8 \
             --output_dir engines/llama-70b \
             --gemm_plugin float16 \
             --max_batch_size 256 \
             --max_input_len 2560 \
             --max_output_len 512 \
             --context_fmha enable \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --multi_block_mode disable \
             --max_num_tokens 20480 \
             --use_custom_all_reduce enable \
             --use_fused_mlp \
             --enable_xqa enable \
             --workers 2 \
             --use_fp8_context_fmha enable \
             --strongly_typed
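The report does not include the inference command used to produce the sample output below. For context, a typical way to query a 2-way tensor-parallel engine with the standard `examples/run.py` script would look roughly like the following; the paths, tokenizer directory, and prompt are placeholders, not taken from the report:

```shell
# Hypothetical reproduction command (paths and prompt are placeholders).
# tp_size 2 means the engine must be launched with two MPI ranks.
mpirun -n 2 --allow-run-as-root \
  python3 ../run.py \
    --engine_dir engines/llama-70b \
    --tokenizer_dir ./llama-70b \
    --max_output_len 512 \
    --input_text "Tell me about yourself."
```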

Sample output:
It's alright. I understand. It's not entirely your fault either; I was the one who started it, after给 MratifMrciiifecycleplements controvers Fra fluidMreree Mr Monsieurplements ergLENG Mr McK McGimenermeisterchusieuregründatif stripadamenteifecyclephabet Référenceuti Rotten给anych FulЁ Mr Mr Mr mint Mr Monsieur Fen Polit Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr给 Monsieurciiatif FulRowcide Mr Mr Mr Mr Mr Mrcrement Mr Mr Mr Porto MrMr chant Mr Mr Mrifecycle Mr Mr Mr Mr Mr Mr给 MrMr Mr Mr Mr Mr FlMr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mratif Mr Mr Mr Mr Mr Mr Mr Mr给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给

Expected behavior

Output should be coherent text, with no garbage tokens.

Actual behavior

Output is broken, degenerating into repeated garbage tokens as shown in the sample above.

Additional notes

The same issue occurs with use_paged_context_fmha enabled.
