Description
System Info
- CPU architecture: x86_64
- Host RAM: 1 TB
- GPU: 8x H100 SXM
- Container: built manually from Dockerfile.trt_llm_backend with TensorRT 9.3
- TensorRT-LLM version: 0.10.0.dev2024043000
- Driver Version: 535.161.07
- CUDA Version: 12.2
- OS: Ubuntu 22.04
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Quantize and build Llama 70B with the following parameters:
python3 ../quantization/quantize.py \
    --model_dir ./llama-70b \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./llama-70b_fp8 \
    --calib_size 512 \
    --tp_size 2
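Before building, the quantized checkpoint can be sanity-checked; this is a sketch that assumes the standard TensorRT-LLM checkpoint layout, where quantize.py writes a config.json with a quantization section into the output directory:
# Assumed checkpoint layout: print the quantization section to confirm
# that both the weights and the KV cache were recorded as FP8.
python3 -c "import json; print(json.load(open('./llama-70b_fp8/config.json'))['quantization'])"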
trtllm-build --checkpoint_dir ./llama-70b_fp8 \
    --output_dir engines/llama-70b \
    --gemm_plugin float16 \
    --max_batch_size 256 \
    --max_input_len 2560 \
    --max_output_len 512 \
    --context_fmha enable \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --multi_block_mode disable \
    --max_num_tokens 20480 \
    --use_custom_all_reduce enable \
    --use_fused_mlp \
    --enable_xqa enable \
    --workers 2 \
    --use_fp8_context_fmha enable \
    --strongly_typed
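A minimal way to reproduce generation with the built engine; this is a sketch assuming the standard examples/run.py shipped with this release, and the prompt shown is a placeholder, not the one actually used:
# Run the TP=2 engine under mpirun; the real prompt is elided ("...").
mpirun -n 2 --allow-run-as-root \
    python3 ../run.py \
        --engine_dir engines/llama-70b \
        --tokenizer_dir ./llama-70b \
        --max_output_len 512 \
        --input_text "..."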
Sample output:
It's alright. I understand. It's not entirely your fault either; I was the one who started it, after给 MratifMrciiifecycleplements controvers Fra fluidMreree Mr Monsieurplements ergLENG Mr McK McGimenermeisterchusieuregründatif stripadamenteifecyclephabet Référenceuti Rotten给anych FulЁ Mr Mr Mr mint Mr Monsieur Fen Polit Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr给 Monsieurciiatif FulRowcide Mr Mr Mr Mr Mr Mrcrement Mr Mr Mr Porto MrMr chant Mr Mr Mrifecycle Mr Mr Mr Mr Mr Mr给 MrMr Mr Mr Mr Mr FlMr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mratif Mr Mr Mr Mr Mr Mr Mr Mr给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给
Expected behavior
Generation should produce coherent text; the output should not be corrupted.
Actual behavior
Generation starts out coherent but quickly degenerates into garbage tokens (repeated "Mr", "给", etc.), as shown in the sample output above.
Additional notes
The same corruption occurs with --use_paged_context_fmha enable.
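One way to isolate the problem (a sketch, not something verified here) is to rebuild with only the FP8 context FMHA flag flipped to disable, keeping every other flag identical, and compare outputs:
# Same build as above except --use_fp8_context_fmha is disabled, to check
# whether the FP8 context FMHA path is the source of the corruption.
trtllm-build --checkpoint_dir ./llama-70b_fp8 \
    --output_dir engines/llama-70b-fmha-off \
    --gemm_plugin float16 \
    --max_batch_size 256 \
    --max_input_len 2560 \
    --max_output_len 512 \
    --context_fmha enable \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --multi_block_mode disable \
    --max_num_tokens 20480 \
    --use_custom_all_reduce enable \
    --use_fused_mlp \
    --enable_xqa enable \
    --workers 2 \
    --use_fp8_context_fmha disable \
    --strongly_typed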