Description
System Info
- CPU architecture: x86_64
- Host RAM: 1 TB
- GPU: 8x H100 SXM
- Container: built manually from Dockerfile.trt_llm_backend with TensorRT 9.3
- TensorRT-LLM version: 0.10.0.dev2024043000
- Driver Version: 535.161.07
- CUDA Version: 12.2
- OS: Ubuntu 22.04
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Quantize and build Llama 70B with the following parameters:
python3 ../quantization/quantize.py \
    --model_dir ./llama-70b \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./llama-70b_fp8 \
    --calib_size 512 \
    --tp_size 2
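Before building, the quantized checkpoint can be sanity-checked; this is a sketch that assumes the standard TensorRT-LLM checkpoint layout, where quantize.py writes a config.json with a quantization section into the output directory:
# Assumed checkpoint layout: print the quantization section to confirm
# that both the weights and the KV cache were recorded as FP8.
python3 -c "import json; print(json.load(open('./llama-70b_fp8/config.json'))['quantization'])"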
trtllm-build --checkpoint_dir ./llama-70b_fp8 \
    --output_dir engines/llama-70b \
    --gemm_plugin float16 \
    --max_batch_size 256 \
    --max_input_len 2560 \
    --max_output_len 512 \
    --context_fmha enable \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --multi_block_mode disable \
    --max_num_tokens 20480 \
    --use_custom_all_reduce enable \
    --use_fused_mlp \
    --enable_xqa enable \
    --workers 2 \
    --use_fp8_context_fmha enable \
    --strongly_typed
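A minimal way to reproduce generation with the built engine; this is a sketch assuming the standard examples/run.py shipped with this release, and the prompt shown is a placeholder, not the one actually used:
# Run the TP=2 engine under mpirun; the real prompt is elided ("...").
mpirun -n 2 --allow-run-as-root \
    python3 ../run.py \
        --engine_dir engines/llama-70b \
        --tokenizer_dir ./llama-70b \
        --max_output_len 512 \
        --input_text "..."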
Sample output:
It's alright. I understand. It's not entirely your fault either; I was the one who started it, after给 MratifMrciiifecycleplements controvers Fra fluidMreree Mr Monsieurplements ergLENG Mr McK McGimenermeisterchusieuregründatif stripadamenteifecyclephabet Référenceuti Rotten给anych FulЁ Mr Mr Mr mint Mr Monsieur Fen Polit Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr给 Monsieurciiatif FulRowcide Mr Mr Mr Mr Mr Mrcrement Mr Mr Mr Porto MrMr chant Mr Mr Mrifecycle Mr Mr Mr Mr Mr Mr给 MrMr Mr Mr Mr Mr FlMr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mratif Mr Mr Mr Mr Mr Mr Mr Mr给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给
Expected behavior
Generation should produce coherent text; the output should not be corrupted.
Actual behavior
Generation starts out coherent but quickly degenerates into garbage tokens (repeated "Mr", "给", etc.), as shown in the sample output above.
Additional notes
The same corruption occurs with --use_paged_context_fmha enable.
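One way to isolate the problem (a sketch, not something verified here) is to rebuild with only the FP8 context FMHA flag flipped to disable, keeping every other flag identical, and compare outputs:
# Same build as above except --use_fp8_context_fmha is disabled, to check
# whether the FP8 context FMHA path is the source of the corruption.
trtllm-build --checkpoint_dir ./llama-70b_fp8 \
    --output_dir engines/llama-70b-fmha-off \
    --gemm_plugin float16 \
    --max_batch_size 256 \
    --max_input_len 2560 \
    --max_output_len 512 \
    --context_fmha enable \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --multi_block_mode disable \
    --max_num_tokens 20480 \
    --use_custom_all_reduce enable \
    --use_fused_mlp \
    --enable_xqa enable \
    --workers 2 \
    --use_fp8_context_fmha disable \
    --strongly_typed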