
Repeated outputs for long input tasks on Llama 3 70B compared to vLLM and HF's transformers #1788


Description

@DreamGenX

System Info

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I built the TensorRT-LLM engine in several different ways, outlined below, and compared the output quality on a domain-specific task that involves long inputs (typically >>2000 input tokens and >500 output tokens).

The outputs from TensorRT-LLM (obtained through running the run.py script, as well as through running the GptManager in all of the different modes: V1, InflightBatching, InflightFusedBatching) exhibit repetition ~20% of the time (sample outputs below).

When running the same prompts with vLLM, using the same sampling params (namely temperature, presencePenalty and frequencyPenalty), the outputs do not exhibit these repetitive patterns.
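For reference, the vLLM side of the comparison was roughly equivalent to the minimal sketch below, using vLLM's offline LLM API; the model path, prompt, temperature, and max_tokens are placeholders rather than the exact values from my runs, and the penalties match the 0.1 used in most of the TensorRT-LLM tests:

# Minimal sketch of the vLLM comparison run (offline LLM API).
# Paths, prompt, temperature, and max_tokens are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/workspace/llama3-70b",
    tensor_parallel_size=4,
    dtype="bfloat16",
)

sampling_params = SamplingParams(
    temperature=1.0,        # placeholder value
    presence_penalty=0.1,   # 0.1 used in most tests, as noted below
    frequency_penalty=0.1,
    max_tokens=1024,
)

prompt = "<long story context + instruction to continue the story>"
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)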

Here are some of the ways I tried to build the TensorRT-LLM engine:

  • I tried float16, bfloat16, and also fp8 quantization
  • I tried context_fmha enable/disable and also context_fmha_fp32_acc enable/disable
  • I tried use_custom_all_reduce enable/disable
  • I tried gemm_plugin auto/disable
  • I tried various values for presencePenalty and frequencyPenalty (unset, 0.05, 0.1, 0.3), but most tests were with 0.1 for both

One concrete example:

python convert_checkpoint.py \
--model_dir /workspace/llama3-70b \
--output_dir /workspace/llama3-70b-bf16-tp4 \
--dtype bfloat16 \
--tp_size 4

trtllm-build \
--checkpoint_dir /workspace/llama3-70b-bf16-tp4 \
--output_dir /workspace/llama3-70b-bf16-tp4-engine \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--use_custom_all_reduce disable \
--max_num_tokens 16384 \
--max_batch_size 24 \
--max_input_len 8192 \
--max_output_len 4096

I also tried running sequentially without batching, and even building the engine with max_batch_size 1, to rule out batching-related bugs (I saw there were a few reported before). I also once tried building with max_input_len 7424 and max_output_len 768 to rule out somehow messing up the RoPE (I am not sure whether max_input_len and max_output_len actually affect that).
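For reference, the sequential (batch size 1) runs through the Python runtime looked roughly like the sketch below. This is only a sketch, assuming TensorRT-LLM's ModelRunner API; keyword arguments may differ between versions, and the tp_size 4 engine needs to be launched with mpirun -n 4:

# Rough sketch of the sequential, batch-size-1 generation path using the
# TensorRT-LLM Python runtime (ModelRunner). Launch with `mpirun -n 4 python ...`
# to match the tp_size 4 engine. Paths and values are placeholders and the
# exact kwargs may vary by TensorRT-LLM version.
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/workspace/llama3-70b")
runner = ModelRunner.from_dir(
    engine_dir="/workspace/llama3-70b-bf16-tp4-engine",
    rank=tensorrt_llm.mpi_rank(),
)

prompt = "<long story context + instruction to continue the story>"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0]

output_ids = runner.generate(
    batch_input_ids=[input_ids],
    max_new_tokens=1024,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
    temperature=1.0,        # placeholder, matched to the vLLM run
    presence_penalty=0.1,
    frequency_penalty=0.1,
)

if tensorrt_llm.mpi_rank() == 0:
    # output_ids is [batch, beams, total_seq_len] and includes the prompt tokens
    print(tokenizer.decode(output_ids[0, 0], skip_special_tokens=True))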

Expected behavior

The outputs should not loop this frequently; there is likely some inference inaccuracy or mismatch.

Actual behavior

The input is usually some part of a story plus an instruction to continue the story. Here is an example output:

 She looks up when she hears me set down her drink.

“Martini,” I say with a smile.

She smiles back at me with her eyes this time.

“Thank you,” she says.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

The repetition is usually at a sentence level like this, but sometimes also several sentences repeat.
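To put a rough number on how often this happens (the ~20% figure above), something along the lines of this hypothetical helper (illustrative only, not part of any library or of my actual test harness) can flag sentence-level loops like the sample above:

# Hypothetical helper (illustrative only) that flags outputs which degenerate
# into sentence-level loops like the sample above.
import re

def has_sentence_loop(text: str, min_repeats: int = 3) -> bool:
    """Return True if any sentence repeats `min_repeats` or more times in a row."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    run = 1
    for prev, cur in zip(sentences, sentences[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_repeats:
            return True
    return False

# Example: fraction of generations that loop
# loop_rate = sum(has_sentence_loop(o) for o in generations) / len(generations)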

Additional notes

I am wondering whether anyone else has experienced similar issues, and whether anyone has done a recent analysis comparing TensorRT-LLM to other inference stacks. I saw that most tests are restricted to short inputs and outputs (e.g. MMLU), which might not exhibit these issues.
