
Repeated outputs for long input tasks on Llama 3 70B compared to vLLM and HF's transformers #1788


Description

@DreamGenX

System Info

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I built the TensorRT-LLM engine in several different ways, outlined below, and compared the output quality on a domain-specific task that involves long inputs (typically >>2000 input tokens and >500 output tokens).

The outputs from TensorRT-LLM (obtained through running the run.py script, as well as through running the GptManager in all of the different modes: V1, InflightBatching, InflightFusedBatching) exhibit repetition ~20% of the time (sample outputs below).

When running the same prompts with vLLM, using the same sampling params (namely temperature, presencePenalty and frequencyPenalty), the outputs do not exhibit these repetitive patterns.
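For reference, the vLLM side of the comparison was roughly equivalent to the minimal sketch below, using vLLM's offline LLM API; the model path, prompt, temperature, and max_tokens are placeholders rather than the exact values from my runs, and the penalties match the 0.1 used in most of the TensorRT-LLM tests:

# Minimal sketch of the vLLM comparison run (offline LLM API).
# Paths, prompt, temperature, and max_tokens are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/workspace/llama3-70b",
    tensor_parallel_size=4,
    dtype="bfloat16",
)

sampling_params = SamplingParams(
    temperature=1.0,        # placeholder value
    presence_penalty=0.1,   # 0.1 used in most tests, as noted below
    frequency_penalty=0.1,
    max_tokens=1024,
)

prompt = "<long story context + instruction to continue the story>"
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)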

Here are some of the ways I tried to build the TensorRT-LLM engine:

  • I tried float16, bfloat16, and also fp8 quantization
  • I tried context_fmha enable/disable and also context_fmha_fp32_acc enable/disable
  • I tried use_custom_all_reduce enable/disable
  • I tried gemm_plugin auto/disable
  • I tried various values for presencePenalty and frequencyPenalty (unset, 0.05, 0.1, 0.3), but most tests were with 0.1 for both

One concrete example:

python convert_checkpoint.py \
--model_dir /workspace/llama3-70b \
--output_dir /workspace/llama3-70b-bf16-tp4 \
--dtype bfloat16 \
--tp_size 4

trtllm-build \
--checkpoint_dir /workspace/llama3-70b-bf16-tp4 \
--output_dir /workspace/llama3-70b-bf16-tp4-engine \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--use_custom_all_reduce disable \
--max_num_tokens 16384 \
--max_batch_size 24 \
--max_input_len 8192 \
--max_output_len 4096

I also tried running sequentially without batching, and even building the engine with max_batch_size 1, to rule out batching-related bugs (I saw there were a few reported before). I also once tried building with max_input_len 7424 and max_output_len 768 to rule out somehow messing up the RoPE (I am not sure whether max_input_len and max_output_len actually affect that).
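For reference, the sequential (batch size 1) runs through the Python runtime looked roughly like the sketch below. This is only a sketch, assuming TensorRT-LLM's ModelRunner API; keyword arguments may differ between versions, and the tp_size 4 engine needs to be launched with mpirun -n 4:

# Rough sketch of the sequential, batch-size-1 generation path using the
# TensorRT-LLM Python runtime (ModelRunner). Launch with `mpirun -n 4 python ...`
# to match the tp_size 4 engine. Paths and values are placeholders and the
# exact kwargs may vary by TensorRT-LLM version.
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/workspace/llama3-70b")
runner = ModelRunner.from_dir(
    engine_dir="/workspace/llama3-70b-bf16-tp4-engine",
    rank=tensorrt_llm.mpi_rank(),
)

prompt = "<long story context + instruction to continue the story>"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0]

output_ids = runner.generate(
    batch_input_ids=[input_ids],
    max_new_tokens=1024,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
    temperature=1.0,        # placeholder, matched to the vLLM run
    presence_penalty=0.1,
    frequency_penalty=0.1,
)

if tensorrt_llm.mpi_rank() == 0:
    # output_ids is [batch, beams, total_seq_len] and includes the prompt tokens
    print(tokenizer.decode(output_ids[0, 0], skip_special_tokens=True))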

Expected behavior

The outputs should not loop this frequently; there is likely some inference inaccuracy or mismatch.

Actual behavior

The input is usually some part of a story plus an instruction to continue the story. Here is an example output:

 She looks up when she hears me set down her drink.

“Martini,” I say with a smile.

She smiles back at me with her eyes this time.

“Thank you,” she says.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

The repetition is usually at a sentence level like this, but sometimes also several sentences repeat.
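To put a rough number on how often this happens (the ~20% figure above), something along the lines of this hypothetical helper (illustrative only, not part of any library or of my actual test harness) can flag sentence-level loops like the sample above:

# Hypothetical helper (illustrative only) that flags outputs which degenerate
# into sentence-level loops like the sample above.
import re

def has_sentence_loop(text: str, min_repeats: int = 3) -> bool:
    """Return True if any sentence repeats `min_repeats` or more times in a row."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    run = 1
    for prev, cur in zip(sentences, sentences[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_repeats:
            return True
    return False

# Example: fraction of generations that loop
# loop_rate = sum(has_sentence_loop(o) for o in generations) / len(generations)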

Additional notes

I am wondering whether anyone else has experienced similar issues, and whether anyone has done a recent analysis comparing TensorRT-LLM to other inference stacks. I saw that most tests are restricted to short inputs and outputs (e.g. MMLU), which might not exhibit these issues.
