Different prompts generated for different concurrency levels despite same seed/config #222

@Harshith-umesh

Summary

When running the concurrent benchmark across multiple concurrency levels (1, 2, 4, 8, 16, 32, 64), guidellm generates a different set of prompts for each concurrency level, even when the same seed and synthetic data generation configuration are used. This makes it impossible to do a fair "apples-to-apples" comparison across concurrency levels.

Expected Behavior

When using the same seed and data configuration, every concurrency level should receive the same set of prompts, so that results can be compared fairly across levels.
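
As a minimal illustration of that expectation (this is not guidellm's actual generator, just an analogy): two samplers built from the same seed should produce identical sequences, no matter how many requests are later issued concurrently downstream.

import random

# Hypothetical stand-in for the synthetic data generator; guidellm's real
# implementation differs, but the determinism property should hold:
# same seed in, same prompt-length sequence out.
def sample_prompt_lengths(seed: int, n: int) -> list[int]:
    rng = random.Random(seed)
    return [rng.randint(1, 4096) for _ in range(n)]

# Re-creating the sampler with the same seed yields the same sequence,
# which is what the concurrency-1 and concurrency-64 runs should share.
assert sample_prompt_lengths(42, 8) == sample_prompt_lengths(42, 8)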

Evidence from Benchmark Results

Here's the analysis from a benchmark run with concurrency levels [1, 2, 4, 8, 16, 32, 64]:

Configuration (identical for all benchmarks):

{
  "request_loader": {
    "type_": "generative_request_loader",
    "data": "{\"prompt_tokens\":1000,\"prompt_tokens_min\":1,\"prompt_tokens_max\":4096,\"output_tokens\":1000,\"output_tokens_min\":1,\"output_tokens_max\":4096}",
    "processor": "meta-llama/Llama-3.1-8B-Instruct",
    "data_args": null,
    "processor_args": null
  }
}
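
To double-check that the loader configuration really is byte-identical across all seven benchmark entries, something like the following sketch can be run over the saved results file. The file name (benchmarks.json) and the top-level "benchmarks" key are assumptions about the output layout; adjust them to match the actual file.

import json

# Hedged sketch: assumes a top-level "benchmarks" list, each entry
# carrying the "request_loader" block shown above.
with open("benchmarks.json") as f:
    results = json.load(f)

loader_configs = {
    json.dumps(b["request_loader"], sort_keys=True)
    for b in results["benchmarks"]
}

# A single distinct configuration confirms the data settings are
# identical for every concurrency level.
print(len(loader_configs), "distinct request_loader configuration(s)")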

Different First Prompts by Concurrency Level:

  • Concurrency 1: "such a sacrifice to her advantage as years of gratitude cannot enough acknowledge. By this time she is actually with them! If such goodness does not m..."

  • Concurrency 2: ". How unlucky that you should have a reasonable answer to give, and that I should be so reasonable as to admit it! But I wonder how long you would h..."

  • Concurrency 4: "rejection. He who, she had been persuaded, would avoid her as his greatest enemy, seemed, on this accidental meeting, most eager to preserve the acqua..."

  • Concurrency 8: "aunt now began; and each of them pronounced him to be infinitely superior to anything they had expected. He is perfectly well behaved,..."

  • Concurrency 16: "you speak ill of a human being in my life. I would wish not to be hasty in censuring anyone; but I always speak what I think. I know you do and it is ..."

  • Concurrency 32: "herself the reason, she had very little to say in reply. Mr. Gardiner left them soon after breakfast. The fishing scheme had been renewed the day befo..."

  • Concurrency 64: ". Bingley did not take the house so much for the convenience of the neighbourhood as for his own, and we must expect him to keep or quit it on the sam..."

Environment:

  • Platform: OpenShift
  • Backend: Ollama (llama3.1:8b-instruct-fp16)
  • Data source: Synthetic text generation using Pride and Prejudice

Reproduction Steps

  1. Run a benchmark sweep with multiple concurrency levels:
    guidellm --target "<your llm server>" --data '{"prompt_tokens":1000,"prompt_tokens_min":1,"prompt_tokens_max":4096,"output_tokens":1000,"output_tokens_min":1,"output_tokens_max":4096}' --model llama3.1:8b-instruct-fp16 --rate-type concurrent --rate 1,2,4,8,16,32,64
  2. Examine the resulting benchmark output file.
  3. Compare the prompts across the different concurrency levels (see the sketch after this list).
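
A rough sketch of steps 2 and 3, assuming the results were saved to a file named benchmarks.json; the key names used below ("benchmarks", "args", "rate", "requests", "successful", "prompt") are guesses about the output layout and may need adjusting to the actual file:

import json

with open("benchmarks.json") as f:
    results = json.load(f)

for bench in results["benchmarks"]:
    level = bench.get("args", {}).get("rate")  # concurrency level; key name assumed
    first_prompt = bench["requests"]["successful"][0]["prompt"]  # key names assumed
    print(level, repr(first_prompt[:80]))

If the seed were honored, every line printed above would start with the same text; in the run reported here, each one differs.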

Labels

internal (filed by core contributor or associate)
