New Feature Idea: Prompt R&D pipeline #41

@DarkStarStrix

Feature Request: Implement PromptBench Module for LLMKit

Description

This issue proposes the implementation of PromptBench, a new module for LLMKit to systematize and evaluate prompt performance for real-world developer use cases (e.g., writing features, debugging code, generating documentation). The module will provide a reusable, testable, and reproducible pipeline for evaluating prompts across multiple models.

The implementation will follow the Minimal Viable Spec (MVS) outlined in the provided documentation, focusing on core functionality, CLI integration, and developer usability.

Goals

  • Enable systematic prompt evaluation for tasks like bug fixing, feature development, and documentation.
  • Support multiple LLM models (e.g., GPT-4, Claude-3, Command-R) for comparative analysis.
  • Provide a reusable prompt template library and evaluation schema.
  • Create a foundation for community-driven prompt evaluation datasets (future export to JSONL for HuggingFace or Weights & Biases).

Scope

Core Concepts

  1. PromptProfile: Defines a single prompt evaluation unit for a task.
    • Example: A profile for fixing a Python sorting bug (bug_fix_001.yaml).
    • Fields: id, title, task_type, description, input_code, context, system_prompt, user_prompt, tags.
    • Stored in llmkit/bench/profiles/.
  2. PromptRun: Captures metadata and responses from a prompt run against a specific model.
    • Fields: run_id, prompt_id, model, response, token_usage (prompt, completion, total), latency_seconds, timestamp.
    • Stored in llmkit/bench/runs/.
  3. PromptEval: Allows subjective evaluation of LLM responses (manual or automated).
    • Fields: run_id, correctness, clarity, coherence, usefulness, notes.
    • Stored in llmkit/bench/evals/.
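
One way to picture these three records is as plain Python dataclasses. This is a sketch only: the field names come from the lists above, while the concrete types, defaults, and the 1-5 scoring scale for evaluations are assumptions, not part of the spec.

# Sketch only: field names follow the spec above; types, defaults, and the
# 1-5 scoring scale are assumptions.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PromptProfile:
    id: str
    title: str
    task_type: str              # e.g. "bug_fix", "write_doc", "feature_add"
    description: str
    input_code: str
    context: str
    system_prompt: str
    user_prompt: str
    tags: list[str] = field(default_factory=list)


@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


@dataclass
class PromptRun:
    run_id: str
    prompt_id: str
    model: str
    response: str
    token_usage: TokenUsage
    latency_seconds: float
    timestamp: str              # ISO 8601, e.g. "2025-06-12T14:00:00Z"


@dataclass
class PromptEval:
    run_id: str
    correctness: int            # assumed 1-5 subjective score
    clarity: int
    coherence: int
    usefulness: int
    notes: Optional[str] = None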

Module Structure

llmkit/
├── bench/
│   ├── profiles/          # Task definitions (YAML/JSON)
│   ├── runs/              # Model responses (JSON)
│   ├── evals/             # Evaluations (JSON)
│   ├── lib/               # Prompt templates & fragments
│   │   ├── system/        # System prompt templates (e.g., bug_fix.txt)
│   │   └── user/          # User prompt scaffolding (e.g., bug_fix.jinja)
│   ├── run.py             # CLI wiring for running prompts
│   ├── util.py            # Utilities for loading profiles, sending prompts
│   └── __init__.py        # Core module initialization
├── docs/
│   └── bench.md           # Usage documentation

CLI Commands (MVP)

  • llmkit bench run <profile_id> --model <model_name>: Run a single prompt profile against a model.
  • llmkit bench eval <run_id>: Evaluate a run interactively or via file input.
  • llmkit bench compare <profile_id> --models <model1>,<model2>: Compare a task across models.
  • llmkit bench list: List all available prompt profiles.
  • llmkit bench create: Create a new profile interactively.
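
A minimal way to wire these subcommands is an argparse-based entry point in bench/run.py. The sketch below is illustrative only; load_profile and send_prompt are hypothetical helper names assumed to live in bench/util.py.

# Illustrative sketch of bench/run.py; load_profile and send_prompt are
# hypothetical helpers assumed to live in bench/util.py.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="llmkit bench")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="Run a single prompt profile against a model")
    run.add_argument("profile_id")
    run.add_argument("--model", required=True)

    ev = sub.add_parser("eval", help="Evaluate a run interactively or via file input")
    ev.add_argument("run_id")

    cmp_ = sub.add_parser("compare", help="Compare a task across models")
    cmp_.add_argument("profile_id")
    cmp_.add_argument("--models", help="Comma-separated model names, e.g. gpt-4,claude-3")

    sub.add_parser("list", help="List all available prompt profiles")
    sub.add_parser("create", help="Create a new profile interactively")
    return parser


def main() -> None:
    args = build_parser().parse_args()
    # Dispatch to the relevant helper; only the "run" path is sketched here.
    if args.command == "run":
        from llmkit.bench.util import load_profile, send_prompt  # hypothetical
        profile = load_profile(args.profile_id)
        result = send_prompt(profile, model=args.model)
        print(result)


if __name__ == "__main__":
    main()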

Prompt Template Library

  • Default system prompt templates and user prompt scaffolding for each task type (e.g., bug_fix, write_doc, feature_add).
  • Optional "hints" or completions for prompts.
  • Stored in llmkit/bench/lib/system/ and llmkit/bench/lib/user/.
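
As an illustration of how the user scaffolding could work, the snippet below renders a bug_fix-style template with jinja2. The template text and variable names are assumptions, not the shipped templates.

# Sketch only: the template text and variable names are assumptions.
from jinja2 import Template

BUG_FIX_TEMPLATE = Template(
    "Task: {{ description }}\n\n"
    "Code under test:\n{{ input_code }}\n\n"
    "Context: {{ context }}\n\n"
    "{{ user_prompt }}"
)

rendered = BUG_FIX_TEMPLATE.render(
    description="A basic Python sort function returns incorrect results on duplicate elements.",
    input_code="def sort(arr):\n    # Buggy sorting logic",
    context="This is part of a larger application used to sort user-generated data.",
    user_prompt="Please fix the sorting logic.",
)
print(rendered)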

MVP Feature Set

Feature                       Status
-------                       ------
Task profiles                 To implement
Prompt run + response dump    To implement
Evaluation schema             To implement
CLI runners                   To implement
Prompt templating             To implement
Web view/export               Out of scope

Acceptance Criteria

  1. Implement the PromptBench module as described in the provided document, with all core concepts (PromptProfile, PromptRun, PromptEval).
  2. Create the directory structure under llmkit/bench/ as specified.
  3. Implement the CLI commands listed in the MVP specification.
  4. Provide at least two example profiles in llmkit/examples/bench/profiles/ (e.g., bug_fix_001.yaml for a Python bug fix).
  5. Include default prompt templates for at least one task type (e.g., bug_fix) in llmkit/bench/lib/.
  6. Add comprehensive documentation in docs/bench.md, covering:
    • Module overview and purpose.
    • Instructions for creating and running profiles.
    • Example CLI commands.
    • How to evaluate and compare results.
  7. Ensure the module is testable locally with at least one working flow (e.g., running a bug fix task and evaluating its output).
  8. Verify compatibility with at least two LLM providers (e.g., OpenAI, Anthropic, or Cohere).
  9. Submit a PR labeled as [feat: prompt bench module] with the implementation and documentation.
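
To make criterion 7 concrete, a local smoke test could exercise one flow end to end without hitting a provider. The test below is only a sketch; fake_send_prompt stands in for whatever helper util.py ends up exposing.

# Sketch of a local smoke test for the bug_fix flow; fake_send_prompt is a
# stand-in for the real helper so the test can run offline.
def fake_send_prompt(profile, model):
    return {
        "run_id": f"{profile['id']}_{model}",
        "prompt_id": profile["id"],
        "model": model,
        "response": "def sort(arr):\n    return sorted(arr)",
    }


def test_bug_fix_flow_produces_a_run():
    profile = {
        "id": "bug_fix_001",
        "task_type": "bug_fix",
        "system_prompt": "You are a senior Python engineer...",
        "user_prompt": "Please fix the sorting logic.",
    }
    run = fake_send_prompt(profile, model="gpt-4")
    assert run["prompt_id"] == profile["id"]
    assert run["response"]          # a non-empty response was captured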

Development Plan

  1. Local Development:

    • Build and test profiles manually to validate the structure.
    • Implement core utilities in util.py for loading profiles and sending prompts.
    • Wire up CLI in run.py for command execution.
  2. PR Draft:

    • Once the CLI is stable with one or two working flows, submit a draft PR.
    • Include:
      • docs/bench.md with usage instructions.
      • Example profiles in bench/profiles/.
      • CLI implementation in bench/run.py.
      • Utility functions in bench/util.py.
    • Label PR as [feat: prompt bench module [draft]].
  3. Testing:

    • Test locally with sample profiles and multiple models.
    • Ensure error handling for invalid profiles, missing models, or API failures.
    • Validate JSON/YAML schemas for profiles, runs, and evals.
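
One lightweight way to cover the schema-validation point is to check required keys as soon as a profile is loaded. The helper below is a sketch using the PromptProfile field list from above; the function name and error handling are assumptions.

# Sketch of profile validation for bench/util.py; the function name and the
# ValueError message are assumptions.
from pathlib import Path

import yaml

REQUIRED_PROFILE_FIELDS = {
    "id", "title", "task_type", "description",
    "input_code", "context", "system_prompt", "user_prompt", "tags",
}


def load_profile(path):
    """Load a YAML profile and fail early if required fields are missing."""
    data = yaml.safe_load(Path(path).read_text())
    missing = REQUIRED_PROFILE_FIELDS - set(data)
    if missing:
        raise ValueError(f"profile {path} is missing fields: {sorted(missing)}")
    return data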

Strategic Payoff

  • Enhances LLMKit’s utility for developers by providing a standardized prompt evaluation pipeline.
  • Generates reusable evaluation data for the open-source community.
  • Lays the groundwork for a community-driven prompt evaluation dataset (future export to HuggingFace or Weights & Biases).

Additional Notes

  • Out of Scope for MVP:
    • Web-based UI for viewing/exporting results.
    • Automated LLM-based evaluations (though the schema supports it).
  • Future Enhancements:
    • Export evaluation data as JSONL for sharing with HuggingFace or Weights & Biases.
    • Add support for batch processing of multiple profiles.
    • Add visualization tools for comparing model performance.
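
For the future JSONL export, joining each run with its eval and writing one JSON object per line would likely be enough for a first pass. The sketch below assumes the runs/ and evals/ layout from the spec; the join on run_id and the output format are assumptions.

# Sketch of a future JSONL export; directory names follow the spec, while the
# join on run_id and the output format are assumptions.
import json
from pathlib import Path


def export_jsonl(runs_dir, evals_dir, out_path):
    evals = {}
    for eval_file in Path(evals_dir).glob("*.json"):
        record = json.loads(eval_file.read_text())
        evals[record["run_id"]] = record

    with Path(out_path).open("w") as out:
        for run_file in Path(runs_dir).glob("*.json"):
            run = json.loads(run_file.read_text())
            run["eval"] = evals.get(run["run_id"])   # None if not yet evaluated
            out.write(json.dumps(run) + "\n")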

References

  • Based on the LLMKit.pdf specification provided.
  • Example PromptProfile:
    id: bug_fix_001
    title: Fix sorting bug in Python
    task_type: bug_fix
    description: A basic Python sort function returns incorrect results on duplicate elements.
    input_code: |
      def sort(arr):
          # Buggy sorting logic
    context: >
      This is part of a larger application used to sort user-generated data.
    system_prompt: You are a senior Python engineer...
    user_prompt: Please fix the sorting logic.
    tags: [python, debugging, bug, development, test]
    
  • Example PromptRun:
    {
        "run_id": "bug_fix_001_gpt-4",
        "prompt_id": "bug_fix_001",
        "model": "gpt-4",
        "response": "...LLM output...",
        "token_usage": {
            "prompt_tokens": 110,
            "completion_tokens": 89,
            "total_tokens": 199
        },
        "latency_seconds": 1.7,
        "timestamp": "2025-06-12T14:00:00Z"
    }
    

Assignees

  • TBD.

Labels

  • feature
  • enhancement
  • cli
  • prompt-engineering

Milestone

  • TBD (set to next release milestone or create a new one for PromptBench).

LLMkit.pdf
