Feature Request: Implement PromptBench Module for LLMKit
Description
This issue proposes the implementation of PromptBench, a new module for LLMKit to systematize and evaluate prompt performance for real-world developer use cases (e.g., writing features, debugging code, generating documentation). The module will provide a reusable, testable, and reproducible pipeline for evaluating prompts across multiple models.
The implementation will follow the Minimal Viable Spec (MVS) outlined in the provided documentation, focusing on core functionality, CLI integration, and developer usability.
Goals
- Enable systematic prompt evaluation for tasks like bug fixing, feature development, and documentation.
- Support multiple LLM models (e.g., GPT-4, Claude-3, Command-R) for comparative analysis.
- Provide a reusable prompt template library and evaluation schema.
- Create a foundation for community-driven prompt evaluation datasets (future export to JSONL for HuggingFace or Weights & Biases).
Scope
Core Concepts
- PromptProfile: Defines a single prompt evaluation unit for a task.
  - Example: A profile for fixing a Python sorting bug (`bug_fix_001.yaml`).
  - Fields: `id`, `title`, `task_type`, `description`, `input_code`, `context`, `system_prompt`, `user_prompt`, `tags`.
  - Stored in `llmkit/bench/profiles/`.
- PromptRun: Captures metadata and responses from a prompt run against a specific model.
  - Fields: `run_id`, `prompt_id`, `model`, `response`, `token_usage` (prompt, completion, total), `latency_seconds`, `timestamp`.
  - Stored in `llmkit/bench/runs/`.
- PromptEval: Allows subjective evaluation of LLM responses (manual or automated).
  - Fields: `run_id`, `correctness`, `clarity`, `coherence`, `usefulness`, `notes`.
  - Stored in `llmkit/bench/evals/`.
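The three record types above map naturally onto simple data classes. The following is a minimal sketch: field names mirror the lists above, but the concrete types and defaults (e.g., numeric scores for the evaluation fields) are assumptions, not part of the spec.

```python
# Sketch only: field names follow the spec above; types and defaults are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PromptProfile:
    """A single prompt evaluation unit for a task (stored as YAML/JSON in bench/profiles/)."""
    id: str
    title: str
    task_type: str          # e.g. "bug_fix", "write_doc", "feature_add"
    description: str
    input_code: str
    context: str
    system_prompt: str
    user_prompt: str
    tags: List[str] = field(default_factory=list)


@dataclass
class PromptRun:
    """Metadata and response captured from one prompt run against one model (bench/runs/)."""
    run_id: str
    prompt_id: str
    model: str
    response: str
    token_usage: Dict[str, int]   # {"prompt_tokens": ..., "completion_tokens": ..., "total_tokens": ...}
    latency_seconds: float
    timestamp: str                # ISO 8601, e.g. "2025-06-12T14:00:00Z"


@dataclass
class PromptEval:
    """Manual or automated evaluation of a run (bench/evals/)."""
    run_id: str
    correctness: Optional[float] = None
    clarity: Optional[float] = None
    coherence: Optional[float] = None
    usefulness: Optional[float] = None
    notes: str = ""
```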
Module Structure
```
llmkit/
├── bench/
│   ├── profiles/      # Task definitions (YAML/JSON)
│   ├── runs/          # Model responses (JSON)
│   ├── evals/         # Evaluations (JSON)
│   ├── lib/           # Prompt templates & fragments
│   │   ├── system/    # System prompt templates (e.g., bug_fix.txt)
│   │   ├── user/      # User prompt scaffolding (e.g., bug_fix.jinja)
│   ├── run.py         # CLI wiring for running prompts
│   ├── util.py        # Utilities for loading profiles, sending prompts
│   └── __init__.py    # Core module initialization
├── docs/
│   └── bench.md       # Usage documentation
```
CLI Commands (MVP)
- `llmkit bench run <profile_id> --model <model_name>`: Run a single prompt profile against a model.
- `llmkit bench eval <run_id>`: Evaluate a run interactively or via file input.
- `llmkit bench compare <profile_id> --models <model1>,<model2>`: Compare a task across models.
- `llmkit bench list`: List all available prompt profiles.
- `llmkit bench create`: Create a new profile interactively.
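A rough sketch of how these subcommands could be wired up in `bench/run.py` with `argparse`. The subcommand names and arguments mirror the MVP list; the `--file` option for `eval` and the handler names mentioned in the dispatch comment are hypothetical placeholders, not part of the spec.

```python
# Sketch only: subcommands mirror the MVP list; handlers and the --file option are assumptions.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="llmkit bench")
    sub = parser.add_subparsers(dest="command", required=True)

    p_run = sub.add_parser("run", help="Run a single prompt profile against a model")
    p_run.add_argument("profile_id")
    p_run.add_argument("--model", required=True)

    p_eval = sub.add_parser("eval", help="Evaluate a run interactively or via file input")
    p_eval.add_argument("run_id")
    p_eval.add_argument("--file", help="Optional evaluation file instead of interactive input")

    p_cmp = sub.add_parser("compare", help="Compare a task across models")
    p_cmp.add_argument("profile_id")
    p_cmp.add_argument("--models", required=True, help="Comma-separated model names")

    sub.add_parser("list", help="List all available prompt profiles")
    sub.add_parser("create", help="Create a new profile interactively")
    return parser


def main() -> None:
    args = build_parser().parse_args()
    if args.command == "compare":
        args.models = [m.strip() for m in args.models.split(",")]
    # Dispatch to run_profile / evaluate_run / compare_models / list_profiles / create_profile ...


if __name__ == "__main__":
    main()
```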
Prompt Template Library
- Default system prompt templates and user prompt scaffolding for each task type (e.g., `bug_fix`, `write_doc`, `feature_add`).
- Optional "hints" or completions for prompts.
- Stored in `llmkit/bench/lib/system/` and `llmkit/bench/lib/user/`.
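To illustrate how the scaffolding could be filled in from a profile, here is a small sketch assuming Jinja2 templates under `llmkit/bench/lib/user/` (e.g., `bug_fix.jinja`) and plain-text system templates under `llmkit/bench/lib/system/`. The template variable names (`input_code`, `context`, `request`) are assumptions.

```python
# Sketch only: assumes Jinja2 user templates keyed by task_type, e.g. lib/user/bug_fix.jinja.
from pathlib import Path

from jinja2 import Environment, FileSystemLoader

LIB_DIR = Path("llmkit/bench/lib")


def render_prompts(profile: dict) -> tuple[str, str]:
    """Build (system_prompt, user_prompt) for a profile, falling back to the profile's own text."""
    # System prompt: plain-text template per task type, e.g. lib/system/bug_fix.txt.
    system_path = LIB_DIR / "system" / f"{profile['task_type']}.txt"
    system_prompt = system_path.read_text() if system_path.exists() else profile["system_prompt"]

    # User prompt: Jinja template per task type, filled with profile fields (assumed variable names).
    env = Environment(loader=FileSystemLoader(str(LIB_DIR / "user")))
    template = env.get_template(f"{profile['task_type']}.jinja")
    user_prompt = template.render(
        input_code=profile["input_code"],
        context=profile["context"],
        request=profile["user_prompt"],
    )
    return system_prompt, user_prompt
```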
MVP Feature Set
| Feature | Status |
|---|---|
| Task profiles | To implement |
| Prompt run + response dump | To implement |
| Evaluation schema | To implement |
| CLI runners | To implement |
| Prompt templating | To implement |
| Web view/export | Out of scope |
Acceptance Criteria
- Implement the `PromptBench` module as described in the provided document, with all core concepts (`PromptProfile`, `PromptRun`, `PromptEval`).
- Create the directory structure under `llmkit/bench/` as specified.
- Implement the CLI commands listed in the MVP specification.
- Provide at least two example profiles in `llmkit/examples/bench/profiles/` (e.g., `bug_fix_001.yaml` for a Python bug fix).
- Include default prompt templates for at least one task type (e.g., `bug_fix`) in `llmkit/bench/lib/`.
- Add comprehensive documentation in `docs/bench.md`, covering:
  - Module overview and purpose.
  - Instructions for creating and running profiles.
  - Example CLI commands.
  - How to evaluate and compare results.
- Ensure the module is testable locally with at least one working flow (e.g., running a bug fix task and evaluating its output).
- Verify compatibility with at least two LLM providers (e.g., OpenAI, Anthropic, or Cohere).
- Submit a PR labeled `feat: prompt bench module` with the implementation and documentation.
Development Plan
Local Development:
- Build and test profiles manually to validate the structure.
- Implement core utilities in `util.py` for loading profiles and sending prompts.
- Wire up the CLI in `run.py` for command execution.
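As a starting point for the utilities mentioned above, here is a minimal sketch of `util.py`, assuming YAML profiles under `bench/profiles/` and the OpenAI Python SDK (v1 `chat.completions` interface) as one example provider. Provider abstraction, retries, and error handling are deliberately left out.

```python
# Sketch only: assumes PyYAML for profiles and the OpenAI v1 SDK as an example provider.
import time
from datetime import datetime, timezone
from pathlib import Path

import yaml
from openai import OpenAI

PROFILES_DIR = Path("llmkit/bench/profiles")


def load_profile(profile_id: str) -> dict:
    """Load a PromptProfile from bench/profiles/<profile_id>.yaml."""
    with open(PROFILES_DIR / f"{profile_id}.yaml") as fh:
        return yaml.safe_load(fh)


def send_prompt(profile: dict, model: str) -> dict:
    """Run one profile against one model and return a PromptRun-shaped dict."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": profile["system_prompt"]},
            {"role": "user", "content": profile["user_prompt"]},
        ],
    )
    latency = time.perf_counter() - start
    return {
        "run_id": f"{profile['id']}_{model}",
        "prompt_id": profile["id"],
        "model": model,
        "response": response.choices[0].message.content,
        "token_usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens,
        },
        "latency_seconds": round(latency, 2),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```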
PR Draft:
- Once CLI is stable with one or two working flows, submit a draft PR.
- Include:
  - `docs/bench.md` with usage instructions.
  - Example profiles in `bench/profiles/`.
  - CLI implementation in `bench/run.py`.
  - Utility functions in `bench/util.py`.
- Label the PR `feat: prompt bench module [draft]`.
Testing:
- Test locally with sample profiles and multiple models.
- Ensure error handling for invalid profiles, missing models, or API failures.
- Validate JSON/YAML schemas for profiles, runs, and evals.
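For the schema validation step, a lightweight required-field check is likely enough for the MVP. The sketch below assumes records are plain dicts loaded from YAML/JSON, with the field sets taken from the core concepts section above.

```python
# Sketch only: minimal required-field validation for profiles, runs, and evals.
REQUIRED_FIELDS = {
    "profile": {"id", "title", "task_type", "description", "input_code",
                "context", "system_prompt", "user_prompt", "tags"},
    "run": {"run_id", "prompt_id", "model", "response",
            "token_usage", "latency_seconds", "timestamp"},
    "eval": {"run_id", "correctness", "clarity", "coherence", "usefulness", "notes"},
}


def validate_record(kind: str, record: dict) -> None:
    """Raise ValueError if a loaded profile/run/eval is missing required fields."""
    missing = REQUIRED_FIELDS[kind] - record.keys()
    if missing:
        raise ValueError(f"{kind} record is missing fields: {sorted(missing)}")
```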
Strategic Payoff
- Enhances LLMKit’s utility for developers by providing a standardized prompt evaluation pipeline.
- Generates reusable evaluation data for the open-source community.
- Lays the groundwork for a community-driven prompt evaluation dataset (future export to HuggingFace or Weights & Biases).
Additional Notes
- Out of Scope for MVP:
- Web-based UI for viewing/exporting results.
- Automated LLM-based evaluations (though the schema supports it).
- Future Enhancements:
- Export evaluation data as JSONL for sharing with HuggingFace or Weights & Biases.
- Add support for batch processing of multiple profiles.
- Add visualization tools for comparing model performance.
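The JSONL export mentioned under future enhancements is mostly a serialization concern. One possible shape, assuming eval records are stored as individual JSON files under `bench/evals/` (the function name and paths are hypothetical):

```python
# Sketch only: dump every eval record as one JSON object per line (JSONL).
import json
from pathlib import Path


def export_evals_jsonl(evals_dir: str = "llmkit/bench/evals", out_path: str = "evals.jsonl") -> None:
    with open(out_path, "w") as out:
        for path in sorted(Path(evals_dir).glob("*.json")):
            record = json.loads(path.read_text())
            out.write(json.dumps(record) + "\n")
```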
References
- Based on the `LLMKit.pdf` specification provided.
- Example `PromptProfile`:

```yaml
id: bug_fix_001
title: Fix sorting bug in Python
task_type: bug_fix
description: A basic Python sort function returns incorrect results on duplicate elements.
input_code: |
  def sort(arr):
      # Buggy sorting logic
context: >
  This is part of a larger application used to sort user-generated data.
system_prompt: You are a senior Python engineer...
user_prompt: Please fix the sorting logic.
tags: [python, debugging, bug, development, test]
```

- Example `PromptRun`:

```json
{
  "run_id": "bug_fix_001_gpt-4",
  "prompt_id": "bug_fix_001",
  "model": "gpt-4",
  "response": "...LLM output...",
  "token_usage": {
    "prompt_tokens": 110,
    "completion_tokens": 89,
    "total_tokens": 199
  },
  "latency_seconds": 1.7,
  "timestamp": "2025-06-12T14:00:00Z"
}
```
Assignees
- TBD.
Labels
- `feature`, `enhancement`, `cli`, `prompt-engineering`
Milestone
- TBD (set to next release milestone or create a new one for PromptBench).