This repository contains a collection of evaluation instruments with a focus on AI use cases for healthcare. The instruments were developed to support both manual evaluation by trained evaluators and automated evaluation approaches such as deterministic statistical methods or probabilistic LLM-as-a-Judge. The instruments were initially calibrated to support specific use cases and should be reviewed prior to use if the planned use case differs.
The instruments provide a consistent interface for inputs and outputs to aid reuse across use cases, composition into pipelines, and integration with seismometer for analysis. If an instrument requires an LLM, following the litellm SDK's API protocol is encouraged to streamline usage across language models.
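As an illustration of that protocol, the sketch below uses litellm's `completion` call, which accepts an OpenAI-style messages array and returns an OpenAI-format response regardless of the backing provider. The model name is a placeholder, and credentials are assumed to be supplied through the provider's usual environment variables.

```python
import litellm

# Any provider supported by litellm can be selected via the model string.
response = litellm.completion(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize these notes for a cardiologist."}],
)
print(response.choices[0].message.content)
```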
The package does not provide direct support for calling generative models: if you have model output to evaluate, chances are good you already have a way to generate it, so the goal is a lightweight package that fits into that existing ecosystem.
This package is not currently published to PyPI, but it can be installed directly from GitHub using pip. This is the recommended approach for programmatic use (e.g., from scripts or services):
```
pip install git+https://github.com/epic-open-source/[email protected]
```

Note: It is recommended to pin to a specific release tag (e.g., `@v0.0.2`) to ensure reproducibility and avoid unexpected changes from future updates.
This installs the package along with evaluation instruments — rubrics, prompts, and supporting logic — used to assess AI-generated outputs in specific clinical contexts. Instruments are included for completeness but are not required for all workflows or runtime use.
Alternatively, you can clone the repository into a directory accessible to your Python environment:
```
git clone https://github.com/epic-open-source/evaluation-instruments.git
```

and then run:
```
pip install .
```

Navigate to the instrument of your choice, such as the PDSQI-9.
Using an instrument typically involves two steps:
- **Configure your backing model** - The package follows the LiteLLM protocol for model integration: you'll need a function that accepts a messages array and returns an OpenAI-style JSON response. Using LiteLLM directly can simplify this to setting a few environment variables (see the sketch after this list).
- **Align your data** - The examples use a pattern where content for evaluation is stored in separate files to minimize memory usage, and a DataFrame is built to store the paths to those files. File contents are instrument and use-case dependent; for PDSQI-9, each file contains (see the data sketch after this list):
  - `summary`: the text string being evaluated
  - `notes`: a list of text representing the raw information being summarized
  - `target_specialty`: the specialty of the target user
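As a sketch of the first step (configuring your backing model), the function below wraps the OpenAI Python SDK in a callable of the expected shape: it takes a model name and a messages array and returns an OpenAI-style response as a plain dict. The function name `completion_fn` and the client choice are illustrative assumptions, not requirements of this package.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def completion_fn(model: str, messages: list[dict]) -> dict:
    """Illustrative backing-model callable: messages in, OpenAI-style JSON out."""
    response = client.chat.completions.create(model=model, messages=messages)
    # Convert the SDK's response object into a plain dict in the OpenAI format.
    return response.model_dump()
```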
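And as a sketch of the second step (aligning your data), the snippet below writes a single hypothetical PDSQI-9 case file and builds a DataFrame of file paths. The directory, file name, JSON layout, and column name are illustrative assumptions.

```python
import json
from pathlib import Path

import pandas as pd

# One file per case, holding the content the instrument will evaluate.
case_path = Path("cases/encounter_001.json")  # hypothetical path
case_path.parent.mkdir(parents=True, exist_ok=True)
case_path.write_text(json.dumps({
    "summary": "Patient seen for ...",                # text being evaluated
    "notes": ["Note 1 text ...", "Note 2 text ..."],  # raw source notes
    "target_specialty": "Cardiology",                 # intended audience
}))

# The DataFrame stores only paths, keeping memory usage low.
df = pd.DataFrame({"file_path": [str(case_path)]})
```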
When running evaluations, you can set a `max_tokens` threshold to stop the run after the first request that exceeds that limit. For finer-grained control, consider using your model provider's token consumption monitoring and limiting features.
The evaluation process follows a three-step pipeline for each row of input data, initiated by `Evaluator.run_dataset`:
1. `prompt = prep_fn(row)` - generates a message array from the input
2. `raw_output = completion_fn(model, prompt)` - gets the model response
3. `parsed, usage = post_process_fn(raw_output)` - parses the model output
DataFrame --> prep_fn --> completion_fn --> post_process_fn --> parsed results
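A minimal sketch of that per-row flow is shown below, assuming the three functions have the shapes listed above; in practice `Evaluator.run_dataset` drives this loop for you, so the snippet is only meant to make the data flow concrete.

```python
def evaluate_row(row, model, prep_fn, completion_fn, post_process_fn):
    prompt = prep_fn(row)                        # message array for this row
    raw_output = completion_fn(model, prompt)    # OpenAI-style model response
    parsed, usage = post_process_fn(raw_output)  # parsed result plus token usage
    return parsed, usage

# Illustrative use over a DataFrame of rows (all names are placeholders):
# results = [evaluate_row(row, "gpt-4o-mini", prep_fn, completion_fn, post_process_fn)
#            for _, row in df.iterrows()]
```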
Tip: If `log_enabled` is set, all raw outputs are saved to disk with timestamps under `evaluation_logs/`.