
Commit 779139c

Dev/steven/jb eval (#55)
* Multi-turn Jailbreak eval results
* Delete old jailbreak ROC curve image
* Update data description
* Data description correction
1 parent 967337c commit 779139c

File tree: 4 files changed (+34, -22 lines)

Binary ROC curve images: 214 KB added, 205 KB deleted (binary files not shown).

docs/ref/checks/jailbreak.md (17 additions, 14 deletions)

@@ -96,37 +96,40 @@ When conversation history is available (e.g., in chat applications or agent work
 
 ### Dataset Description
 
-This benchmark evaluates model performance on a diverse set of prompts:
+This benchmark combines multiple public datasets and synthetic benign conversations:
 
-- **Subset of the open source jailbreak dataset [JailbreakV-28k](https://huggingface.co/datasets/JailbreakV-28K/JailBreakV-28k)** (n=2,000)
-- **Synthetic prompts** covering a diverse range of benign topics (n=1,000)
-- **Open source [Toxicity](https://github.com/surge-ai/toxicity/blob/main/toxicity_en.csv) dataset** containing harmful content that does not involve jailbreak attempts (n=1,000)
+- **Red Queen jailbreak corpus ([GitHub](https://github.com/kriti-hippo/red_queen/blob/main/Data/Red_Queen_Attack.zip))**: 14,000 positive samples collected with gpt-4o attacks.
+- **Tom Gibbs multi-turn jailbreak attacks ([Hugging Face](https://huggingface.co/datasets/tom-gibbs/multi-turn_jailbreak_attack_datasets/tree/main))**: 4,136 positive samples.
+- **Scale MHJ dataset ([Hugging Face](https://huggingface.co/datasets/ScaleAI/mhj))**: 537 positive samples.
+- **Synthetic benign conversations**: 12,433 negative samples generated by seeding prompts from [WildGuardMix](https://huggingface.co/datasets/allenai/wildguardmix?utm_source=chatgpt.com) where `adversarial=false` and `prompt_harm_label=false`, then expanding each single-turn input into five-turn dialogues using gpt-4.1.
 
-**Total n = 4,000; positive class prevalence = 2,000 (50.0%)**
+**Total n = 31,106; positives = 18,673; negatives = 12,433**
+
+For benchmarking, we randomly sampled 4,000 conversations from this pool using a 50/50 split between positive and negative samples.
 
 ### Results
 
 #### ROC Curve
 
-![ROC Curve](../../benchmarking/jailbreak_roc_curve.png)
+![ROC Curve](../../benchmarking/Jailbreak_roc_curves.png)
 
 #### Metrics Table
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.982 | 0.984 | 0.977 | 0.977 | 0.743 |
-| gpt-5-mini | 0.980 | 0.980 | 0.976 | 0.975 | 0.734 |
-| gpt-4.1 | 0.979 | 0.975 | 0.975 | 0.975 | 0.661 |
-| gpt-4.1-mini (default) | 0.979 | 0.974 | 0.972 | 0.972 | 0.654 |
+| gpt-5 | 0.994 | 0.993 | 0.993 | 0.993 | 0.997 |
+| gpt-5-mini | 0.813 | 0.832 | 0.832 | 0.832 | 0.000 |
+| gpt-4.1 | 0.999 | 0.999 | 0.999 | 0.999 | 1.000 |
+| gpt-4.1-mini (default) | 0.928 | 0.968 | 0.968 | 0.500 | 0.000 |
 
 #### Latency Performance
 
 | Model | TTC P50 (ms) | TTC P95 (ms) |
 |--------------|--------------|--------------|
-| gpt-5 | 4,569 | 7,256 |
-| gpt-5-mini | 5,019 | 9,212 |
-| gpt-4.1 | 841 | 1,861 |
-| gpt-4.1-mini | 749 | 1,291 |
+| gpt-5 | 7,370 | 12,218 |
+| gpt-5-mini | 7,055 | 11,579 |
+| gpt-4.1 | 2,998 | 4,204 |
+| gpt-4.1-mini | 1,538 | 2,089 |
 
 **Notes:**
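For readers reproducing the metrics table, here is a minimal sketch of how Prec@R and Recall@FPR operating points can be derived from per-sample labels and confidence scores. The helper names below are illustrative, not part of this repo:

```python
from sklearn.metrics import precision_recall_curve, roc_curve

def precision_at_recall(y_true, y_scores, target_recall: float) -> float:
    """Best precision among thresholds whose recall is >= target_recall."""
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    return float(precision[recall >= target_recall].max())

def recall_at_fpr(y_true, y_scores, target_fpr: float) -> float:
    """Best recall (TPR) among thresholds whose FPR is <= target_fpr."""
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    return float(tpr[fpr <= target_fpr].max())

# Usage against eval outputs (hypothetical arrays):
# precision_at_recall(y_true, y_scores, 0.80)  # Prec@R=0.80
# recall_at_fpr(y_true, y_scores, 0.01)        # Recall@FPR=0.01
```

Under this reading, a Recall@FPR=0.01 of 0.000 (as reported for gpt-5-mini and gpt-4.1-mini) can simply mean that no non-trivial threshold achieves a false-positive rate at or below 1%, which is common when confidence scores cluster at a few discrete values.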

src/guardrails/evals/core/visualizer.py (17 additions, 8 deletions)

@@ -12,6 +12,7 @@
 import matplotlib.pyplot as plt
 import numpy as np
 import seaborn as sns
+from sklearn.metrics import roc_auc_score, roc_curve
 
 logger = logging.getLogger(__name__)
 

@@ -111,10 +112,8 @@ def create_roc_curves(self, results_by_model: dict[str, list[Any]], guardrail_na
                 continue
 
             try:
-                from sklearn.metrics import roc_curve
-
                 fpr, tpr, _ = roc_curve(y_true, y_scores)
-                roc_auc = np.trapz(tpr, fpr)
+                roc_auc = roc_auc_score(y_true, y_scores)
                 ax.plot(fpr, tpr, label=f"{model_name} (AUC = {roc_auc:.3f})", linewidth=2)
             except Exception as e:
                 logger.error("Failed to calculate ROC curve for model %s: %s", model_name, e)
144143
y_scores = []
145144

146145
for result in results:
147-
if guardrail_name in result.expected_triggers:
148-
expected = result.expected_triggers[guardrail_name]
149-
actual = result.triggered.get(guardrail_name, False)
146+
if guardrail_name not in result.expected_triggers:
147+
logger.warning("Guardrail '%s' not found in expected_triggers for sample %s", guardrail_name, result.id)
148+
continue
150149

151-
y_true.append(1 if expected else 0)
152-
y_scores.append(1 if actual else 0)
150+
expected = result.expected_triggers[guardrail_name]
151+
y_true.append(1 if expected else 0)
152+
y_scores.append(self._get_confidence_score(result, guardrail_name))
153153

154154
return y_true, y_scores
155155

156+
def _get_confidence_score(self, result: Any, guardrail_name: str) -> float:
157+
"""Extract the model-reported confidence score for plotting."""
158+
if guardrail_name in result.details:
159+
guardrail_details = result.details[guardrail_name]
160+
if isinstance(guardrail_details, dict) and "confidence" in guardrail_details:
161+
return float(guardrail_details["confidence"])
162+
163+
return 1.0 if result.triggered.get(guardrail_name, False) else 0.0
164+
156165
def create_latency_comparison_chart(self, latency_results: dict[str, dict[str, Any]]) -> Path:
157166
"""Create a chart comparing latency across models."""
158167
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
