
Commit 9c92bc0

Corrections
1 parent 0ec578c commit 9c92bc0

10 files changed, +309 -4 lines changed

src/langsmith/analyze-single-experiment.mdx

Lines changed: 1 addition & 1 deletion

@@ -33,7 +33,7 @@ You can customize the columns using the **Display** button to make it easier to

You can also set the high, middle, and low thresholds for numeric feedback scores in your experiment, which affects the threshold at which score chips render as red or green:

- ![Column heatmap configuration](/langsmith/column_heat_map.png)
+ ![Column heatmap configuration](/langsmith/images/column-heat-map.png)

### Sort and filter

src/langsmith/export-backend.mdx

Lines changed: 2 additions & 2 deletions

@@ -1,6 +1,6 @@
---
- title: Exporting LangSmith telemetry to your observability backend
- sidebarTitle: Exporting LangSmith telemetry to your observability backend
+ title: Export LangSmith telemetry to your observability backend
+ sidebarTitle: Export LangSmith telemetry to your observability backend
---

<Warning>

src/langsmith/langchain-runnable.mdx

Lines changed: 1 addition & 1 deletion

@@ -8,7 +8,7 @@ sidebarTitle: Evaluate a runnable
* Runnable: [Python](https://python.langchain.com/docs/concepts/runnables/) and [JS/TS](https://js.langchain.com/docs/concepts/runnables/)
</Info>

- `langchain` [Runnable](/docs/concepts/runnables/) objects (such as chat models, retrievers, chains, etc.) can be passed directly into `evaluate()` / `aevaluate()`.
+ `langchain` [Runnable](https://python.langchain.com/docs/concepts/runnables/) objects (such as chat models, retrievers, chains, etc.) can be passed directly into `evaluate()` / `aevaluate()`.

## Setup

src/langsmith/swe-benchmark.mdx

Lines changed: 305 additions & 0 deletions

@@ -0,0 +1,305 @@

---
title: Run SWE-bench with LangSmith
sidebarTitle: Run SWE-bench
---

SWE-bench is one of the most popular (and most difficult!) benchmarks for developers to test their coding agents against. In this walkthrough we will show you how to load the SWE-bench dataset into LangSmith and easily run evals on it, giving you much better visibility into your agent's behaviour than the off-the-shelf SWE-bench eval suite provides. This lets you pinpoint specific problems faster and iterate on your agent rapidly to improve performance!

## Loading the data

To load the data, we will pull the `dev` split from Hugging Face, but for your use case you may wish to pull the `test` or `train` split instead. If you want to combine multiple splits, you can use `pd.concat` (see the sketch after the code block below).

```python
import pandas as pd

splits = {
    'dev': 'data/dev-00000-of-00001.parquet',
    'test': 'data/test-00000-of-00001.parquet',
    'train': 'data/train-00000-of-00001.parquet'
}

df = pd.read_parquet("hf://datasets/princeton-nlp/SWE-bench/" + splits["dev"])
```
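
If you do want to combine splits, a minimal sketch might look like this (it reuses the `splits` mapping above; which splits you stack is up to you):

```python
import pandas as pd

splits = {
    'dev': 'data/dev-00000-of-00001.parquet',
    'test': 'data/test-00000-of-00001.parquet',
    'train': 'data/train-00000-of-00001.parquet'
}

# Load each desired split and stack them into a single DataFrame
base = "hf://datasets/princeton-nlp/SWE-bench/"
df = pd.concat(
    [pd.read_parquet(base + splits[s]) for s in ("dev", "test")],
    ignore_index=True,
)
```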

### Editing the 'version' column

<Note>
This is a very important step! If you skip it, the rest of the code WILL NOT WORK!
</Note>

The `version` column contains only string values, but they are all in float format, so they get converted to floats when you upload the CSV to create a LangSmith dataset. Although you can convert the values back to strings during your experiments, the issue arises with values like `"0.10"`: when converted to a float it becomes `0.1`, which would become `"0.1"` if you converted it back to a string, causing a key error during execution of your proposed patch.

To fix this, we need LangSmith to stop trying to convert the `version` column to floats. To do this, we can simply add a string prefix to each value so that it is no longer float-compatible. We then need to split on this prefix during evaluation to recover the actual `version` value. The prefix we choose here is the string `"version:"`.

<Note>
The ability to select column types when uploading a CSV to LangSmith will be added in the future to avoid having to use this workaround.
</Note>

```python
df['version'] = df['version'].apply(lambda x: f"version:{x}")
```
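
To see concretely why the prefix matters, here is a minimal sketch using the `"0.10"` example from above:

```python
# Without the prefix, "0.10" loses its trailing zero on the float round trip,
# so it no longer matches the original "0.10" key.
assert str(float("0.10")) == "0.1"

# With the prefix, the value is never parsed as a float, and the original
# string is recovered by splitting on the ":".
prefixed = "version:0.10"
assert prefixed.split(":")[1] == "0.10"
```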

## Upload the data to LangSmith

### Save to CSV

To upload the data to LangSmith, we first need to save it to a CSV, which we can do using the `to_csv` function provided by pandas. Make sure to save this file somewhere that is easily accessible to you.

```python
df.to_csv("./../SWE-bench.csv", index=False)
```

### Upload CSV to LangSmith manually

We are now ready to upload the CSV to LangSmith. Once you are on the LangSmith website (smith.langchain.com), go to the `Datasets & Testing` tab in the left-side navigation bar, then click the `+ New Dataset` button in the top right corner.

Then click the `Upload CSV` button at the top, and select the CSV file you saved in the previous step. You can then give your dataset a name and description.

Next, select `Key-Value` as the dataset type. Lastly, head to the `Create Schema` section and add ALL OF THE KEYS as `Input fields`. There are no `Output fields` in this example because our evaluator is not comparing against a reference; instead, it will run the output of our experiments in Docker containers to ensure that the code actually solves the PR issue.

Once you have populated the `Input fields` (and left the `Output fields` empty!), you can click the blue `Create` button in the top right corner, and your dataset will be created!

### Upload CSV to LangSmith programmatically

Alternatively, you can upload your CSV to LangSmith using the SDK, as shown in the code block below:

```python
from langsmith import Client

client = Client()

dataset = client.upload_csv(
    csv_file="./../SWE-bench.csv",
    input_keys=list(df.columns),
    output_keys=[],
    name="swe-bench-programmatic-upload",
    description="SWE-bench dataset",
    data_type="kv"
)
```

### Create dataset split for quicker testing

Since running the SWE-bench evaluator takes a long time when run on all examples, you can create a "test" split for quickly testing the evaluator and your code. Read [this guide](/langsmith/manage-datasets-in-application#create-and-manage-dataset-splits) to learn more about managing dataset splits, or watch this short video that shows how to do it (to get to the starting page of the video, just click on your dataset created above and go to the `Examples` tab):

[](/langsmith/images/creating-split.mp4)
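
Once you have created the split, you can verify it from the SDK by listing only the examples assigned to it. A quick sketch (the dataset ID shown is the one used later in this walkthrough; substitute your own):

```python
from langsmith import Client

client = Client()

# List only the examples assigned to the "test" split
test_examples = list(client.list_examples(
    dataset_id="a9bffcdf-1dfe-4aef-8805-8806f0110067",
    splits=["test"],
))
print(f"{len(test_examples)} examples in the 'test' split")
```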

## Running our prediction function

Running evaluation over SWE-bench works a little differently from most evals you will typically run on LangSmith, since we don't have a reference output. Because of this, we first generate all of our outputs without running an evaluator (note how the `evaluate` call doesn't have the `evaluators` parameter set). Here we use a dummy `predict` function, but you can insert your agent logic inside `predict` to make it work as intended.

```python
from langsmith import evaluate
from langsmith import Client

client = Client()

def predict(inputs: dict):
    # Dummy prediction; replace this with your agent's generated patch
    return {
        "instance_id": inputs['instance_id'],
        "model_patch": "None",
        "model_name_or_path": "test-model"
    }

result = evaluate(
    predict,
    data=client.list_examples(
        dataset_id="a9bffcdf-1dfe-4aef-8805-8806f0110067",
        splits=["test"]
    ),
)
```

View the evaluation results for experiment: 'perfect-lip-22' at: [https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/a9bffcdf-1dfe-4aef-8805-8806f0110067/compare?selectedSessions=182de5dc-fc9d-4065-a3e1-34527f952fd8](https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/a9bffcdf-1dfe-4aef-8805-8806f0110067/compare?selectedSessions=182de5dc-fc9d-4065-a3e1-34527f952fd8)

3it \[00:00, 24.48it/s]
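
When you move beyond the dummy setup, your `predict` function should return the patch your agent actually generates. A hypothetical sketch (the `generate_patch` helper is a placeholder for your own agent logic, not a real API; the input keys are columns from the SWE-bench dataset):

```python
def predict(inputs: dict):
    # Hypothetical agent call: produce a unified-diff patch for this instance
    patch = generate_patch(
        problem_statement=inputs["problem_statement"],
        repo=inputs["repo"],
        base_commit=inputs["base_commit"],
    )
    return {
        "instance_id": inputs["instance_id"],
        "model_patch": patch,                 # the proposed patch as a diff string
        "model_name_or_path": "my-agent-v1",  # used to locate the run logs later
    }
```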

## Evaluating our predictions using SWE-bench

Now we can run the following code to execute the predicted patches we generated above in Docker. This code is lightly edited from the `SWE-bench` [run\_evaluation.py](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/harness/run_evaluation.py) file.

In short, the code builds Docker images and runs the predictions in parallel, which greatly reduces the time needed for evaluation. The diagram below explains the basics of how `SWE-bench` does evaluation under the hood. To understand it in full, make sure to read through the code in the [GitHub repository](https://github.com/princeton-nlp/SWE-bench).

![Eval Diagram](/langsmith/images/swebench-evaluation.png)

The function `convert_runs_to_langsmith_feedback` converts the logs generated by the Docker containers into a single .json file that contains feedback in LangSmith's usual key/score format.

```python
from swebench.harness.run_evaluation import run_instances
import resource
import docker
from swebench.harness.docker_utils import list_images, clean_images
from swebench.harness.docker_build import build_env_images
from pathlib import Path
import json
import os

RUN_EVALUATION_LOG_DIR = Path("logs/run_evaluation")
LANGSMITH_EVALUATION_DIR = './langsmith_feedback/feedback.json'

def convert_runs_to_langsmith_feedback(
    predictions: dict,
    full_dataset: list,
    run_id: str
) -> None:
    """
    Convert logs from docker containers into LangSmith feedback.

    Args:
        predictions (dict): Predictions dict generated by the model
        full_dataset (list): List of all instances
        run_id (str): Run ID
    """
    feedback_for_all_instances = {}
    for instance in full_dataset:
        feedback_for_instance = []
        instance_id = instance['instance_id']
        prediction = predictions[instance_id]

        if prediction.get("model_patch", None) in ["", None]:
            # Prediction returned an empty patch
            feedback_for_all_instances[prediction['run_id']] = [
                {"key": "non-empty-patch", "score": 0},
                {"key": "completed-patch", "score": 0},
                {"key": "resolved-patch", "score": 0}
            ]
            continue

        feedback_for_instance.append({"key": "non-empty-patch", "score": 1})
        report_file = (
            RUN_EVALUATION_LOG_DIR
            / run_id
            / prediction["model_name_or_path"].replace("/", "__")
            / prediction['instance_id']
            / "report.json"
        )

        if report_file.exists():
            # If the report file exists, then the instance has been run
            feedback_for_instance.append({"key": "completed-patch", "score": 1})
            report = json.loads(report_file.read_text())
            # Check if the instance actually resolved the PR
            if report[instance_id]["resolved"]:
                feedback_for_instance.append({"key": "resolved-patch", "score": 1})
            else:
                feedback_for_instance.append({"key": "resolved-patch", "score": 0})
        else:
            # The instance did not run successfully
            feedback_for_instance += [
                {"key": "completed-patch", "score": 0},
                {"key": "resolved-patch", "score": 0}
            ]
        feedback_for_all_instances[prediction['run_id']] = feedback_for_instance

    os.makedirs(os.path.dirname(LANGSMITH_EVALUATION_DIR), exist_ok=True)
    with open(LANGSMITH_EVALUATION_DIR, 'w') as json_file:
        json.dump(feedback_for_all_instances, json_file)

def evaluate_predictions(
    dataset: list,
    predictions: dict,
    max_workers: int,
    force_rebuild: bool,
    cache_level: str,
    clean: bool,
    open_file_limit: int,
    run_id: str,
    timeout: int,
):
    """
    Run evaluation harness for the given dataset and predictions.
    """
    # set open file limit
    assert len(run_id) > 0, "Run ID must be provided"
    resource.setrlimit(resource.RLIMIT_NOFILE, (open_file_limit, open_file_limit))
    client = docker.from_env()
    existing_images = list_images(client)
    print(f"Running {len(dataset)} unevaluated instances...")

    # build environment images + run instances
    build_env_images(client, dataset, force_rebuild, max_workers)
    run_instances(predictions, dataset, cache_level, clean, force_rebuild, max_workers, run_id, timeout)

    # clean images + make final report
    clean_images(client, existing_images, cache_level, clean)
    convert_runs_to_langsmith_feedback(predictions, dataset, run_id)
```
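
For reference, the feedback file written by `convert_runs_to_langsmith_feedback` maps each LangSmith run ID to a list of key/score feedback dicts, roughly like this (the run ID below is made up purely for illustration):

```python
# Illustrative shape of ./langsmith_feedback/feedback.json (fabricated run ID)
example_feedback = {
    "00000000-0000-0000-0000-000000000000": [
        {"key": "non-empty-patch", "score": 1},
        {"key": "completed-patch", "score": 1},
        {"key": "resolved-patch", "score": 0},
    ]
}
```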

Next, we collect the predictions and the corresponding dataset inputs from the experiment results, and strip the `"version:"` prefix we added earlier so the harness sees the original version strings:

```python
dataset = []
predictions = {}

# Collect each run's outputs (keyed by instance_id) and its inputs
for res in result:
    predictions[res['run'].outputs['instance_id']] = {
        **res['run'].outputs,
        **{"run_id": str(res['run'].id)}
    }
    dataset.append(res['run'].inputs['inputs'])

# Remove the "version:" prefix added before uploading the CSV
for d in dataset:
    d['version'] = d['version'].split(":")[1]
```

With the predictions and dataset prepared, we can run the evaluation harness:

```python
evaluate_predictions(
    dataset,
    predictions,
    max_workers=8,
    force_rebuild=False,
    cache_level="env",
    clean=False,
    open_file_limit=4096,
    run_id="test",
    timeout=1_800
)
```

```bash
Running 3 unevaluated instances...
Base image sweb.base.arm64:latest already exists, skipping build.
Base images built successfully.
Total environment images to build: 2
Building environment images: 100%|██████████| 2/2 [00:47<00:00, 23.94s/it]
All environment images built successfully.
Running 3 instances...
0%|          | 0/3 [00:00<?, ?it/s]
Evaluation error for sqlfluff__sqlfluff-884: >>>>> Patch Apply Failed:
patch unexpectedly ends in middle of line
patch: **** Only garbage was found in the patch input.
Check (logs/run_evaluation/test/test-model/sqlfluff__sqlfluff-884/run_instance.log) for more information.
Evaluation error for sqlfluff__sqlfluff-4151: >>>>> Patch Apply Failed:
patch unexpectedly ends in middle of line
patch: **** Only garbage was found in the patch input.
Check (logs/run_evaluation/test/test-model/sqlfluff__sqlfluff-4151/run_instance.log) for more information.
Evaluation error for sqlfluff__sqlfluff-2849: >>>>> Patch Apply Failed:
patch: **** Only garbage was found in the patch input.
patch unexpectedly ends in middle of line
Check (logs/run_evaluation/test/test-model/sqlfluff__sqlfluff-2849/run_instance.log) for more information.
100%|██████████| 3/3 [00:30<00:00, 10.04s/it]
All instances run.
Cleaning cached images...
Removed 0 images.
```

## Sending evaluation to LangSmith

Now we can send our evaluation feedback to LangSmith using the `evaluate_existing` function. Our evaluator function is incredibly simple in this case, because the `convert_runs_to_langsmith_feedback` function above made our life easy by saving all the feedback to a single file.

```python
from langsmith import evaluate_existing
from langsmith.schemas import Example, Run

def swe_bench_evaluator(run: Run, example: Example):
    # Look up the feedback written by convert_runs_to_langsmith_feedback for this run
    with open(LANGSMITH_EVALUATION_DIR, 'r') as json_file:
        langsmith_eval = json.load(json_file)
    return {"results": langsmith_eval[str(run.id)]}

experiment_name = result.experiment_name
evaluate_existing(experiment_name, evaluators=[swe_bench_evaluator])
```

```bash
View the evaluation results for experiment: 'perfect-lip-22' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/a9bffcdf-1dfe-4aef-8805-8806f0110067/compare?selectedSessions=182de5dc-fc9d-4065-a3e1-34527f952fd8
3it [00:01, 1.52it/s]
<ExperimentResults perfect-lip-22>
```

After running, we can go to the experiments tab of our dataset and check that our feedback keys were properly assigned. If they were, you should see something that resembles the following image:

![LangSmith feedback](/langsmith/images/swebench-langsmith-feedback.png)
