- Create a file named `.env` in the project directory.
- Add your OpenAI API key to the `.env` file: `OPENAI_API_KEY=<your_api_key>`
- Prepare the input data in the format of `example_data.jsonl`, including the input query, the list of available tools, and the expected function-call answers.
- Run the script: `python run_chat.py`
- Prepare your model's predictions along with the reference answers in the format of `results.jsonl` (a minimal sketch of one such record follows these steps).
- Run the script: `python run_evaluation.py`
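The snippet below is a rough illustration of what one `results.jsonl` line might contain: the reference answers paired with the function calls a model actually produced. The field names (`answers`, `predictions`) and the record layout are assumptions for illustration only; the authoritative schema is the `results.jsonl` example shipped with the repository.

```python
import json

# Hypothetical results.jsonl record: the "answers"/"predictions" field names
# are illustrative assumptions, not the repository's actual schema.
record = {
    # Reference function calls (taken from the data example shown below)
    "answers": [
        {"name": "track_crosschain_message", "arguments": {"message_id": "msg12345"}},
        {"name": "schedule_timeout_check", "arguments": {"message_id": "msg12345", "timeout": "30"}},
    ],
    # Function calls produced by the model under evaluation
    "predictions": [
        {"name": "track_crosschain_message", "arguments": {"message_id": "msg12345"}},
        {"name": "schedule_timeout_check", "arguments": {"message_id": "msg12345", "timeout": "30"}},
    ],
}

# JSONL: one JSON object per line.
with open("results.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```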
This report presents an evaluation of various language models on a function call dataset related to Block and Web3. The dataset comprises 187 test samples, and the evaluation focuses on Exact Match Accuracy:
- Exact Match Accuracy (`exact_match_acc`): measures whether the generated function calls exactly match the reference answers, including call order and arguments.
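As a minimal sketch of how this metric can be computed, assuming each evaluated sample exposes the reference calls and the predicted calls as lists of `{name, arguments}` objects (the actual `run_evaluation.py` may differ in details):

```python
def calls_match(predicted, reference):
    """Exact match: same number of calls, in the same order, with identical names and arguments."""
    if len(predicted) != len(reference):
        return False
    return all(
        p.get("name") == r.get("name") and p.get("arguments") == r.get("arguments")
        for p, r in zip(predicted, reference)
    )


def exact_match_acc(samples):
    """Fraction of samples whose predicted call sequence exactly matches the reference."""
    if not samples:
        return 0.0
    return sum(calls_match(s["predictions"], s["answers"]) for s in samples) / len(samples)
```

Under this definition, a sample counts as correct only if every call, its position in the sequence, and all of its arguments match the reference exactly.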
Each data instance includes a query and a list of available tools. The model must generate function calls using the provided tools to correctly respond to the query.
```json
{
  "query": "Track crosschain message verification, implement timeout recovery procedures.",
  "answers": [
    {"name": "track_crosschain_message", "arguments": {"message_id": "msg12345"}},
    {"name": "schedule_timeout_check", "arguments": {"message_id": "msg12345", "timeout": "30"}}
  ],
  "tools": [
    {"type": "function", "function": {"name": "track_crosschain_message", "description": "Track the status of a crosschain message", "parameters": {"type": "object", "properties": {"message_id": {"type": "string"}}}}},
    {"type": "function", "function": {"name": "schedule_timeout_check", "description": "Schedule a timeout check for a message", "parameters": {"type": "object", "properties": {"message_id": {"type": "string"}, "timeout": {"type": "integer"}}}}}
  ]
}
```

We evaluated multiple models using different inference methods (a minimal tool-calling sketch follows the list of models):
- GPT-4o
- GPT-4o-mini
- Qwen2.5-7B-Instruct
- DeepSeek-v3
- Gemini-1.5-flash
- Gemini-2.0-flash
- Ours: Qwen2.5-7B-Instruct fine-tuned on our Block/Web3 training dataset
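As a rough sketch of how the API-based models can be queried with the tools from a data instance (roughly what `run_chat.py` does; the file handling, model choice, and response parsing here are simplified assumptions rather than the actual script):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment (e.g. loaded from .env)

# Read the first data instance: a query plus tool schemas in OpenAI's tools format.
with open("example_data.jsonl", encoding="utf-8") as f:
    sample = json.loads(f.readline())

response = client.chat.completions.create(
    model="gpt-4o-mini",                     # any of the API models listed above
    messages=[{"role": "user", "content": sample["query"]}],
    tools=sample["tools"],                   # tool schemas come straight from the dataset
    tool_choice="auto",
)

# Collect the predicted function calls in the same {name, arguments} form as the answers.
predictions = [
    {"name": call.function.name, "arguments": json.loads(call.function.arguments)}
    for call in (response.choices[0].message.tool_calls or [])
]
print(predictions)
```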
| Model | Exact Match Accuracy (Block/Web3) |
|---|---|
| Proprietary Models | |
| GPT-4o | 0.4598 |
| GPT-4o-mini | 0.3529 |
| Gemini-1.5-flash | 0.4438 |
| Gemini-2.0-flash | 0.3957 |
| Open-Source Models | |
| DeepSeek-v3 | 0.2887 |
| Qwen2.5-7B-Instruct | 0.3100 |
| Ours | 0.7593 |
To validate model robustness, we also evaluated the models on the Berkeley Function-Calling Leaderboard (BFCL). The fine-tuned Qwen2.5 model demonstrated competitive performance; results on the BFCL-v3-simple and BFCL-v3-parallel-multi categories are shown in the two tables below.
| Model | Exact Match Accuracy (BFCL-v3-simple) |
|---|---|
| Proprietary Models | |
| GPT-4o | 0.9925 |
| GPT-4o-mini | 0.9974 |
| Gemini-1.5-flash | 0.9975 |
| Gemini-2.0-flash | 0.9938 |
| Open-Source Models | |
| DeepSeek-v3 | 0.9450 |
| Qwen2.5-7B-Instruct | 0.9725 |
| Ours | 0.9950 |
| Model | Exact Match Accuracy (BFCL-v3-parallel-multi) |
|---|---|
| Proprietary Models | |
| GPT-4o | 0.9393 |
| GPT-4o-mini | 0.9343 |
| Gemini-1.5-flash | 0.9251 |
| Gemini-2.0-flash | 0.9161 |
| Open-Source Models | |
| DeepSeek-v3 | 0.9300 |
| Qwen2.5-7B-Instruct | 0.7700 |
| Ours | 0.8900 |
- Our fine-tuned Qwen2.5 significantly outperforms other open-source models in the Block/Web3 domain, achieving an exact match accuracy of 0.7593, a remarkable improvement over the pre-trained version (0.3100).
- Our model maintains high performance in general function-calling tasks, achieving 0.9950 on the BFCL-v3-simple benchmark and 0.8900 on BFCL-v3-parallel-multi, making it a competitive alternative to proprietary models.
- Proprietary models (GPT-4o, Gemini-1.5-flash) continue to dominate general benchmarks, but our fine-tuned model closes the gap while excelling in domain-specific tasks.
- Fine-tuning Qwen2.5 on a domain-specific dataset led to a 145% improvement in exact match accuracy in the Block/Web3 domain.
- The model sustains high accuracy in general function-calling tasks, making it both versatile and specialized.
- Open-source models like DeepSeek-v3 struggle with exact function-sequence generation, highlighting the need for training on better-structured function-calling data.
- Proprietary models still lead in general function-calling tasks, suggesting that large-scale pretraining and reinforcement tuning play a crucial role.
This benchmark demonstrates that fine-tuning is highly effective in improving function-calling accuracy for Block and Web3 tasks. Our model not only excels in domain-specific applications but also maintains strong performance in general function-calling tasks, presenting a viable alternative to proprietary solutions.