- Create a file named `.env` in the project directory.
- Add your OpenAI API key to the `.env` file: `OPENAI_API_KEY=<your_api_key>`
- Prepare the input data in the format of `example_data.jsonl`, including the input query, the list of available tools, and the expected function-call answers.
- Run the script: `python run_chat.py`
- Prepare your model's predictions along with the reference answers in the format of `results.jsonl` (a minimal sketch of one such record follows these steps).
- Run the script: `python run_evaluation.py`
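The snippet below is a rough illustration of what one `results.jsonl` line might contain: the reference answers paired with the function calls a model actually produced. The field names (`answers`, `predictions`) and the record layout are assumptions for illustration only; the authoritative schema is the `results.jsonl` example shipped with the repository.

```python
import json

# Hypothetical results.jsonl record: the "answers"/"predictions" field names
# are illustrative assumptions, not the repository's actual schema.
record = {
    # Reference function calls (taken from the data example shown below)
    "answers": [
        {"name": "track_crosschain_message", "arguments": {"message_id": "msg12345"}},
        {"name": "schedule_timeout_check", "arguments": {"message_id": "msg12345", "timeout": "30"}},
    ],
    # Function calls produced by the model under evaluation
    "predictions": [
        {"name": "track_crosschain_message", "arguments": {"message_id": "msg12345"}},
        {"name": "schedule_timeout_check", "arguments": {"message_id": "msg12345", "timeout": "30"}},
    ],
}

# JSONL: one JSON object per line.
with open("results.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```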
This report presents an evaluation of various language models on a function call dataset related to Block and Web3. The dataset comprises 187 test samples, and the evaluation focuses on Exact Match Accuracy:
- Exact Match Accuracy (`exact_match_acc`): measures whether the generated function calls exactly match the reference answers, including call order and arguments.
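As a minimal sketch of how this metric can be computed, assuming each evaluated sample exposes the reference calls and the predicted calls as lists of `{name, arguments}` objects (the actual `run_evaluation.py` may differ in details):

```python
def calls_match(predicted, reference):
    """Exact match: same number of calls, in the same order, with identical names and arguments."""
    if len(predicted) != len(reference):
        return False
    return all(
        p.get("name") == r.get("name") and p.get("arguments") == r.get("arguments")
        for p, r in zip(predicted, reference)
    )


def exact_match_acc(samples):
    """Fraction of samples whose predicted call sequence exactly matches the reference."""
    if not samples:
        return 0.0
    return sum(calls_match(s["predictions"], s["answers"]) for s in samples) / len(samples)
```

Under this definition, a sample counts as correct only if every call, its position in the sequence, and all of its arguments match the reference exactly.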
Each data instance includes a query and a list of available tools. The model must generate function calls using the provided tools to correctly respond to the query.
```json
{
  "query": "Track crosschain message verification, implement timeout recovery procedures.",
  "answers": [
    {"name": "track_crosschain_message", "arguments": {"message_id": "msg12345"}},
    {"name": "schedule_timeout_check", "arguments": {"message_id": "msg12345", "timeout": "30"}}
  ],
  "tools": [
    {"type": "function", "function": {"name": "track_crosschain_message", "description": "Track the status of a crosschain message", "parameters": {"type": "object", "properties": {"message_id": {"type": "string"}}}}},
    {"type": "function", "function": {"name": "schedule_timeout_check", "description": "Schedule a timeout check for a message", "parameters": {"type": "object", "properties": {"message_id": {"type": "string"}, "timeout": {"type": "integer"}}}}}
  ]
}
```

We evaluated multiple models using different inference methods (a minimal tool-calling sketch follows the list of models):
- GPT-4o
- GPT-4o-mini
- Qwen2.5-7B-Instruct
- DeepSeek-v3
- Gemini-1.5-flash
- Gemini-2.0-flash
- Ours: Qwen2.5-7B-Instruct fine-tuned on our Block/Web3 training dataset
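As a rough sketch of how the API-based models can be queried with the tools from a data instance (roughly what `run_chat.py` does; the file handling, model choice, and response parsing here are simplified assumptions rather than the actual script):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment (e.g. loaded from .env)

# Read the first data instance: a query plus tool schemas in OpenAI's tools format.
with open("example_data.jsonl", encoding="utf-8") as f:
    sample = json.loads(f.readline())

response = client.chat.completions.create(
    model="gpt-4o-mini",                     # any of the API models listed above
    messages=[{"role": "user", "content": sample["query"]}],
    tools=sample["tools"],                   # tool schemas come straight from the dataset
    tool_choice="auto",
)

# Collect the predicted function calls in the same {name, arguments} form as the answers.
predictions = [
    {"name": call.function.name, "arguments": json.loads(call.function.arguments)}
    for call in (response.choices[0].message.tool_calls or [])
]
print(predictions)
```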
| Model | Exact Match Accuracy (Block/Web3) |
|---|---|
| Proprietary Models | |
| GPT-4o | 0.4598 |
| GPT-4o-mini | 0.3529 |
| Gemini-1.5-flash | 0.4438 |
| Gemini-2.0-flash | 0.3957 |
| Open-Source Models | |
| DeepSeek-v3 | 0.2887 |
| Qwen2.5-7B-Instruct | 0.3100 |
| Ours | 0.7593 |
To validate model robustness, we also evaluated the models on the Berkeley Function-Calling Leaderboard (BFCL). The fine-tuned Qwen2.5 model demonstrated competitive performance; results on the BFCL-v3-simple and BFCL-v3-parallel-multi categories are shown in the two tables below.
| Model | Exact Match Accuracy (BFCL-v3-simple) |
|---|---|
| Proprietary Models | |
| GPT-4o | 0.9925 |
| GPT-4o-mini | 0.9974 |
| Gemini-1.5-flash | 0.9975 |
| Gemini-2.0-flash | 0.9938 |
| Open-Source Models | |
| DeepSeek-v3 | 0.9450 |
| Qwen2.5-7B-Instruct | 0.9725 |
| Ours | 0.9950 |
| Model | Exact Match Accuracy (BFCL-v3-parallel-multi) |
|---|---|
| Proprietary Models | |
| GPT-4o | 0.9393 |
| GPT-4o-mini | 0.9343 |
| Gemini-1.5-flash | 0.9251 |
| Gemini-2.0-flash | 0.9161 |
| Open-Source Models | |
| DeepSeek-v3 | 0.9300 |
| Qwen2.5-7B-Instruct | 0.7700 |
| Ours | 0.8900 |
- Our fine-tuned Qwen2.5 significantly outperforms other open-source models in the Block/Web3 domain, achieving an exact match accuracy of 0.7593, a remarkable improvement over the pre-trained version (0.3100).
- Our model maintains high performance in general function-calling tasks, achieving 0.9950 on the BFCL-v3-simple benchmark and 0.8900 on BFCL-v3-parallel-multi, making it a competitive alternative to proprietary models.
- Proprietary models (GPT-4o, Gemini-1.5-flash) continue to dominate general benchmarks, but our fine-tuned model closes the gap while excelling in domain-specific tasks.
- Fine-tuning Qwen2.5 on a domain-specific dataset led to a 145% improvement in exact match accuracy in the Block/Web3 domain.
- The model sustains high accuracy in general function-calling tasks, making it both versatile and specialized.
- Open-source models like DeepSeek-v3 struggle with exact function-sequence generation, highlighting the need for training on better-structured function-calling data.
- Proprietary models still lead in general function-calling tasks, suggesting that large-scale pretraining and reinforcement tuning play a crucial role.
This benchmark demonstrates that fine-tuning is highly effective in improving function-calling accuracy for Block and Web3 tasks. Our model not only excels in domain-specific applications but also maintains strong performance in general function-calling tasks, presenting a viable alternative to proprietary solutions.