LLM Function Calling: Evaluating Tool Calls In LLM Pipelines
Function calling is an essential part of any AI engineer’s toolkit, letting builders enhance a model’s utility on specific tasks. As more LLM applications that leverage tool calls are deployed to production, effectively evaluating their performance within those pipelines becomes increasingly critical.
What Is Function Calling In AI?
First launched by OpenAI in the summer of 2023, function calling allows you to include function signatures in your API calls, connecting major LLMs – like OpenAI’s GPT-4o, Anthropic’s Claude, Google Gemini, and others – with external tools and APIs. This lets the foundation model perform specific tasks, like retrieving real-time data or executing custom functions defined by users. The capability is critical when developing agentic applications, but it brings its own host of complications when it comes to evaluating your pipeline.
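To make this concrete, here is a minimal sketch of a function-calling request using the OpenAI Python SDK. The `get_weather` tool and its parameters are hypothetical stand-ins for whatever functions your application actually exposes.

```python
from openai import OpenAI

client = OpenAI()

# A hypothetical tool definition: the model never runs this code itself,
# it only decides whether to call it and with which arguments.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
)

# If the model decided to call a tool, the arguments arrive as a JSON string.
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(tool_calls[0].function.name, tool_calls[0].function.arguments)
```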
How To Evaluate Function Calls
Evaluating function calls means examining each step of the process – deciding whether a function should be called at all, extracting the right parameters from the input, generating the function call itself, and then generating the final response – to understand where things go wrong.
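As a simple illustration, the sketch below scores a single example on each of those steps separately. The expected values and the parsed tool call are hypothetical; in practice you would pull them from your traces or a labeled dataset.

```python
import json

# Hypothetical ground truth for one test case.
expected = {
    "should_call": True,
    "function_name": "get_weather",
    "arguments": {"city": "Paris"},
}

# Hypothetical tool call parsed from the model's response.
actual = {
    "should_call": True,
    "function_name": "get_weather",
    "arguments_json": '{"city": "paris"}',
}

def evaluate_tool_call(expected, actual):
    """Score routing, function generation, and parameter extraction separately."""
    scores = {}
    # Routing: did the model decide to call the right function (or correctly abstain)?
    scores["routing"] = (
        actual["should_call"] == expected["should_call"]
        and actual["function_name"] == expected["function_name"]
    )
    # Function generation: is the generated call even well-formed?
    try:
        arguments = json.loads(actual["arguments_json"])
        scores["well_formed"] = True
    except json.JSONDecodeError:
        arguments = {}
        scores["well_formed"] = False
    # Parameter extraction: do the extracted arguments match (case-insensitively here)?
    normalize = lambda d: {k.lower(): str(v).lower() for k, v in d.items()}
    scores["parameters"] = normalize(arguments) == normalize(expected["arguments"])
    return scores

print(evaluate_tool_call(expected, actual))
# e.g. {'routing': True, 'well_formed': True, 'parameters': True}
```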
Are There Open Source Options for Function Calling Evaluation?
Phoenix: Open Source Function Calling Eval
Phoenix – an open source library for tracing and evaluation – now offers a new built-in evaluator that measures the performance of function or tool calling within major LLMs.
The new evaluation checks for error cases across the three areas where function calling typically goes wrong:
- 🗺️ Routing: deciding which function to call
- 🔢 Parameter extraction: pulling out the right parameters from the question
- 🧑‍💻 Function generation: actually generating the function code
It can be used within an experiment or on its own.
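For instance, here is a minimal sketch of running the evaluator over a dataframe of traced tool calls with `phoenix.evals` and `llm_classify`. The template and rails-map names, the expected column names, and the example rows are assumptions modeled on Phoenix’s other built-in evals, so check the Phoenix docs for the exact interface.

```python
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    TOOL_CALLING_PROMPT_RAILS_MAP,  # assumed export name for the built-in rails
    TOOL_CALLING_PROMPT_TEMPLATE,   # assumed export name for the built-in template
    llm_classify,
)

# Hypothetical traces: each row pairs a user question with the tool call the
# model produced and the tool definitions it had available. Column names must
# match the variables expected by the template you use.
df = pd.DataFrame(
    {
        "question": ["What's the weather in Paris right now?"],
        "tool_call": ['get_weather(city="Paris")'],
        "tool_definitions": ["get_weather(city: str) -> current weather for a city"],
    }
)

rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())

eval_results = llm_classify(
    dataframe=df,
    template=TOOL_CALLING_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=rails,
    provide_explanation=True,  # include the judge's reasoning alongside each label
)

print(eval_results[["label", "explanation"]])
```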
Example: Tool Calling Eval
Check out our new 📓 notebook on tool calling evals 📓! It walks through evaluating the multi-step LLM logic involved in tool calling and building an experimental framework to iterate on and improve the default evaluation template.
Wrapping Up
Questions? Feel free to reach out in the Phoenix community.