Agent Evaluation
Test tool-using agents systematically.
The Problem
Your LLM agent uses tools (search, calculator, database). How do you test it systematically?
Traditional probes test text generation. Agents need tool execution testing.
The Solution
AgentProbe with trace integration.
from insideLLMs.probes import AgentProbe, ToolDefinition
# Define available tools
tools = [
ToolDefinition(
name="calculator",
description="Perform mathematical calculations",
parameters={"expression": "string"}
),
ToolDefinition(
name="search",
description="Search the web",
parameters={"query": "string"}
)
]
# Create agent probe
probe = AgentProbe(tools=tools)
# Test agent
result = probe.run(model, {
"task": "What is 15% of the GDP of France?",
"expected_tools": ["search", "calculator"]
})
print(result["tools_used"]) # ["search", "calculator"]
print(result["correct_sequence"]) # True
print(result["final_answer"]) # "..."
Tool Definition
from insideLLMs.probes import ToolDefinition
calculator = ToolDefinition(
name="calculator",
description="Evaluate mathematical expressions",
parameters={
"expression": {
"type": "string",
"description": "Math expression to evaluate"
}
},
implementation=lambda expr: eval(expr) # In production, use safe eval
)
Testing Tool Selection
# Test if agent chooses correct tools
test_cases = [
{
"task": "What is 25 * 17?",
"expected_tools": ["calculator"],
"expected_sequence": ["calculator"]
},
{
"task": "Who won the 2024 Olympics 100m?",
"expected_tools": ["search"],
"expected_sequence": ["search"]
},
{
"task": "What is the population of Tokyo divided by 2?",
"expected_tools": ["search", "calculator"],
"expected_sequence": ["search", "calculator"]
}
]
for case in test_cases:
result = probe.run(model, case)
assert result["tools_used"] == case["expected_tools"]
Trace Integration
AgentProbe automatically captures tool execution traces:
{
"task": "What is 15% of France's GDP?",
"trace": {
"steps": [
{
"step": 1,
"tool": "search",
"input": {"query": "France GDP 2024"},
"output": "France GDP: $3.05 trillion"
},
{
"step": 2,
"tool": "calculator",
"input": {"expression": "3.05 * 0.15"},
"output": "0.4575"
}
],
"final_answer": "Approximately $457.5 billion"
}
}
Multi-Step Agent Testing
# Test complex multi-step workflows
result = probe.run(model, {
"task": "Book a flight from London to Paris, then find a hotel",
"expected_tools": ["flight_search", "flight_booking", "hotel_search"],
"max_steps": 5
})
# Verify execution
assert len(result["trace"]["steps"]) <= 5
assert "flight_booking" in result["tools_used"]
assert result["task_completed"]
Scoring Agent Performance
from insideLLMs.probes import AgentProbe
probe = AgentProbe(tools=tools)
# Run on dataset
results = probe.run_batch(model, agent_test_dataset)
# Score
score = probe.score(results)
print(f"Tool selection accuracy: {score.value:.1%}")
print(f"Correct sequences: {score.details['correct_sequences']}")
print(f"Average steps: {score.details['avg_steps']}")
Configuration
# In harness config
probes:
- type: agent
args:
tools:
- name: calculator
description: Evaluate math expressions
- name: search
description: Search the web
max_steps: 5
trace_enabled: true
Real-World Example
# Test customer service agent
tools = [
ToolDefinition(name="check_order", ...),
ToolDefinition(name="process_refund", ...),
ToolDefinition(name="send_email", ...)
]
probe = AgentProbe(tools=tools)
test_cases = [
{
"task": "Customer wants to return order #12345",
"expected_tools": ["check_order", "process_refund", "send_email"],
"expected_outcome": "refund_processed"
}
]
results = probe.run_batch(model, test_cases)
# Verify agent behaviour
for result in results:
if not result["task_completed"]:
print(f"Failed: {result['task']}")
print(f"Tools used: {result['tools_used']}")
print(f"Expected: {result['expected_tools']}")
Why This Matters
Without AgentProbe:
- Manual testing of tool-using agents
- No systematic verification
- Can’t track tool selection patterns
- Difficult to catch regressions
With AgentProbe:
- Systematic agent testing
- Trace tool execution
- Verify correct tool selection
- Catch agent behaviour regressions
See Also
- Probes Catalog - Other probe types
- Custom Probe Tutorial - Build custom probes
- Tracing and Fingerprinting - Trace details