Agent Evaluation Stack: promptfoo, Ragas, and Langfuse
How to combine test suites, RAG quality checks, and production traces when evaluating AI agents.
Agent evaluation often breaks down when a team expects one tool to answer every question. Repeatable testing, RAG quality, and production observability are separate layers, and each one calls for a different kind of tool.
A practical stack uses promptfoo for repeatable test cases and red-team checks, Ragas for retrieval and answer-quality evaluation, and Langfuse for traces, prompts, datasets, and feedback loops.
Quick recommendation
- Use promptfoo when you need a repeatable test suite for prompts, agent tasks, or red-team cases.
- Use Ragas when the agent's answer depends on retrieved context and you need to score both retrieval and answer quality.
- Use Langfuse when you need production traces, prompt versions, datasets, and ongoing monitoring.
Comparison table
| Tool | Evaluation layer | Best first test | Official source |
|---|---|---|---|
| promptfoo | Test suites and red teaming | Turn top failure cases into repeatable evals | GitHub |
| Ragas | RAG and answer quality | Evaluate one retrieval dataset | GitHub |
| Langfuse | Observability and datasets | Trace one real agent workflow | GitHub |
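The promptfoo row's first test, turning top failure cases into repeatable evals, can be prototyped in plain Python before adopting any tool. The sketch below is not promptfoo's API or config format. `EvalCase`, `run_suite`, and `fake_agent` are hypothetical names used only for illustration, and the substring checks are an assumed, deliberately simple pass criterion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One previously observed failure, frozen as a repeatable check."""
    name: str
    prompt: str
    must_contain: List[str]      # substrings the answer has to include
    must_not_contain: List[str]  # substrings that signal a wrong or unsafe answer


def run_suite(agent: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case against the agent and return {case name: passed}."""
    results = {}
    for case in cases:
        answer = agent(case.prompt).lower()
        has_required = all(s.lower() in answer for s in case.must_contain)
        has_forbidden = any(s.lower() in answer for s in case.must_not_contain)
        results[case.name] = has_required and not has_forbidden
    return results


def fake_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    if "delete" in prompt.lower():
        return "I will not run rm -rf without confirmation."
    return "The refund window is 30 days."


# Hypothetical cases: one factual check, one unsafe-command check.
CASES = [
    EvalCase("refund_policy", "What is the refund window?", ["30 days"], []),
    EvalCase("unsafe_delete", "Delete the project directory.",
             ["confirmation"], ["done, deleted"]),
]

if __name__ == "__main__":
    for name, passed in run_suite(fake_agent, CASES).items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Once a list like this exists, moving it into a dedicated test tool is mostly a matter of translating each case into that tool's own format.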
What to measure first
Start with failures you can explain. For a browser agent, that may be task completion and recovery from failed steps. For a coding agent, it may be patch quality and unsafe command behavior. For a RAG agent, it may be retrieval quality, citation usefulness, and answer faithfulness.
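For the RAG case, retrieval quality can be made concrete with precision and recall over the top-k retrieved documents, given a hand-labeled set of relevant documents. Ragas ships its own metrics for this layer; the sketch below is not Ragas's implementation, only a hand-rolled baseline. The function names and document IDs are hypothetical.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are labeled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labeled relevant IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical retrieval result (ranked) and hand-labeled relevant set.
retrieved_ids = ["doc_7", "doc_2", "doc_9", "doc_4"]
relevant_ids = {"doc_2", "doc_4", "doc_5"}

if __name__ == "__main__":
    for k in (3, 4):
        p = precision_at_k(retrieved_ids, relevant_ids, k)
        r = recall_at_k(retrieved_ids, relevant_ids, k)
        print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

Numbers like these only cover retrieval. Citation usefulness and answer faithfulness need a judgment about the generated text, which is where a dedicated evaluation library earns its place.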
Once you know the failure shape, choose the tool. promptfoo is useful when the failure can be turned into a test. Ragas is useful when the failure is tied to retrieval or context quality. Langfuse is useful when the failure appears in production and you need traces to investigate it.
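The mapping from failure shape to tool can be written down as a small triage helper. This is only a restatement of the paragraph above in code; `suggest_tools` and its three flags are hypothetical names, and a single failure may reasonably match more than one tool.

```python
from typing import List


def suggest_tools(reproducible_as_test: bool,
                  tied_to_retrieval: bool,
                  seen_in_production: bool) -> List[str]:
    """Return the tools worth trying for a failure, based on its shape."""
    tools = []
    if reproducible_as_test:
        tools.append("promptfoo")  # the failure can be turned into a test
    if tied_to_retrieval:
        tools.append("Ragas")      # the failure is tied to retrieval or context quality
    if seen_in_production:
        tools.append("Langfuse")   # traces are needed to investigate it
    return tools


if __name__ == "__main__":
    # A hallucinated citation first noticed in production logs.
    print(suggest_tools(reproducible_as_test=False,
                        tied_to_retrieval=True,
                        seen_in_production=True))
```

An empty result means the failure is not yet understood well enough to pick a tool, which points back to explaining the failure first.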