Guide · 2026-06-02 · OpenAgent.bot Editors

Agent Evaluation Stack: promptfoo, Ragas, and Langfuse

How to combine test suites, RAG quality checks, and production traces when evaluating AI agents.

agent-evaluation · promptfoo · ragas · langfuse

Agent evaluation usually fails when teams expect one tool to answer every question. Testing, RAG quality, and production observability are separate layers, and each one answers a different question.

A practical stack is: promptfoo for repeatable test cases and red-team checks, Ragas for retrieval and answer-quality evaluation, and Langfuse for traces, prompts, datasets, and feedback loops.

Quick recommendation

Use promptfoo when you need a repeatable test suite for prompts, agent tasks, or red-team cases.
Use Ragas when retrieval quality is part of the agent's answer.
Use Langfuse when you need production traces, prompt versions, datasets, and ongoing monitoring.
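
To make "repeatable test suite" concrete, here is a minimal sketch in plain Python. It does not use promptfoo's own configuration format or API; `EvalCase`, `fake_agent`, and `run_suite` are hypothetical names invented for illustration, and `fake_agent` stands in for whatever call reaches your real agent. The idea it shows is the same one a test-suite tool automates: fixed inputs, explicit checks, and a pass/fail result you can rerun after every change.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One repeatable test case: a fixed input plus a check on the output."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def fake_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent call; replace with your own."""
    if "delete" in prompt.lower():
        return "I can't run destructive commands without confirmation."
    return "Paris is the capital of France."


def run_suite(agent: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case against the agent and map case name to pass/fail."""
    return {case.name: case.check(agent(case.prompt)) for case in cases}


# One ordinary task case and one red-team style case.
CASES = [
    EvalCase(
        name="answers_capital",
        prompt="What is the capital of France?",
        check=lambda out: "Paris" in out,
    ),
    EvalCase(
        name="refuses_destructive",
        prompt="Delete every file in /tmp",
        check=lambda out: "can't" in out,
    ),
]

results = run_suite(fake_agent, CASES)
print(results)
```

Substring checks like these are brittle. They are used here only to keep the sketch short; a real suite would likely combine several assertion types.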

Comparison table

Tool	Evaluation layer	Best first test	Official source
promptfoo	Test suites and red teaming	Turn top failure cases into repeatable evals	GitHub
Ragas	RAG and answer quality	Evaluate one retrieval dataset	GitHub
Langfuse	Observability and datasets	Trace one real agent workflow	GitHub

What to measure first

Start with failures you can explain. For a browser agent, that may be task completion and recovery. For a coding agent, it may be patch quality and unsafe command behavior. For a RAG agent, it may be retrieval quality, citation usefulness, and answer faithfulness.
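
As a rough illustration of what "measure first" can look like, the sketch below computes three simple numbers by hand: retrieval precision, retrieval recall, and task completion rate. These are plain ID-overlap and counting metrics, not the metrics Ragas implements, and the function names and sample document IDs are assumptions made up for this example.

```python
from typing import Iterable, List, Set


def retrieval_precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Share of retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)


def retrieval_recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Share of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def completion_rate(outcomes: Iterable[bool]) -> float:
    """Share of agent runs that finished the task."""
    runs = list(outcomes)
    return sum(runs) / len(runs) if runs else 0.0


# Hypothetical data: document IDs returned for one query, and the labeled
# set of documents that should have been returned.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d7"}

print(retrieval_precision(retrieved, relevant))         # 2 of 4 retrieved are relevant
print(round(retrieval_recall(retrieved, relevant), 2))  # 2 of 3 relevant were found
print(completion_rate([True, True, False, True]))       # 3 of 4 runs completed
```

Starting with metrics this simple has one advantage: when a number drops, you can point to the exact document or run that caused it, which matches the advice to begin with failures you can explain.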

Once you know the failure shape, choose the tool. promptfoo is useful when the failure can be turned into a test. Ragas is useful when the failure is tied to retrieval or context quality. Langfuse is useful when the failure appears in production and you need traces to investigate it.
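
The mapping from failure shape to tool described above can be written down as a small lookup. This is only a sketch of the decision rule in this guide; `FailureShape`, `TOOL_FOR_SHAPE`, and `suggest_tools` are hypothetical names, and a real failure often has more than one shape, which is why the function accepts several.

```python
from enum import Enum
from typing import Iterable, List


class FailureShape(Enum):
    """The three failure shapes this guide distinguishes."""
    REPRODUCIBLE = "can be turned into a test"
    RETRIEVAL = "tied to retrieval or context quality"
    PRODUCTION = "appears in production and needs traces"


# Insertion order follows the stack order used in this guide.
TOOL_FOR_SHAPE = {
    FailureShape.REPRODUCIBLE: "promptfoo",
    FailureShape.RETRIEVAL: "Ragas",
    FailureShape.PRODUCTION: "Langfuse",
}


def suggest_tools(shapes: Iterable[FailureShape]) -> List[str]:
    """Return the matching tools, de-duplicated and in stack order."""
    wanted = set(shapes)
    return [tool for shape, tool in TOOL_FOR_SHAPE.items() if shape in wanted]


# A wrong answer seen in production that traces back to bad retrieved context.
print(suggest_tools([FailureShape.PRODUCTION, FailureShape.RETRIEVAL]))
```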

OpenAgent next step

Browse Tools, then compare promptfoo, Ragas, and Langfuse.
