# DeepEval

Open-source LLM evaluation framework for testing RAG pipelines, agent workflows, and LLM outputs with metrics and CI/CD integration.

## Agent Decision Summary
- Risk level: low
- Source confidence: high
- Recommended workflows: Coding agent workflow, Evaluation and observability, Local or private AI stack, Memory or RAG workflow
- Permission surface: memory
- Agent JSON: https://www.openagent.bot/tools/deepeval.agent.json

## Summary
DeepEval is an MIT-licensed LLM evaluation framework. It provides over 15 built-in metrics covering RAG pipelines, agentic workflows, retrieval quality, hallucination detection, and conversation safety, and it integrates with Pytest for CI/CD.


## Guide

### FAQ
- What is DeepEval?
  - DeepEval is an open-source LLM evaluation framework that provides over 15 built-in metrics for testing RAG pipelines, agent workflows, and LLM outputs, with native Pytest integration for CI/CD.
- Is DeepEval free?
  - Yes, DeepEval is MIT-licensed and completely free to use. There is also a managed platform for team collaboration and test result visualization.
- What metrics does DeepEval support?
  - DeepEval includes more than 15 evaluation metrics, among them faithfulness, answer relevancy, hallucination, bias, toxicity, G-Eval, summarization, precision, and recall.
- How does DeepEval compare to promptfoo?
  - Both are evaluation tools with different approaches. DeepEval focuses on Pytest-integrated metric-based evaluation for RAG and agents, while promptfoo emphasizes prompt testing and red-teaming with a declarative config approach.
- Can DeepEval be used in CI/CD?
  - Yes. DeepEval integrates natively with Pytest, so tests run as standard Pytest suites in any CI/CD pipeline. It also integrates with GitHub Actions, Jenkins, and CircleCI.
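
The Pytest-based workflow described in the FAQ can be pictured with a minimal, standard-library sketch. It does not use DeepEval's API: `ToyTestCase`, `toy_relevancy`, and `THRESHOLD` are hypothetical names invented for illustration, and the word-overlap score is only a crude stand-in for a real metric. What it shows is the general pattern of scoring one test case and failing the test when the score falls below a threshold.

```python
from dataclasses import dataclass


@dataclass
class ToyTestCase:
    """Hypothetical container for one evaluation example (not a DeepEval class)."""
    input: str
    actual_output: str


def toy_relevancy(case: ToyTestCase) -> float:
    """Fraction of distinct input words that reappear in the output.

    A crude stand-in for a real relevancy metric, used only to make
    the threshold pattern concrete.
    """
    question_words = set(case.input.lower().split())
    answer_words = set(case.actual_output.lower().split())
    if not question_words:
        return 0.0
    return len(question_words & answer_words) / len(question_words)


THRESHOLD = 0.5  # assumed pass mark, chosen for illustration


def test_refund_answer_is_relevant():
    # One fixed, repeatable test case: the same input always yields the same score.
    case = ToyTestCase(
        input="what is the refund policy",
        actual_output="the refund policy is 30 days",
    )
    score = toy_relevancy(case)
    # Pytest reports a failure (and CI exits non-zero) if the score drops too low.
    assert score >= THRESHOLD, f"relevancy {score:.2f} below {THRESHOLD}"
```

Saved under a file name such as `test_relevancy_sketch.py` (hypothetical), this runs with plain `pytest`, and the non-zero exit code on failure is what lets a CI job block a regression. The actual test case and metric classes are described in the DeepEval documentation.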

## What It Does
DeepEval evaluates LLM applications against built-in metrics. It ships more than 15 of them, covering RAG pipelines, agentic workflows, retrieval quality, hallucination detection, and conversation safety, and it runs through Pytest so evaluations can be part of CI/CD. The framework is MIT-licensed.

## How To Evaluate
Evaluate DeepEval by starting from the official sources: check its repository and documentation, then run one narrow workflow before expanding scope.

## Why It Matters
Teams shipping agent applications need systematic evaluation pipelines, not ad-hoc testing. DeepEval gives builders a practical way to test LLM outputs, RAG retrieval quality, and agent behavior with familiar Pytest workflows.


## Best For
- Teams evaluating RAG pipeline quality
- Developers adding LLM evaluation to CI/CD
- Builders testing agent behavior and conversation safety

## Not For
- Teams that need only production monitoring
- Users who want a single benchmark score without custom test cases

## Fit Matrix
- Coding agent workflow: strong. DeepEval has multiple signals for coding agent workflow, including matching tags, capabilities, category, or positioning. Required check: Run a small repository change and inspect the diff, tests, and rollback path.
- Evaluation and observability: strong. DeepEval has multiple signals for evaluation and observability, including matching tags, capabilities, category, or positioning. Required check: Add one repeatable test case and confirm results can run again in review or CI.
- Local or private AI stack: strong. DeepEval has multiple signals for local or private AI stack, including matching tags, capabilities, category, or positioning. Required check: Verify hardware requirements, data path, storage, and whether all calls stay in your environment.
- Memory or RAG workflow: strong. DeepEval has multiple signals for memory or RAG workflow, including matching tags, capabilities, category, or positioning. Required check: Create, update, retrieve, correct, and delete memory or retrieval objects with real data.
- Browser automation: partial. DeepEval has at least one signal for browser automation, but should be checked against a real task before adoption. Required check: Run one non-sensitive website task and inspect clicks, waits, retries, and changed URLs.
- Connector or protocol layer: partial. DeepEval has at least one signal for connector or protocol layer, but should be checked against a real task before adoption. Required check: Connect one low-risk service, then inspect schemas, auth scope, errors, and logs.
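
The required checks for evaluation and RAG workflows call for a repeatable test case on real retrieval data. As a rough illustration of what such a check measures, the sketch below computes set-based precision and recall for one retrieval result. It is not how DeepEval computes its metrics: the function name `retrieval_precision_recall` and the document IDs are assumptions made up for this example.

```python
from typing import Iterable, Tuple


def retrieval_precision_recall(
    retrieved_ids: Iterable[str], relevant_ids: Iterable[str]
) -> Tuple[float, float]:
    """Return (precision, recall) for one query, ignoring ranking order.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical fixture: what the retriever returned vs. what a reviewer marked relevant.
retrieved = ["doc-1", "doc-4", "doc-7", "doc-9"]
relevant = ["doc-1", "doc-7", "doc-8"]

precision, recall = retrieval_precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the fixture is fixed, the same numbers come out on every run, which is the property the "confirm results can run again in review or CI" check is looking for.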

## Evidence
- verified: DeepEval is listed as open source. Source: License metadata: MIT
- verified: DeepEval has a recorded GitHub repository: confident-ai/deepeval. Source: Resource facts and GitHub source link.
- inferred: DeepEval supports this recorded deployment mode: self-hosted. Source: OpenAgent decision signal metadata.
- inferred: DeepEval is tagged with the capabilities tool, evals, testing, and automation. Source: OpenAgent capability taxonomy.

## Missing Checks
- Repository freshness has not been recorded.

## Next Actions
- Inspect repository: https://github.com/confident-ai/deepeval
- Read setup docs: https://docs.confident-ai.com
- Open homepage: https://www.confident-ai.com

## Facts
- Category: tools
- Resource type: tool
- Open source: yes
- License: MIT
- Last verified: 2026-06-24
- GitHub repo: confident-ai/deepeval
- GitHub stars: 42000

## Capabilities
- tool
- evals
- testing
- automation

## Structured Use Case Tags
- self-hosted-ai
- developer-workflow

## Links
- GitHub: https://github.com/confident-ai/deepeval
- Documentation: https://docs.confident-ai.com
- Homepage: https://www.confident-ai.com

## Structured Outputs
- JSON: https://www.openagent.bot/tools/deepeval.json
- Markdown: https://www.openagent.bot/tools/deepeval.md
- Agent JSON: https://www.openagent.bot/tools/deepeval.agent.json
- Canonical: https://www.openagent.bot/tools/deepeval