MIT · Tools

DeepEval

Open-source LLM evaluation framework for testing RAG pipelines, agent workflows, and LLM outputs with metrics and CI/CD integration.

42K stars · 2.2K forks · MIT license · Verified 2026-06-24

bash

# DeepEval
$ pip install deepeval
$ deepeval --help

Open source · Local first · Self-hosted

Overview

What is DeepEval?

DeepEval is an MIT-licensed LLM evaluation framework that provides over 15 built-in metrics for evaluating RAG pipelines, agentic workflows, retrieval quality, hallucinations, and conversation safety, with Pytest integration for CI/CD.

Ecosystem

Tags & capabilities

tool, workflow, automation, workflow orchestration, tool calling, open source, self hosted, local first

FAQ

Questions

What is DeepEval?

DeepEval is an open-source LLM evaluation framework that provides over 15 built-in metrics for testing RAG pipelines, agent workflows, and LLM outputs, with native Pytest integration for CI/CD.

Is DeepEval free?

Yes, DeepEval is MIT-licensed and completely free to use. There is also a managed platform for team collaboration and test result visualization.

What metrics does DeepEval support?

DeepEval provides more than 15 evaluation metrics, including faithfulness, answer relevancy, hallucination, bias, toxicity, G-Eval, summarization, precision, and recall.
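
To make the precision and recall entries concrete, here is a minimal sketch of what those two quantities measure for a retriever, using plain set arithmetic over document IDs. It is an illustration only and does not use DeepEval's API: DeepEval's own metrics are typically scored by an LLM judge rather than by exact ID matching, and the function name and sample IDs below are hypothetical.

```python
def retrieval_precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document IDs returned by the retriever
    relevant:  document IDs a reviewer marked as relevant
    Hypothetical helper for illustration; not part of DeepEval.
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set
    # Guard against division by zero when either side is empty.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


precision, recall = retrieval_precision_recall(
    retrieved=["doc1", "doc2", "doc3", "doc4"],
    relevant=["doc2", "doc4", "doc7"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.50 recall=0.67
```

Precision penalizes irrelevant documents in the retrieved set, while recall penalizes relevant documents the retriever missed, which is why the two are usually reported together.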

How does DeepEval compare to promptfoo?

Both are evaluation tools with different approaches. DeepEval focuses on Pytest-integrated metric-based evaluation for RAG and agents, while promptfoo emphasizes prompt testing and red-teaming with a declarative config approach.

Decision brief

Should you use DeepEval?

Best for

Teams evaluating RAG pipeline quality
Developers adding LLM evaluation to CI/CD
Builders testing agent behavior and conversation safety

Not for

Teams that need only production monitoring
Users who want a single benchmark score without custom test cases

Trust and freshness

Verified 2026-06-24
License: MIT
Repo: confident-ai/deepeval
Open-source signal

Deployment

self hosted

Permission surface

memory

Decision signals

Local first, Self-hostable

Agent packet

Structured decision data for DeepEval

This packet is the compact machine-readable view agents should use before following source links or taking action.

Capabilities

tool, evals, testing, automation, workflow orchestration, tool calling

Constraints

open source, self hosted, local first

Deployment

self hosted

Permission surface

memory

Recommended workflows

Coding agent workflow, Evaluation and observability, Local or private AI stack, Memory or RAG workflow

Overview

What DeepEval does

What it is

DeepEval is an MIT-licensed LLM evaluation framework in the Tools category. It provides over 15 built-in metrics for evaluating RAG pipelines, agentic workflows, retrieval quality, hallucinations, and conversation safety, and integrates with Pytest for CI/CD.

Why it matters

Teams shipping agent applications need systematic evaluation pipelines, not ad-hoc testing. DeepEval gives builders a practical way to test LLM outputs, RAG retrieval quality, and agent behavior with familiar Pytest workflows.

How to evaluate it

Evaluate DeepEval by starting from the official sources: check its repository, docs, and interface surface, then run one narrow workflow before expanding scope.

Facts

Known metadata and operating surface

These fields are separated from editorial interpretation so agents can reason over facts and missing checks.

Resource type: tool
Category: Tools
Maturity: active
Difficulty: Unknown
License: MIT
Pricing: open source
Verified: 2026-06-24
Source confidence: high
Risk level: low

Fit matrix

Where DeepEval fits in an agent stack

strong

Coding agent workflow

DeepEval has multiple signals for coding agent workflow, including matching tags, capabilities, category, or positioning.

Run a small repository change and inspect the diff, tests, and rollback path.
Confirm official docs, current maintenance, license, and runtime constraints before production use.

strong

Evaluation and observability

DeepEval has multiple signals for evaluation and observability, including matching tags, capabilities, category, or positioning.

Add one repeatable test case and confirm results can run again in review or CI.
Confirm official docs, current maintenance, license, and runtime constraints before production use.
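
One way to make a test case repeatable across review and CI runs is to compare each run's scores against a stored baseline. The sketch below assumes scores are kept as a simple metric-to-number mapping. The `find_regressions` helper, the `TOLERANCE` value, and the sample scores are all hypothetical, and nothing here relies on DeepEval's API.

```python
import json

TOLERANCE = 0.05  # assumed allowable drop before a score counts as a regression


def find_regressions(baseline: dict, current: dict, tolerance: float = TOLERANCE) -> dict:
    """Return {metric: (baseline_score, current_score)} for scores that dropped too far."""
    regressions = {}
    for metric, old_score in baseline.items():
        new_score = current.get(metric)
        # A metric missing from the current run is treated as a regression.
        if new_score is None or new_score < old_score - tolerance:
            regressions[metric] = (old_score, new_score)
    return regressions


# Baseline as it might be stored in the repository next to the tests.
baseline = json.loads('{"faithfulness": 0.90, "answer_relevancy": 0.85}')
current = {"faithfulness": 0.91, "answer_relevancy": 0.70}

print(find_regressions(baseline, current))
# prints: {'answer_relevancy': (0.85, 0.7)}
```

An empty result means the run is at least as good as the baseline within the tolerance, so the same check can be rerun unchanged in review or CI.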

strong

Local or private AI stack

DeepEval has multiple signals for local or private AI stack, including matching tags, capabilities, category, or positioning.

Verify hardware requirements, data path, storage, and whether all calls stay in your environment.
Confirm official docs, current maintenance, license, and runtime constraints before production use.

strong

Memory or RAG workflow

DeepEval has multiple signals for memory or RAG workflow, including matching tags, capabilities, category, or positioning.

Create, update, retrieve, correct, and delete memory or retrieval objects with real data.
Confirm official docs, current maintenance, license, and runtime constraints before production use.

partial

Browser automation

DeepEval has at least one signal for browser automation, but should be checked against a real task before adoption.

Run one non-sensitive website task and inspect clicks, waits, retries, and changed URLs.
Confirm official docs, current maintenance, license, and runtime constraints before production use.

partial

Connector or protocol layer

DeepEval has at least one signal for connector or protocol layer, but should be checked against a real task before adoption.

Connect one low-risk service, then inspect schemas, auth scope, errors, and logs.
Confirm official docs, current maintenance, license, and runtime constraints before production use.

Inputs and outputs

What an agent should inspect

Likely inputs

Repositories, files, issues, terminal output, and test results
Documents, user facts, entities, context, or retrieval queries
Official setup instructions and a small real workflow

Likely outputs

Diffs, commits, explanations, test results, or review notes
Retrieved context, memory updates, graph relations, or citations
Scores, traces, regression results, dashboards, or failure cases
A decision on whether this resource fits the target workflow

Evidence

Sources, claims, and missing checks

Claims are marked separately from source links so future crawlers and reviewers can update them without rewriting the page.

GitHub github

Repository source for code, license, issues, releases, and implementation details.

Documentation docs

Documentation source for setup, API shape, and operational behavior.

Homepage homepage

Official or project-controlled source for this resource profile.

verified

DeepEval is listed as open source.

License metadata: MIT

verified

DeepEval has a recorded GitHub repository: confident-ai/deepeval.

Resource facts and GitHub source link.

inferred

DeepEval supports these recorded deployment modes: self hosted.

OpenAgent decision signal metadata.

inferred

DeepEval is tagged with tool, evals, testing, automation capabilities.

OpenAgent capability taxonomy.

Missing checks

Repository freshness has not been recorded.

Next action

How to start evaluating DeepEval

Inspect repository

Check license, recent activity, issues, examples, and security-sensitive code paths.

Read setup docs

Use docs as the source of truth for installation and supported interfaces.

Open Homepage

Start from the official source before adopting third-party instructions.

Compare

Alternatives and nearby resources

Use related resources to compare category fit, license, deployment model, and first-workflow behavior.

FAQ

Common questions about DeepEval

Can DeepEval be used in CI/CD?

Yes. DeepEval integrates natively with Pytest, so tests run as standard Pytest suites in any CI/CD pipeline. It also integrates with GitHub Actions, Jenkins, and CircleCI.
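
As a rough sketch of the pattern described above, the following shows a threshold-gated check written as a plain Pytest-style test function. It deliberately uses only the standard library: `keyword_overlap_score` is a hypothetical stand-in for a real evaluation metric, and `THRESHOLD` and the sample strings are assumptions. In a real suite the score would come from a DeepEval metric as described in its documentation.

```python
import re

THRESHOLD = 0.5  # assumed pass mark; tune per metric and use case


def keyword_overlap_score(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords found in the answer (toy stand-in metric)."""
    words = set(re.findall(r"[a-z0-9]+", answer.lower()))
    if not expected_keywords:
        return 0.0
    found = sum(1 for kw in expected_keywords if kw.lower() in words)
    return found / len(expected_keywords)


def test_refund_answer_meets_threshold():
    # In practice this string would be produced by the LLM application under test.
    answer = "You can request a refund within 30 days of purchase."
    score = keyword_overlap_score(answer, ["refund", "30", "days"])
    # A failed assertion fails the test, which in turn fails the CI job.
    assert score >= THRESHOLD, f"score {score:.2f} is below {THRESHOLD}"
```

Saved in a file whose name starts with `test_`, this function would be collected by Pytest, and a failing assertion would produce a non-zero exit code that stops the pipeline step.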