# Autoevals Integration
## Overview

Braintrust’s autoevals is an open-source library (Apache 2.0, 800+ stars) with 25+ production-tested scorers for evaluating AI outputs. It includes LLM-as-a-judge scorers (Factuality, Faithfulness, ClosedQA), RAG metrics (ContextRelevancy, ContextRecall, AnswerRelevancy), and heuristic checks (JSONDiff, EmbeddingSimilarity).
Key points:

- Works standalone — no Braintrust platform account required
- Uses any OpenAI-compatible endpoint for LLM-based scorers
- Integrates with AgentV via the `code-judge` evaluator type: wrap any autoevals scorer in a command that reads stdin and writes the AgentV judge result to stdout
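That stdin/stdout contract can be sketched as a pure mapping from an autoevals-style score to the judge result shape used throughout this page (`toJudgeResult` is a hypothetical helper for illustration, not part of AgentV or autoevals; the field names are taken from the examples on this page):

```ts
// Result shape an AgentV code judge prints to stdout (per the examples
// on this page): an overall score plus hit/miss criteria and reasoning.
interface JudgeResult {
  score: number;      // 0–1
  hits: string[];     // satisfied criteria
  misses: string[];   // failed criteria
  reasoning: string;
}

// Hypothetical helper: map a scorer's score/rationale pair into that shape,
// treating 0.5 as the pass threshold (as the judges below do).
function toJudgeResult(score: number, rationale: string): JudgeResult {
  return {
    score,
    hits: score >= 0.5 ? [rationale] : [],
    misses: score < 0.5 ? [rationale] : [],
    reasoning: rationale,
  };
}

console.log(JSON.stringify(toJudgeResult(0.8, "Answer matches reference")));
```

Every judge in this guide follows this pattern: parse stdin, run a scorer, print a `JudgeResult`.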
## Installation

```shell
# TypeScript
npm install autoevals

# Python
pip install autoevals
```

Set your API key for LLM-based scorers:

```shell
export OPENAI_API_KEY="sk-..."
```

## Available Scorers
| Scorer | Use Case | Key Parameters |
|---|---|---|
| Factuality | Is the answer factually consistent with the expected answer? | `input`, `output`, `expected` |
| ClosedQA | Does the answer correctly address the question given criteria? | `input`, `output`, `expected` |
| Faithfulness | Is the output faithful to the provided context (no hallucination)? | `input`, `output`, `expected` |
| ContextRelevancy | Is the retrieved context relevant to the question? | `input`, `output`, `expected` |
| ContextRecall | Does the context contain the information needed to answer? | `input`, `output`, `expected` |
| AnswerRelevancy | Is the answer relevant to the question asked? | `input`, `output`, `expected` |
| Summary | Does the summary accurately capture the source material? | `input`, `output`, `expected` |
| Translation | Is the translation accurate and natural? | `input`, `output`, `expected` |
| JSONDiff | Structural diff between JSON objects (heuristic, no LLM) | `output`, `expected` |
| EmbeddingSimilarity | Cosine similarity between embeddings (no LLM) | `output`, `expected` |
All LLM-based scorers return a `score` (0–1) and a `metadata.rationale` explaining the judgment.
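The heuristic scorers need no LLM call at all. EmbeddingSimilarity, for instance, is built on cosine similarity between embedding vectors; the underlying metric is just the normalized dot product (a minimal sketch of the raw metric only — the library may additionally rescale it into the 0–1 range):

```ts
// Cosine similarity between two equal-length vectors: dot product divided
// by the product of the vector norms. This is the metric behind
// EmbeddingSimilarity, shown here without any rescaling.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([1, 0], [1, 0]); // identical directions → 1
cosineSimilarity([1, 0], [0, 1]); // orthogonal → 0
```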
## TypeScript Example

Use the Factuality scorer as an AgentV `code-judge` to verify answer correctness.
`EVAL.yaml`:

```yaml
tests:
  - id: capital-city
    input:
      - role: user
        content: "What is the capital of France?"
    expected_output: "Paris is the capital of France."
    assert:
      - name: factuality
        type: code-judge
        command: ["bun", "run", "judges/factuality.ts"]
```

`judges/factuality.ts`:
```ts
#!/usr/bin/env bun
import { readFileSync } from "fs";
import { Factuality } from "autoevals";

const input = JSON.parse(readFileSync("/dev/stdin", "utf-8"));

const result = await Factuality({
  input: input.question,
  output: input.answer,
  expected: input.reference_answer,
});

const score = result.score ?? 0;
const rationale = result.metadata?.rationale ?? "No rationale provided";

console.log(
  JSON.stringify({
    score,
    hits: score >= 0.5 ? [rationale] : [],
    misses: score < 0.5 ? [rationale] : [],
    reasoning: rationale,
  })
);
```

The code judge reads the AgentV input from stdin (`question`, `answer`, `reference_answer`), maps the fields to autoevals parameters (`input`, `output`, `expected`), runs the scorer, and writes the AgentV result format to stdout.
## Python Example

Use the Faithfulness scorer to detect hallucination in a RAG pipeline.
`EVAL.yaml`:

```yaml
tests:
  - id: rag-faithfulness
    input:
      - role: user
        content: "Summarize the key findings from the research paper."
    expected_output: "The paper found that transformer models outperform RNNs on long-range tasks."
    assert:
      - name: faithfulness
        type: code-judge
        command: ["python", "judges/faithfulness.py"]
```

`judges/faithfulness.py`:
```python
#!/usr/bin/env python3
import json
import sys

from autoevals import Faithfulness

data = json.load(sys.stdin)

evaluator = Faithfulness()
result = evaluator(
    input=data.get("question", ""),
    output=data.get("answer", ""),
    expected=data.get("reference_answer", ""),
)

score = result.score or 0
rationale = (result.metadata or {}).get("rationale", "No rationale provided")

print(json.dumps({
    "score": score,
    "hits": [rationale] if score >= 0.5 else [],
    "misses": [rationale] if score < 0.5 else [],
    "reasoning": rationale,
}))
```

## Configuration
Autoevals uses `OPENAI_API_KEY` and `OPENAI_BASE_URL` by default. To point it at any OpenAI-compatible endpoint without a Braintrust account:
### TypeScript

```ts
import OpenAI from "openai";
import { init } from "autoevals";

init({
  client: new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
    baseURL: "https://api.openai.com/v1/",
  }),
});
```

### Python

```python
import os

import openai
from autoevals import init

init(openai.AsyncOpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://api.openai.com/v1/",
))
```

You can also configure per-scorer by passing a `client` parameter:
```ts
const result = await Factuality({
  client: new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
  input: "...",
  output: "...",
  expected: "...",
});
```

## RAG Evaluation Suite
Combine multiple autoevals scorers in a single code judge for comprehensive RAG evaluation.
`EVAL.yaml`:

```yaml
tests:
  - id: rag-pipeline
    input:
      - role: user
        content: "What are the benefits of exercise?"
    expected_output: "Exercise improves cardiovascular health, mental well-being, and longevity."
    assert:
      - name: rag-quality
        type: code-judge
        command: ["bun", "run", "judges/rag-suite.ts"]
        weight: 1.0
```

`judges/rag-suite.ts`:
```ts
#!/usr/bin/env bun
import { readFileSync } from "fs";
import {
  Factuality,
  Faithfulness,
  AnswerRelevancy,
  ContextRelevancy,
} from "autoevals";

const input = JSON.parse(readFileSync("/dev/stdin", "utf-8"));

const scorerArgs = {
  input: input.question,
  output: input.answer,
  expected: input.reference_answer,
};

// Run all scorers in parallel
const [factuality, faithfulness, answerRelevancy, contextRelevancy] =
  await Promise.all([
    Factuality(scorerArgs),
    Faithfulness(scorerArgs),
    AnswerRelevancy(scorerArgs),
    ContextRelevancy(scorerArgs),
  ]);

// Spread first so our display name isn't overwritten by the `name`
// field each scorer sets on its own result.
const results = [
  { ...factuality, name: "Factuality" },
  { ...faithfulness, name: "Faithfulness" },
  { ...answerRelevancy, name: "Answer Relevancy" },
  { ...contextRelevancy, name: "Context Relevancy" },
];

const hits: string[] = [];
const misses: string[] = [];

for (const r of results) {
  const score = r.score ?? 0;
  const rationale = r.metadata?.rationale ?? "No rationale";
  if (score >= 0.5) {
    hits.push(`${r.name} (${score.toFixed(2)}): ${rationale}`);
  } else {
    misses.push(`${r.name} (${score.toFixed(2)}): ${rationale}`);
  }
}

const avgScore =
  results.reduce((sum, r) => sum + (r.score ?? 0), 0) / results.length;

console.log(
  JSON.stringify({
    score: avgScore,
    hits,
    misses,
    reasoning: `Average score across ${results.length} RAG metrics: ${avgScore.toFixed(2)}`,
  })
);
```

This pattern runs Factuality, Faithfulness, AnswerRelevancy, and ContextRelevancy in parallel and returns a composite score. Add or remove scorers to match your pipeline’s requirements.
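If some metrics matter more to your pipeline than others, the plain average can be swapped for a weighted one. A minimal sketch (the `weightedScore` helper and the weights shown are hypothetical, not part of autoevals or AgentV):

```ts
// Weighted average over named scores. Weights are illustrative; a metric
// missing from `scores` contributes 0 but its weight still counts.
function weightedScore(
  scores: Record<string, number>,
  weights: Record<string, number>
): number {
  let total = 0;
  let weightSum = 0;
  for (const [name, w] of Object.entries(weights)) {
    total += (scores[name] ?? 0) * w;
    weightSum += w;
  }
  return weightSum > 0 ? total / weightSum : 0;
}

// E.g. weight Factuality twice as heavily as Faithfulness:
weightedScore(
  { Factuality: 1.0, Faithfulness: 0.5 },
  { Factuality: 2, Faithfulness: 1 }
); // (1.0*2 + 0.5*1) / 3 ≈ 0.83
```

In the judge above, this would replace the `avgScore` computation, with the per-scorer scores collected into a record keyed by name.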