
Autoevals Integration

Braintrust’s autoevals is an open-source library (Apache 2.0, 800+ stars) with 25+ production-tested scorers for evaluating AI outputs. It includes LLM-as-a-judge evaluations (Factuality, Faithfulness, ClosedQA), RAG metrics (ContextRelevancy, ContextRecall, AnswerRelevancy), and heuristic checks (JSONDiff, EmbeddingSimilarity).

Key points:

  • Works standalone — no Braintrust platform account required
  • Uses any OpenAI-compatible endpoint for LLM-based scorers
  • Integrates with AgentV via the code-judge evaluator type: wrap any autoevals scorer in a command that reads stdin and writes the AgentV judge result to stdout
```shell
# TypeScript
npm install autoevals

# Python
pip install autoevals
```

Set your API key for LLM-based scorers:

```shell
export OPENAI_API_KEY="sk-..."
```
| Scorer | Use Case | Key Parameters |
| --- | --- | --- |
| Factuality | Is the answer factually consistent with the expected answer? | input, output, expected |
| ClosedQA | Does the answer correctly address the question given criteria? | input, output, expected |
| Faithfulness | Is the output faithful to the provided context (no hallucination)? | input, output, expected |
| ContextRelevancy | Is the retrieved context relevant to the question? | input, output, expected |
| ContextRecall | Does the context contain the information needed to answer? | input, output, expected |
| AnswerRelevancy | Is the answer relevant to the question asked? | input, output, expected |
| Summary | Does the summary accurately capture the source material? | input, output, expected |
| Translation | Is the translation accurate and natural? | input, output, expected |
| JSONDiff | Structural diff between JSON objects (heuristic, no LLM) | output, expected |
| EmbeddingSimilarity | Cosine similarity between embeddings (no LLM) | output, expected |
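The two heuristic scorers run without an API key. As a rough sketch of what EmbeddingSimilarity computes once it has embedding vectors in hand (the real scorer fetches them from an embeddings endpoint; this helper is illustrative, not autoevals API):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Identical (or parallel) vectors score 1.0 and orthogonal vectors score 0.0, which is why the scorer needs no judge model.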

All LLM-based scorers return a score (0–1) and metadata.rationale explaining the judgment.

Use the Factuality scorer as an AgentV code-judge to verify answer correctness.

EVAL.yaml:

```yaml
tests:
  - id: capital-city
    input:
      - role: user
        content: "What is the capital of France?"
    expected_output: "Paris is the capital of France."
    assert:
      - name: factuality
        type: code-judge
        command: ["bun", "run", "judges/factuality.ts"]
```

judges/factuality.ts:

```typescript
#!/usr/bin/env bun
import { readFileSync } from "fs";
import { Factuality } from "autoevals";

const input = JSON.parse(readFileSync("/dev/stdin", "utf-8"));

const result = await Factuality({
  input: input.question,
  output: input.answer,
  expected: input.reference_answer,
});

const score = result.score ?? 0;
const rationale = result.metadata?.rationale ?? "No rationale provided";

console.log(
  JSON.stringify({
    score,
    hits: score >= 0.5 ? [rationale] : [],
    misses: score < 0.5 ? [rationale] : [],
    reasoning: rationale,
  })
);
```

The code judge reads the AgentV input from stdin (question, answer, reference_answer), maps the fields to autoevals parameters (input, output, expected), runs the scorer, and writes the AgentV result format to stdout.
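That contract is easy to see end-to-end without calling an LLM at all. A minimal stand-in judge (a hypothetical substring check in place of Factuality, kept only to show the I/O shape):

```python
def judge(data: dict) -> dict:
    """Score an AgentV judge input; a substring check stands in for a real scorer."""
    reference = data.get("reference_answer", "")
    answer = data.get("answer", "")
    # 1.0 if the reference text appears in the answer, else 0.0.
    score = 1.0 if reference and reference.lower() in answer.lower() else 0.0
    note = (
        "answer contains the reference text"
        if score >= 0.5
        else "reference text not found in answer"
    )
    # The AgentV result format: score plus hits/misses/reasoning.
    return {
        "score": score,
        "hits": [note] if score >= 0.5 else [],
        "misses": [note] if score < 0.5 else [],
        "reasoning": note,
    }

# Wired up as a judge script, this would end with:
#   import json, sys
#   print(json.dumps(judge(json.load(sys.stdin))))
```

Swapping the substring check for an autoevals scorer call is the only change needed to get the real judges shown in this guide.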

Use the Faithfulness scorer to detect hallucination in a RAG pipeline.

EVAL.yaml:

```yaml
tests:
  - id: rag-faithfulness
    input:
      - role: user
        content: "Summarize the key findings from the research paper."
    expected_output: "The paper found that transformer models outperform RNNs on long-range tasks."
    assert:
      - name: faithfulness
        type: code-judge
        command: ["python", "judges/faithfulness.py"]
```

judges/faithfulness.py:

```python
#!/usr/bin/env python3
import json
import sys

from autoevals import Faithfulness

data = json.load(sys.stdin)

evaluator = Faithfulness()
result = evaluator(
    input=data.get("question", ""),
    output=data.get("answer", ""),
    expected=data.get("reference_answer", ""),
)

score = result.score or 0
rationale = (result.metadata or {}).get("rationale", "No rationale provided")

print(json.dumps({
    "score": score,
    "hits": [rationale] if score >= 0.5 else [],
    "misses": [rationale] if score < 0.5 else [],
    "reasoning": rationale,
}))
```

Autoevals uses OPENAI_API_KEY and OPENAI_BASE_URL by default. To point it at any OpenAI-compatible endpoint without a Braintrust account:

TypeScript:

```typescript
import OpenAI from "openai";
import { init } from "autoevals";

init({
  client: new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
    baseURL: "https://api.openai.com/v1/",
  }),
});
```

Python:

```python
import os

import openai
from autoevals import init

init(openai.AsyncOpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://api.openai.com/v1/",
))
```

You can also configure per-scorer by passing a client parameter:

```typescript
const result = await Factuality({
  client: new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
  input: "...",
  output: "...",
  expected: "...",
});
```

Combine multiple autoevals scorers in a single code judge for comprehensive RAG evaluation.

EVAL.yaml:

```yaml
tests:
  - id: rag-pipeline
    input:
      - role: user
        content: "What are the benefits of exercise?"
    expected_output: "Exercise improves cardiovascular health, mental well-being, and longevity."
    assert:
      - name: rag-quality
        type: code-judge
        command: ["bun", "run", "judges/rag-suite.ts"]
        weight: 1.0
```

judges/rag-suite.ts:

```typescript
#!/usr/bin/env bun
import { readFileSync } from "fs";
import {
  Factuality,
  Faithfulness,
  AnswerRelevancy,
  ContextRelevancy,
} from "autoevals";

const input = JSON.parse(readFileSync("/dev/stdin", "utf-8"));

const scorerArgs = {
  input: input.question,
  output: input.answer,
  expected: input.reference_answer,
};

// Run all scorers in parallel
const [factuality, faithfulness, answerRelevancy, contextRelevancy] =
  await Promise.all([
    Factuality(scorerArgs),
    Faithfulness(scorerArgs),
    AnswerRelevancy(scorerArgs),
    ContextRelevancy(scorerArgs),
  ]);

const results = [
  { name: "Factuality", ...factuality },
  { name: "Faithfulness", ...faithfulness },
  { name: "Answer Relevancy", ...answerRelevancy },
  { name: "Context Relevancy", ...contextRelevancy },
];

const hits: string[] = [];
const misses: string[] = [];
for (const r of results) {
  const score = r.score ?? 0;
  const rationale = r.metadata?.rationale ?? "No rationale";
  if (score >= 0.5) {
    hits.push(`${r.name} (${score.toFixed(2)}): ${rationale}`);
  } else {
    misses.push(`${r.name} (${score.toFixed(2)}): ${rationale}`);
  }
}

const avgScore =
  results.reduce((sum, r) => sum + (r.score ?? 0), 0) / results.length;

console.log(
  JSON.stringify({
    score: avgScore,
    hits,
    misses,
    reasoning: `Average score across ${results.length} RAG metrics: ${avgScore.toFixed(2)}`,
  })
);
```

This pattern runs Factuality, Faithfulness, AnswerRelevancy, and ContextRelevancy in parallel and returns a composite score. Add or remove scorers to match your pipeline’s requirements.
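If some metrics matter more than others, the flat average at the end of the suite can be swapped for a weighted one. A sketch of that variant (the helper and the weights are illustrative, not part of autoevals or AgentV):

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of named metric scores; metrics absent from
    `weights` default to a weight of 1.0."""
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    weighted = sum(s * weights.get(name, 1.0) for name, s in scores.items())
    return weighted / total_weight

# e.g. count Faithfulness twice as heavily as the other metrics:
#   weighted_score(per_metric_scores, {"Faithfulness": 2.0})
```

The composite judge above would then report this value as its score while still listing each metric's rationale in hits and misses.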