# LLM Judges

LLM judges use a language model to evaluate agent responses against custom criteria defined in a prompt file.
## Default Judge

When a test defines `criteria` but has no `assert` field, a default `llm-judge` runs automatically. The built-in prompt evaluates the response against your `criteria` and `expected_output`:
```yaml
tests:
  - id: simple-eval
    criteria: Correctly explains the bug and proposes a fix
    input: "Debug this function..."
    # No assert needed — default llm-judge evaluates against criteria
```

When `assert` is present, no default judge is added. To use an LLM judge alongside other evaluators, declare it explicitly. See How criteria and assert interact.
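For example, a test that combines an LLM judge with other evaluators must name the judge explicitly under `assert` (a sketch using only fields shown elsewhere on this page):

```yaml
tests:
  - id: explicit-judge
    criteria: Correctly explains the bug and proposes a fix
    input: "Debug this function..."
    assert:
      # assert is present, so the default judge is NOT added;
      # the llm-judge must be listed explicitly
      - name: semantic_check
        type: llm-judge
        prompt: ./judges/correctness.md
```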
## Configuration

Reference an LLM judge in your eval file:
```yaml
assert:
  - name: semantic_check
    type: llm-judge
    prompt: ./judges/correctness.md
```

## Prompt Files
The prompt file defines evaluation criteria and scoring guidelines. It can be a markdown text template or a TypeScript/JavaScript dynamic template.
### Markdown Template

Write evaluation instructions as markdown. Template variables are interpolated:
```markdown
# Evaluation Criteria

Evaluate the candidate's response to the following question:

**Question:** {{question}}

**Criteria:** {{criteria}}

**Reference Answer:** {{reference_answer}}

**Candidate Answer:** {{answer}}

## Scoring

Score the response from 0.0 to 1.0 based on:

1. Correctness — does the answer match the expected outcome?
2. Completeness — does it address all parts of the question?
3. Clarity — is the response clear and well-structured?
```

### Available Template Variables
Section titled “Available Template Variables”| Variable | Source |
|---|---|
question | First user message content |
criteria | Test criteria field |
reference_answer | Last expected message content |
answer | Last candidate response content |
metadata | Test metadata |
rubrics | Test rubrics (if defined) |
file_changes | Unified diff of workspace file changes (when workspace_template is configured) |
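Conceptually, rendering a markdown template substitutes each `{{variable}}` with the corresponding string. A minimal sketch of that substitution (illustrative only; AgentV's actual renderer may behave differently for edge cases):

```typescript
// Replace {{name}} placeholders with values from a variable map.
// Unknown variables are left untouched rather than erased.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, name) =>
    name in vars ? vars[name] : match
  );
}

const prompt = renderTemplate(
  "**Question:** {{question}}\n\n**Criteria:** {{criteria}}",
  {
    question: "What is 2+2?",
    criteria: "Answer must be arithmetically correct",
  }
);
console.log(prompt);
```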
### TypeScript Template

For dynamic prompt generation, use the `definePromptTemplate` function from `@agentv/eval`:
```typescript
#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/eval';

export default definePromptTemplate((ctx) => {
  const rubric = ctx.config?.rubric as string | undefined;

  return `You are evaluating an AI assistant's response.

## Question
${ctx.question}

## Candidate Answer
${ctx.answer}

${ctx.referenceAnswer ? `## Reference Answer\n${ctx.referenceAnswer}` : ''}

${rubric ? `## Evaluation Criteria\n${rubric}` : ''}

Evaluate and provide a score from 0 to 1.`;
});
```

## How It Works
1. AgentV renders the prompt template with variables from the test
2. The rendered prompt is sent to the judge target (configured in `targets.yaml`)
3. The LLM returns a structured evaluation with score, hits, misses, and reasoning
4. Results are recorded in the output JSONL
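The structured evaluation in step 3 can be pictured as a small record per test. The shape below is an illustrative sketch based on the fields named above (score, hits, misses, reasoning); the actual JSONL schema is AgentV's and may carry additional fields, and the 0.5 pass threshold is purely an example:

```typescript
// Sketch of a judge result as one JSONL line.
interface JudgeResult {
  score: number;     // 0.0 to 1.0
  hits: string[];    // criteria the response satisfied
  misses: string[];  // criteria the response missed
  reasoning: string; // the judge's explanation
}

const line =
  '{"score":0.75,"hits":["explains the bug"],"misses":["no fix proposed"],"reasoning":"Partially correct."}';
const result: JudgeResult = JSON.parse(line);
console.log(result.score >= 0.5 ? "pass" : "fail"); // → pass
```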
## Command Configuration

When using TypeScript templates, configure them in YAML with optional `config` data passed to the command:
```yaml
assert:
  - name: custom-eval
    type: llm-judge
    prompt:
      command: [bun, run, ../prompts/custom-evaluator.ts]
      config:
        rubric: "Your rubric here"
        strictMode: true
```

The `config` object is available as `ctx.config` inside the template function.
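A template can branch on those config values. The sketch below stubs the context locally so it runs standalone; in a real template you would wrap the function in `definePromptTemplate` from `@agentv/eval`, and the `strictMode` wording is an illustrative assumption:

```typescript
// Local stand-in for the context a template receives.
type Ctx = { config?: Record<string, unknown> };

const template = (ctx: Ctx): string => {
  const rubric = ctx.config?.rubric as string | undefined;
  const strict = ctx.config?.strictMode === true;
  return [
    "You are evaluating an AI assistant's response.",
    rubric ? `Rubric: ${rubric}` : "",
    strict ? "Be strict: score 0 unless every criterion is fully met." : "",
  ]
    .filter(Boolean)
    .join("\n");
};

const rendered = template({ config: { rubric: "Your rubric here", strictMode: true } });
console.log(rendered);
```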
## Available Context Fields

TypeScript templates receive a context object with these fields:
| Field | Type | Description |
|---|---|---|
| `question` | `string` | First user message content |
| `answer` | `string` | Last entry in `output` |
| `referenceAnswer` | `string` | Last entry in `expected_output` |
| `criteria` | `string` | Test `criteria` field |
| `expectedOutput` | `Message[]` | Full resolved expected output |
| `output` | `Message[]` | Full provider output messages |
| `trace` | `TraceSummary` | Execution metrics summary |
| `config` | `object` | Custom config from YAML |
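Besides the derived strings, templates get the full message arrays (`expectedOutput`, `output`), which is useful for multi-turn evaluation. A sketch with locally stubbed types (illustrative; not AgentV's exact type definitions):

```typescript
// Minimal stand-ins for the context fields used below.
type Message = { role: string; content: string };
type Ctx = { question: string; answer: string; output: Message[] };

// Example: a judge prompt might want to know how many assistant
// turns the candidate produced, not just the final answer.
const countAssistantTurns = (ctx: Ctx): number =>
  ctx.output.filter((m) => m.role === "assistant").length;

const ctx: Ctx = {
  question: "What is 2+2?",
  answer: "4",
  output: [
    { role: "assistant", content: "Let me compute." },
    { role: "assistant", content: "4" },
  ],
};

const turns = countAssistantTurns(ctx);
console.log(turns); // → 2
```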
## Template Variable Derivation

Template variables are derived internally through three layers:
### 1. Authoring Layer

What users write in YAML or JSONL:
- `input` — two syntaxes for the same data: a plain string or a full message array. `input: "What is 2+2?"` expands to `[{ role: "user", content: "What is 2+2?" }]`.
- `expected_output` — two syntaxes for the same data: a plain string or a full message array. `expected_output: "4"` expands to `[{ role: "assistant", content: "4" }]`.
### 2. Resolved Layer

After parsing, canonical message arrays replace the shorthand fields:
- `input: TestMessage[]` — canonical resolved input
- `expected_output: TestMessage[]` — canonical resolved expected output
At this layer, the string shorthands no longer exist; `input` and `expected_output` are always message arrays.
### 3. Template Variable Layer

Derived strings injected into evaluator prompts:
| Variable | Derivation |
|---|---|
| `question` | Content of the first `user` role entry in `input` |
| `criteria` | Passed through from the test field |
| `reference_answer` | Content of the last entry in `expected_output` |
| `answer` | Content of the last entry in `output` |
| `input` | Full resolved input array, JSON-serialized |
| `expected_output` | Full resolved expected array, JSON-serialized |
| `output` | Full provider output array, JSON-serialized |
| `file_changes` | Unified diff of workspace file changes (when `workspace_template` is configured) |
Example flow:

```yaml
# User writes:
input: "What is 2+2?"
expected_output: "The answer is 4"

# Resolved:
input: [{ role: "user", content: "What is 2+2?" }]
expected_output: [{ role: "assistant", content: "The answer is 4" }]

# Derived template variables:
question: "What is 2+2?"
reference_answer: "The answer is 4"
answer: (extracted from provider output at runtime)
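The derivation rules in the table above can be sketched as a small function (illustrative types; not AgentV's internal implementation):

```typescript
// Derive the string template variables from resolved message arrays.
type Msg = { role: "user" | "assistant"; content: string };

function deriveVars(input: Msg[], expected: Msg[], output: Msg[]) {
  return {
    // first user-role entry in input
    question: input.find((m) => m.role === "user")?.content ?? "",
    // last entry in expected_output
    reference_answer: expected.at(-1)?.content ?? "",
    // last entry in provider output
    answer: output.at(-1)?.content ?? "",
  };
}

const vars = deriveVars(
  [{ role: "user", content: "What is 2+2?" }],
  [{ role: "assistant", content: "The answer is 4" }],
  [{ role: "assistant", content: "4" }],
);
console.log(vars.question); // → "What is 2+2?"
```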