If you’re a developer, your first instinct when testing code is simple:
- Call the function.
- Get the result.
- Compare it with what you expected.
That works great for normal code.
But with LLMs, the answer is not always the same. One response might say "3 PM", another might say "15:00", and another might say "Friday afternoon".
Depending on your rules, all three might be acceptable.
So the question becomes less about “does this text match exactly?” and more about “is this answer actually good?”
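A minimal sketch of why exact matching falls short. The answer strings below are made up for illustration, not real model output:

```csharp
using System;

// Hypothetical answers to the same question; all of them mean 3 PM.
string expected = "3 PM";
string[] answers = { "3 PM", "15:00", "3pm" };

foreach (var answer in answers)
{
    // A classic exact-match assert only accepts one spelling of the answer.
    bool exactMatch = string.Equals(answer, expected, StringComparison.Ordinal);
    Console.WriteLine($"{answer}: exact match = {exactMatch}");
}
```

Only the first comparison reports a match, even though all three strings describe the same time.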
Sample Project: EventParser
To keep this practical, we’ll use a small throwaway app called EventParser.
The job of the app is simple: take a casual message like “Team sync on Friday at 3 PM in the Lagos office” and ask an LLM to extract the event details as structured data.
Here’s the project layout:
EventParser/
├── EventParser.sln
├── src/
│   └── EventParser/
│       ├── EventParser.csproj
│       ├── Program.cs                 # console entry point
│       └── Services/
│           ├── ILlmClient.cs          # tiny abstraction over your LLM call
│           └── EventParserService.cs  # loads the prompt, calls the model
└── prompts/
    ├── extract_event.txt              # THE PROMPT: shipped by C#, graded by Promptfoo
    └── eval/
        ├── promptfooconfig.yaml       # models under test + the judge model
        └── golden_set.json            # test cases: input + llm-rubric
The important file here is extract_event.txt.
That prompt lives in one place. The C# service reads it at runtime, and Promptfoo reads the same file when it runs the eval. That means we are testing the real prompt used by the app, not a copied version written only for tests.
You can get a sample of the default project here
Looking at EventParserService, we can see how the prompt we’re trying to test is used. The service loads the prompt from extract_event.txt, inserts the user’s message, and sends the final prompt to the LLM.
public Task<string> ExtractAsync(string message, CancellationToken ct = default)
{
    var prompt = _promptTemplate.Replace("{{message}}", message);
    return _llm.CompleteAsync(prompt, ct);
}
The C# code is just the delivery path. The real question is whether extract_event.txt gives the model enough instruction to return good event data.
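To make the delivery path concrete, here is a rough, self-contained sketch of how the pieces could fit together. Only ExtractAsync comes from the project; the ILlmClient signature, the constructor, the EchoLlmClient stand-in, and the temp-file template are assumptions made for illustration, so the real files may look different:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Usage: write a throwaway template to a temp file and run it through a fake client.
var path = Path.Combine(Path.GetTempPath(), "extract_event_demo.txt");
File.WriteAllText(path, "Extract the event from this message:\n{{message}}");

var service = new EventParserService(new EchoLlmClient(), path);
Console.WriteLine(await service.ExtractAsync("Team sync on Friday at 3 PM"));

// Assumed shape of the abstraction; the real ILlmClient.cs may differ.
public interface ILlmClient
{
    Task<string> CompleteAsync(string prompt, CancellationToken ct = default);
}

// Hypothetical stand-in that returns the prompt it receives, so no API key is needed.
public sealed class EchoLlmClient : ILlmClient
{
    public Task<string> CompleteAsync(string prompt, CancellationToken ct = default)
        => Task.FromResult(prompt);
}

public sealed class EventParserService
{
    private readonly ILlmClient _llm;
    private readonly string _promptTemplate;

    // Assumption: the prompt path is passed in and the file is read once at construction.
    public EventParserService(ILlmClient llm, string promptPath)
    {
        _llm = llm;
        _promptTemplate = File.ReadAllText(promptPath);
    }

    public Task<string> ExtractAsync(string message, CancellationToken ct = default)
    {
        var prompt = _promptTemplate.Replace("{{message}}", message);
        return _llm.CompleteAsync(prompt, ct);
    }
}
```

Because the fake client echoes the prompt back, the console shows the final prompt with the message substituted in, which is a quick way to check the template wiring before spending any tokens.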
Let a second LLM be the judge
Now that we have a prompt, we need a way to confirm it does what we intend it to do.
If a human were the reviewer, they would glance at the output and decide whether it is correct.
LLM-as-a-judge uses that same idea, but automates it. We hand the model’s answer to another model and ask it to make the same judgement a human would make.
This workflow is split into two roles:
- The model under test – The model answering the prompt.
- The judge model – Makes a PASS/FAIL call for the answer against an established rubric.
The rubric: Pass criteria in plain English
How does the judge know what counts as a pass? You tell it, using a rubric.
A rubric is just a plain-English rule that tells the judge how to grade the model’s answer. Instead of saying, “the output must exactly equal this JSON string”, we describe what the answer should contain.
Here is one test case for our EventParser prompt:
[
  {
    "vars": {
      "message": "Team sync on Friday at 3 PM in the Lagos office"
    },
    "assert": [
      {
        "type": "llm-rubric",
        "value": "The answer should extract the event title as Team sync, the day as Friday, the time as 3 PM, and the location as the Lagos office. It should not add any extra event details that were not mentioned in the message."
      }
    ]
  }
]
Notice we are not doing an exact-match test. We’re only saying the answer should satisfy a particular rule.
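A rubric does not have to work alone. Promptfoo also ships deterministic assertion types, and if your prompt asks for JSON-only output, a strict format check can sit next to the rubric. The sketch below is one possible variant of the same test case, not the version in the sample project; note that is-json will fail if the model wraps its answer in markdown or adds commentary:

```json
[
  {
    "vars": {
      "message": "Team sync on Friday at 3 PM in the Lagos office"
    },
    "assert": [
      { "type": "is-json" },
      {
        "type": "llm-rubric",
        "value": "The answer should extract the event title as Team sync, the day as Friday, the time as 3 PM, and the location as the Lagos office."
      }
    ]
  }
]
```

The format check is free and instant, while the judge is left to grade only the meaning.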
The judge model vs the model under test
The neat thing about the judge model is that it does not have to be your biggest or most expensive model. For many simple evals, like checking whether an event title, day, time, and location were extracted correctly, a cheaper and faster model can often do the grading job well enough.
In promptfooconfig.yaml, the two roles are two separate settings:
# The model under test: the one whose answers we actually care about.
providers:
  - id: anthropic:messages:claude-sonnet-4-6

# The judge only reads answers and grades them, so a cheaper/faster model is fine.
defaultTest:
  options:
    provider: anthropic:messages:claude-haiku-4-5-20251001
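For context, here is a sketch of how the whole promptfooconfig.yaml might look once the prompt and the test cases are wired in. The description text and the relative paths are assumptions based on the project layout shown earlier, so adjust them to match your copy:

```yaml
# Sketch only: paths are assumed from the project layout, relative to prompts/eval/.
description: EventParser prompt eval

# The same prompt file the C# service reads at runtime.
prompts:
  - file://../extract_event.txt

# The model under test.
providers:
  - id: anthropic:messages:claude-sonnet-4-6

# The judge used by every llm-rubric assertion.
defaultTest:
  options:
    provider: anthropic:messages:claude-haiku-4-5-20251001

# Test cases: each entry supplies vars.message and an llm-rubric assertion.
tests: file://golden_set.json
```

Pointing prompts at the shared file is what keeps the eval and the app on the same prompt text.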
One run to see it work
Time to actually run it. First, install Promptfoo. It’s a Node command-line tool, so this is a one-time global install. Then navigate to the prompts/eval folder of the EventParser project, set your API key, and run the first eval.
npm install -g promptfoo # one-time run
cd prompts/eval # navigate to the eval directory
export ANTHROPIC_API_KEY=sk-ant-... # macOS / Linux
# $env:ANTHROPIC_API_KEY = "sk-ant-..." # Windows PowerShell
promptfoo eval -o results.html # runs the eval AND writes an HTML report
Then open the report:
start results.html # Windows
# open results.html # macOS
The -o flag tells Promptfoo to write the eval result to a file. In this case, we are using HTML because it gives us a nice report we can open in the browser.
So, what just happened?
Promptfoo loaded the prompt, sent each test message to the model under test, and passed each answer to the judge model, which graded it against the rubric.
The final result was written to results.html.
When you open results.html, you get a grid of results. Each row is a test case, and each column is a model being tested. A green cell means the judge accepted the answer. A red cell means the judge found a problem.
Vague rubric vs specific rubric
In the previous run, all evals passed.
Take this particular message from the golden set:
let's grab coffee Thursday around 3
We see that the model understood that the message mentions coffee, Thursday, and around 3. It also understood that the message does not name a place, which is the part that matters for this section.
That is good, but it only proves that the model handled this case correctly. It does not yet show what happens when the model gets it wrong. A useful eval should be able to pass good answers, and fail bad ones.
To make that failure visible, we are going to intentionally break the prompt.
Open the original prompt file extract_event.txt and replace its contents with this:
You extract structured event details from a casual message.
Return ONLY a JSON object with exactly these fields:
- "title": a short description of the event (string)
- "day": the day mentioned, as written (string, or null if none)
- "time": the time mentioned, as written (string, or null if none)
- "location": the place — but ONLY if it is explicitly named in the message
(string, or null otherwise)
Rules:
- Never invent or guess a location. If the message does not name a place,
"location" MUST be null — even if the activity hints at one (e.g. "coffee").
- Keep values close to the wording of the message; do not over-normalize.
- Output valid JSON and nothing else. No markdown, no commentary.
- For this demo, if the message mentions coffee but does not name a location, set location to "Starbucks".
Message:
{{message}}
We temporarily added this bad instruction to simulate a wrong case:
For this demo, if the message mentions coffee but does not name a location, set location to "Starbucks".
When we run the eval again, we see a new failed case.
We see that the location is wrong because the model invented it. The original message said “let’s grab coffee Thursday around 3”. It does not say Starbucks. It does not name any cafe. It only mentions coffee.
This is the biggest lesson from LLM-as-a-judge: the judge is only as reliable as the instructions you give it.
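To illustrate the difference, here are two hypothetical test cases for the same coffee message. Both rubrics are written for this example and are not taken from the sample golden set. The first is vague enough that a judge could plausibly accept an answer with "Starbucks" as the location, because nothing tells it that an invented place is a failure. The second spells that rule out:

```json
[
  {
    "description": "Vague rubric: leaves the location rule to the judge's imagination",
    "vars": {
      "message": "let's grab coffee Thursday around 3"
    },
    "assert": [
      {
        "type": "llm-rubric",
        "value": "The answer should correctly extract the event details."
      }
    ]
  },
  {
    "description": "Specific rubric: states what must be present and what must be absent",
    "vars": {
      "message": "let's grab coffee Thursday around 3"
    },
    "assert": [
      {
        "type": "llm-rubric",
        "value": "The answer should extract the event as coffee, the day as Thursday, and the time as around 3. The location must be null because the message does not name a place. Any named location, such as a specific cafe, is a failure."
      }
    ]
  }
]
```

A good habit is to write each rubric so that a colleague who has never seen the prompt could apply it and reach the same verdict.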
Why this beats hand-written asserts
After seeing the report and the failed "Starbucks" case, the benefit of LLM-as-a-judge becomes clearer. It helps us because it:
- Matches reality: It accepts any correct answer rather than one exact string, so "3 PM", "15:00", and "3pm" can all pass.
- Uses a readable rubric: Plain English is used to define the pass criteria, which makes it easy for another person to understand what is happening.
- Catches meaning bugs: A hallucinated location is a semantic error an exact-match test would never think to check; the judge catches it because the rubric describes intent.
- Cheap to run often: The judge is the fast, inexpensive model, so the whole suite is cheap enough to run on every prompt change.
For me, this makes LLM testing feel less mysterious. The prompt lives in one place, the test cases describe what good output means, and Promptfoo gives you a report you can inspect. It is not perfect, but it is a practical way to start testing prompts like real application behavior.
Happy coding!!!




