I was building an eval harness for a retrieval-augmented generation pipeline, and the first faithfulness check I wrote was quietly wrong. It looked reasonable. It ran on every example for free. It just measured the wrong thing, and I only saw it once I started feeding it edge cases on purpose. The way it fails is the same way most RAG tutorials I had copied from fail.
Here is the one-sentence version: token overlap does not measure faithfulness; it measures copy-paste fidelity, and the gap between those two destroys your eval exactly in the cases you care about.
Let me show you what I mean, because “use a better metric” is the kind of advice that sounds smart and helps nobody.
What faithfulness is supposed to mean
Faithfulness is a different axis from relevance. Relevance asks whether the answer addresses the user’s question. Faithfulness asks something narrower: is the answer grounded in what the retrieved context actually says? You can be perfectly faithful and useless (you quoted the context exactly but answered nothing the user asked), or relevant and unfaithful (you answered the question correctly using facts you invented). A real eval needs both, measured separately. Collapsing them is the first mistake, but it is not the one I want to talk about.
What I actually built
When you reach for a quick faithfulness signal, the tutorial answer is almost always some flavor of token overlap. Count how many words of the answer appear in the retrieved context. High overlap means the answer came from the context, so it must be grounded. Here is roughly what I had:
def faithfulness(answer: str, context: str) -> float:
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    overlap = answer_tokens & context_tokens
    return len(overlap) / len(answer_tokens)
Clean, fast, no model call, runs on every example for free. That is exactly why it is everywhere. It is also wrong in two opposite directions at the same time, which is the part I missed until I went looking.
Failure mode one: stopwords inflate the score (false positive)
Look at what dominates that overlap set: “the”, “is”, “of”, “a”, “to”, “and”. Function words. They make up a large share of almost any English sentence, and of your context too, so many of them match no matter what the answer claims. The score goes up because of grammar, not because of grounding.
answer = "The device is covered for a period of thirty-six months." # hallucinated number
context = "The device is covered for a period of twenty-four months."
# shared tokens: the, device, is, covered, for, a, period, of, months.
# only "thirty-six" vs "twenty-four" differs
# score = 9 / 10 = 0.9, logged as faithful
The model hallucinated the single most important token in the sentence, the actual number, and the metric reported ninety percent faithful because the scaffolding words all matched. In a warranty bot, a contract assistant, anything where the payload is a number or a name, this is the failure that gets you sued. The metric is blind to it because the meaningful tokens are a tiny fraction of the total and stopwords drown them out.
Stripping stopwords helps a little, but it is a patch on a deeper problem, which is failure mode two.
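To make "helps a little" concrete, here is a minimal sketch of the same overlap metric with stopwords removed, run on the warranty example. The STOPWORDS set and the name content_overlap are assumptions for illustration; a real stopword list would be much longer.

```python
# Illustrative stopword list (an assumption); real lists are far longer.
STOPWORDS = {"the", "is", "of", "a", "to", "and", "for"}


def content_overlap(answer: str, context: str) -> float:
    """Token overlap computed over non-stopword tokens only."""
    answer_tokens = set(answer.lower().split()) - STOPWORDS
    context_tokens = set(context.lower().split()) - STOPWORDS
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)


answer = "The device is covered for a period of thirty-six months."
context = "The device is covered for a period of twenty-four months."

score = content_overlap(answer, context)
print(score)  # 0.8: four of the five remaining answer tokens still match
```

The score drops from 0.9 to 0.8, which is still comfortably "faithful" by any reasonable threshold, even though the one token that carries the claim is wrong.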
Failure mode two: synonyms tank the score (false negative)
Now flip it. The model does its job well. It reads the context and paraphrases instead of regurgitating, because that is what good answers do.
answer = "Shipping is free for orders above fifty dollars."
context = "We provide complimentary delivery on purchases over $50."
# shared tokens: none, with or without stopword removal
# free != complimentary, shipping != delivery, orders != purchases, above != over
# score = 0 / 8 = 0.0, logged as unfaithful
This answer is perfectly grounded. Every claim traces back to the context. The metric flags it as a hallucination because the model used a thesaurus. So now your eval punishes the exact behavior you want, fluent paraphrase, and rewards the behavior you do not want, verbatim copying. Optimize against this metric and you will train or prompt your system toward parroting, which also happens to be the behavior most likely to leak verbatim source text you did not want exposed.
These two failures are not random noise. They are structural, and they point in opposite directions on the same number. Stopwords push it up when grounding is bad. Synonyms push it down when grounding is good. A metric that is wrong in both directions is not noisy, it is meaningless, and an average over a thousand examples will look perfectly stable while telling you nothing.
What I changed
The fix is to stop comparing strings and start comparing claims. The approach I am moving to decomposes the answer into atomic factual statements, then checks each one against the context for entailment rather than word match.
# 1. split the answer into atomic claims
claims = extract_claims(answer)  # e.g. "shipping is free", "threshold is $50"
# 2. for each claim, ask: does the context support this?
#    semantic entailment, not token overlap
supported = [entails(context, c) for c in claims]
# 3. score is the fraction of supported claims; guard against an answer with no claims
faithfulness_score = sum(supported) / len(claims) if claims else 0.0
The entails step is where the real work lives. A cross-encoder NLI model handles synonyms and paraphrase because it scores meaning, not surface form. An LLM-as-judge prompt does too, at higher cost and with its own calibration headaches. Either one fixes both failure modes, because a context that says “complimentary delivery” entails the claim “free shipping”, and a context that says “covered for twenty-four months” does not entail the claim “covered for thirty-six months”. The number now matters again, and the paraphrase no longer gets punished.
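One way to keep the scoring logic testable while the model choice is still open is to pass the claim extractor and the entailment check in as plain callables. The sketch below assumes that design; claim_faithfulness, toy_extract, and toy_entails are hypothetical names, and the toy versions are hard-coded stand-ins for a real model, not an implementation of one.

```python
from typing import Callable, List


def claim_faithfulness(
    answer: str,
    context: str,
    extract_claims: Callable[[str], List[str]],
    entails: Callable[[str, str], bool],
) -> float:
    """Fraction of the answer's claims that the context supports."""
    claims = extract_claims(answer)
    if not claims:
        # Assumption: an answer with no checkable claims scores zero.
        return 0.0
    supported = [entails(context, claim) for claim in claims]
    return sum(supported) / len(claims)


def toy_extract(answer: str) -> List[str]:
    """Stand-in extractor: one claim per sentence, split on periods."""
    return [part.strip() for part in answer.split(".") if part.strip()]


# Stand-in for a model's judgments: claims treated as supported by the context.
SUPPORTED_CLAIMS = {"Shipping is free for orders above fifty dollars"}


def toy_entails(context: str, claim: str) -> bool:
    """Stand-in entailment check: a lookup table, not real inference."""
    return claim in SUPPORTED_CLAIMS


context = "We provide complimentary delivery on purchases over $50."
answer = (
    "Shipping is free for orders above fifty dollars. "
    "Returns are accepted within ninety days."
)

score = claim_faithfulness(answer, context, toy_extract, toy_entails)
print(score)  # 0.5: the paraphrase is supported, the invented returns policy is not
```

Swapping toy_entails for an NLI model or a judge prompt changes nothing in claim_faithfulness, which makes it easier to compare the two backends on the same examples.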
One thing worth being honest about: in practice, extract_claims is usually an LLM call itself, which means you have introduced a second model that can fail. Under-splitting hides compound errors. Over-splitting creates claims trivial enough that almost anything entails them. You have traded one hard problem for another, and it is worth knowing that going in.
It is slower and it costs money per example, and that tradeoff is real. But a cheap metric that lies is not cheaper than an expensive one that does not. The dashboard showing ninety percent faithful while the hallucinated number slipped through was enough to make me change it.
The takeaway
If your faithfulness eval is built on token or n-gram overlap, it is rewarding copy-paste and calling it grounding. Check it against two cases before you trust it: one answer that hallucinates a single critical token but keeps the sentence frame, and one answer that paraphrases the context correctly. If your metric does not flag the first and pass the second, it is not measuring faithfulness, and your dashboard is green for the wrong reason.
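The two-case check can be written as a small harness that takes any metric with an (answer, context) signature. This is a sketch under assumptions: the 0.5 pass threshold and the names sanity_check and overlap_metric are illustrative, and overlap_metric is a copy of the token-overlap function from earlier so the block runs on its own.

```python
from typing import Callable, Dict

Metric = Callable[[str, str], float]


def overlap_metric(answer: str, context: str) -> float:
    """The token-overlap metric from earlier in the post."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def sanity_check(metric: Metric, threshold: float = 0.5) -> Dict[str, bool]:
    """Run the two probe cases; both values should be True for a usable metric."""
    hallucinated = metric(
        "The device is covered for a period of thirty-six months.",
        "The device is covered for a period of twenty-four months.",
    )
    paraphrased = metric(
        "Shipping is free for orders above fifty dollars.",
        "We provide complimentary delivery on purchases over $50.",
    )
    return {
        "flags_hallucinated_number": hallucinated < threshold,
        "passes_correct_paraphrase": paraphrased >= threshold,
    }


result = sanity_check(overlap_metric)
print(result)
# {'flags_hallucinated_number': False, 'passes_correct_paraphrase': False}
```

Token overlap fails both probes. Two cases prove nothing about a metric that passes them, but a metric that fails either one is not worth putting on a dashboard.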
The open problem I’m stuck on is the claim-extraction step, since over-splitting creates claims too trivial to falsify and under-splitting hides compound errors. If you have run claim-level faithfulness on real traffic, I would like to know where it broke for you, because I have only tested it on cases I could think up myself.