How I benchmarked a 100% local RAG pipeline to 9/9 (zero API keys)

June 8, 2026

Most “chat with your documents” demos work in an afternoon. Then you hit the last
20%: retrieval that misses the right passage, an LLM that confidently makes things
up, a reranker that wrecks your latency, chunking you re-tune ten times. And if
your documents are sensitive — legal, medical, internal — you can’t just paste
them into a cloud API.

So I built a fully local RAG pipeline and, more importantly, a reproducible
benchmark to prove it actually works. Everything runs on your own machine. No
OpenAI, no Anthropic, no Cohere. Here’s the stack, the numbers, and what actually
moved them.

The stack (all local, permissively licensed)

Embeddings: Qwen3-Embedding-0.6B (bge-m3 as a fallback)
Vector store: Qdrant in local/embedded mode (no Docker)
Retrieval: dense + sparse BM25, fused with Reciprocal Rank Fusion (RRF)
Reranker: a cross-encoder (MiniLM) over the top-k
LLM: Gemma3:4b via Ollama
Eval judge: the same local LLM (so even evaluation makes zero external calls)
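
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk ids. The function name `rrf_fuse`, the chunk ids, and the constant `k=60` (a commonly used default) are assumptions for illustration; the post does not say which value the pipeline uses.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=5):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every id it contains, so a
    chunk ranked well by both retrievers beats one ranked well by only one.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is stable.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in fused[:top_n]]


# Hypothetical results from the dense and BM25 retrievers for one query.
dense = ["c3", "c1", "c7"]
sparse = ["c1", "c9", "c3"]
fused = rrf_fuse([dense, sparse])
print(fused)
```

Because RRF only looks at ranks, it needs no score normalization between the dense and sparse retrievers, which is why it is a convenient default for hybrid search.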

The targets (from current RAG benchmarks)

I wanted pass/fail thresholds, not vibes:

Metric	Target
Hit Rate@5	≥ 0.90
MRR	≥ 0.75
Context Precision@3	≥ 0.70
Context Recall	≥ 0.85
Faithfulness	≥ 0.90
Answer Relevancy	≥ 0.85
Retrieval latency (p50)	≤ 1.0s
End-to-end (p50)	≤ 8.0s
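
A pass/fail gate like this can be sketched in a few lines. The example below computes the two retrieval-only metrics and checks them against the thresholds in the table. It assumes exactly one gold chunk per question, and the key names, helper names, and toy data are all hypothetical, not taken from the actual eval suite.

```python
import operator

# Thresholds copied from the table above; the key names are hypothetical.
TARGETS = {
    "hit_rate@5": (operator.ge, 0.90),
    "mrr": (operator.ge, 0.75),
    "context_precision@3": (operator.ge, 0.70),
    "context_recall": (operator.ge, 0.85),
    "faithfulness": (operator.ge, 0.90),
    "answer_relevancy": (operator.ge, 0.85),
    "retrieval_p50_s": (operator.le, 1.0),
    "end_to_end_p50_s": (operator.le, 8.0),
}


def hit_rate_at_k(retrieved, gold, k=5):
    """Fraction of questions whose gold chunk appears in the top k results."""
    hits = sum(1 for ids, g in zip(retrieved, gold) if g in ids[:k])
    return hits / len(retrieved)


def mrr(retrieved, gold):
    """Mean reciprocal rank of the gold chunk (0 when it is not retrieved)."""
    total = 0.0
    for ids, g in zip(retrieved, gold):
        if g in ids:
            total += 1.0 / (ids.index(g) + 1)
    return total / len(retrieved)


def check_targets(results):
    """Map each measured metric to True (pass) or False (fail)."""
    return {
        name: TARGETS[name][0](value, TARGETS[name][1])
        for name, value in results.items()
    }


# Toy data: four questions, ranked chunk ids per question, one gold id each.
retrieved = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["x", "y", "z"]]
gold = ["a", "e", "i", "q"]

results = {"hit_rate@5": hit_rate_at_k(retrieved, gold), "mrr": mrr(retrieved, gold)}
verdict = check_targets(results)
print(results, verdict)
```

Storing the comparison operator next to each threshold keeps the latency rows (lower is better) in the same table as the quality rows (higher is better).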

What actually moved the numbers

Starting from a naive dense-only baseline (5/9 passing), four changes did the work:

Hybrid + RRF took Hit Rate@5 from 0.90 (dense only) to 1.0. Keyword
matching catches what embeddings miss, and vice versa.

The reranker took Context Precision@3 from 0.45 to 0.89, making it the single
biggest precision lever. Cross-encoders are slow, so it only runs on the top-k.

A strict prompt (“answer ONLY from the context; if it’s not there, say you
don’t know”) plus temperature 0.1 took Faithfulness from 0.62 to 1.0. Most
“hallucination” is really a prompt + retrieval problem.

Putting Ollama on the GPU cut end-to-end p50 from 14s to 6.5s.
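
The strict-prompt change could look roughly like the sketch below, which builds a grounded prompt and a request body for Ollama's `/api/generate` endpoint with temperature 0.1. The template wording beyond the quoted instruction, the helper names, the sample question, and the default localhost URL are assumptions; the post does not show the actual prompt.

```python
import json
import urllib.request

# Hypothetical template built around the instruction quoted in the post.
STRICT_TEMPLATE = (
    "Answer ONLY from the context below. "
    "If the answer is not in the context, say you don't know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def build_prompt(question, chunks):
    """Number the retrieved chunks and place them ahead of the question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return STRICT_TEMPLATE.format(context=context, question=question)


def build_request(question, chunks, model="gemma3:4b", temperature=0.1):
    """Request body for Ollama's /api/generate (non-streaming)."""
    return {
        "model": model,
        "prompt": build_prompt(question, chunks),
        "stream": False,
        "options": {"temperature": temperature},
    }


def generate(payload, url="http://localhost:11434/api/generate"):
    """Send the request to a locally running Ollama server (assumed URL)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Building the payload needs no server; generate(payload) needs Ollama running.
payload = build_request(
    "What is the notice period?",
    ["The notice period is 30 days.", "Either party may terminate in writing."],
)
print(payload["prompt"])
```

Keeping prompt construction separate from the network call means the prompt can be unit-tested, and diffed between runs, without loading the model.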

Results (validated at 3 scales)

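To rule out “it only works because the corpus is tiny”, I ran it on 42, 124, and
274 questions with chunk-level ground truth. Retrieval and relevancy scores
stayed flat or rose as the corpus grew 16×, while Faithfulness dipped slightly
(1.00 to 0.97) but stayed above its 0.90 target: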

Metric	42Q	124Q	274Q
Hit Rate@5	1.00	1.00	1.00
MRR	0.95	0.98	0.98
Context Precision@3	0.89	0.92	0.93
Faithfulness	1.00	0.99	0.97
Answer Relevancy	0.88	0.90	0.92

9/9 at every scale.
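
As a quick arithmetic check on the reported metrics, the sketch below computes each metric's change from the 42Q run to the 274Q run and confirms it clears its target at every scale. The scores and targets are copied from the two tables in this post; the `summarize` helper is hypothetical.

```python
# Targets and scores copied from the tables in this post.
TARGETS = {
    "Hit Rate@5": 0.90,
    "MRR": 0.75,
    "Context Precision@3": 0.70,
    "Faithfulness": 0.90,
    "Answer Relevancy": 0.85,
}

# Columns: 42Q, 124Q, 274Q.
SCORES = {
    "Hit Rate@5": (1.00, 1.00, 1.00),
    "MRR": (0.95, 0.98, 0.98),
    "Context Precision@3": (0.89, 0.92, 0.93),
    "Faithfulness": (1.00, 0.99, 0.97),
    "Answer Relevancy": (0.88, 0.90, 0.92),
}


def summarize(scores, targets):
    """Per metric: change from smallest to largest run, and pass at all scales."""
    summary = {}
    for metric, values in scores.items():
        summary[metric] = {
            "delta": round(values[-1] - values[0], 2),
            "passes_everywhere": all(v >= targets[metric] for v in values),
        }
    return summary


summary = summarize(SCORES, TARGETS)
for metric, row in summary.items():
    print(f"{metric}: {row['delta']:+.2f}, passes at all scales: {row['passes_everywhere']}")
```

Faithfulness is the only reported metric that moves down (by 0.03), and it still clears its 0.90 target in all three runs.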

Lessons

Measure first. Without an eval harness, you optimize blind. The retrieval
metrics alone (no LLM) run in seconds and catch most regressions.

“Hallucination” is usually retrieval. If faithfulness is fine but relevancy
is low, your problem is upstream in retrieval, not the model.

Local is a feature, not a compromise. For sensitive data it’s the only
option, and a small local stack hits production-grade numbers in 2026.

Want the whole thing done for you?

I packaged the full pipeline — code, the eval suite, 13 input formats, metadata
filters, a CLI and a Streamlit UI, 60+ tests, docs — as a one-time download so
you can skip the weeks of tuning: https://buy.polar.sh/polar_cl_XV4ksHBnFjkEGMnKLzFc2HFB16agYFEORQ0Ov3oo7HK

Either way, happy to answer questions about the stack or the eval methodology in
the comments.
