How I benchmarked a 100% local RAG pipeline to 9/9 (zero API keys)

Most “chat with your documents” demos work in an afternoon. Then you hit the last
20%: retrieval that misses the right passage, an LLM that confidently makes things
up, a reranker that wrecks your latency, chunking you re-tune ten times. And if
your documents are sensitive — legal, medical, internal — you can’t just paste
them into a cloud API.

So I built a fully local RAG pipeline and, more importantly, a reproducible
benchmark
to prove it actually works. Everything runs on the machine. No
OpenAI, no Anthropic, no Cohere. Here’s the stack, the numbers, and what actually
moved them.

The stack (all local, permissively licensed)

  • Embeddings: Qwen3-Embedding-0.6B (bge-m3 as a fallback)
  • Vector store: Qdrant in local/embedded mode (no Docker)
  • Retrieval: dense + sparse BM25, fused with Reciprocal Rank Fusion (RRF)
  • Reranker: a cross-encoder (MiniLM) over the top-k
  • LLM: Gemma3:4b via Ollama
  • Eval judge: the same local LLM (so even evaluation makes zero external calls)

The targets (from current RAG benchmarks)

I wanted pass/fail thresholds, not vibes:

Metric Target
Hit Rate@5 ≥ 0.90
MRR ≥ 0.75
Context Precision@3 ≥ 0.70
Context Recall ≥ 0.85
Faithfulness ≥ 0.90
Answer Relevancy ≥ 0.85
Retrieval latency (p50) ≤ 1.0s
End-to-end (p50) ≤ 8.0s

What actually moved the numbers

Starting from a naive dense-only baseline (5/9 passing), four changes did the work:

  1. Hybrid + RRF took Hit Rate@5 from 0.90 (dense only) to 1.0. Keyword
    matching catches what embeddings miss, and vice versa.
  2. The reranker took Context Precision@3 from 0.45 → 0.89. The single
    biggest precision lever. Cross-encoders are slow, so it only runs on the top-k.
  3. A strict prompt (“answer ONLY from the context; if it’s not there, say you
    don’t know”) plus temperature 0.1 took Faithfulness from 0.62 → 1.0. Most
    “hallucination” is really a prompt + retrieval problem.
  4. Putting Ollama on the GPU cut end-to-end p50 from 14s → 6.5s.

Results (validated at 3 scales)

To rule out “it only works because the corpus is tiny”, I ran it on 42, 124, and
274 questions with chunk-level ground truth. Scores stayed flat-to-rising as the
corpus grew 16×:

Metric 42Q 124Q 274Q
Hit Rate@5 1.00 1.00 1.00
MRR 0.95 0.98 0.98
Context Precision@3 0.89 0.92 0.93
Faithfulness 1.00 0.99 0.97
Answer Relevancy 0.88 0.90 0.92

9/9 at every scale.

Lessons

  • Measure first. Without an eval harness, you optimize blind. The retrieval
    metrics alone (no LLM) run in seconds and catch most regressions.
  • “Hallucination” is usually retrieval. If faithfulness is fine but relevancy
    is low, your problem is upstream in retrieval, not the model.
  • Local is a feature, not a compromise. For sensitive data it’s the only
    option, and a small local stack hits production-grade numbers in 2026.

Want the whole thing done for you?

I packaged the full pipeline — code, the eval suite, 13 input formats, metadata
filters, a CLI and a Streamlit UI, 60+ tests, docs — as a one-time download so
you can skip the weeks of tuning: https://buy.polar.sh/polar_cl_XV4ksHBnFjkEGMnKLzFc2HFB16agYFEORQ0Ov3oo7HK

Either way, happy to answer questions about the stack or the eval methodology in
the comments.

Total
0
Shares
Leave a Reply

Your email address will not be published. Required fields are marked *

Previous Post

Claude setup planner

Related Posts