Most “chat with your documents” demos work in an afternoon. Then you hit the last
20%: retrieval that misses the right passage, an LLM that confidently makes things
up, a reranker that wrecks your latency, chunking you re-tune ten times. And if
your documents are sensitive — legal, medical, internal — you can’t just paste
them into a cloud API.
So I built a fully local RAG pipeline and, more importantly, a reproducible
benchmark to prove it actually works. Everything runs on the machine. No
OpenAI, no Anthropic, no Cohere. Here’s the stack, the numbers, and what actually
moved them.
The stack (all local, permissively licensed)
- Embeddings: Qwen3-Embedding-0.6B (bge-m3 as a fallback)
- Vector store: Qdrant in local/embedded mode (no Docker)
- Retrieval: dense + sparse BM25, fused with Reciprocal Rank Fusion (RRF)
- Reranker: a cross-encoder (MiniLM) over the top-k
- LLM: Gemma3:4b via Ollama
- Eval judge: the same local LLM (so even evaluation makes zero external calls)
The targets (from current RAG benchmarks)
I wanted pass/fail thresholds, not vibes:
| Metric | Target |
|---|---|
| Hit Rate@5 | ≥ 0.90 |
| MRR | ≥ 0.75 |
| Context Precision@3 | ≥ 0.70 |
| Context Recall | ≥ 0.85 |
| Faithfulness | ≥ 0.90 |
| Answer Relevancy | ≥ 0.85 |
| Retrieval latency (p50) | ≤ 1.0s |
| End-to-end (p50) | ≤ 8.0s |
What actually moved the numbers
Starting from a naive dense-only baseline (5/9 passing), four changes did the work:
-
Hybrid + RRF took Hit Rate@5 from 0.90 (dense only) to 1.0. Keyword
matching catches what embeddings miss, and vice versa. -
The reranker took Context Precision@3 from 0.45 → 0.89. The single
biggest precision lever. Cross-encoders are slow, so it only runs on the top-k. -
A strict prompt (“answer ONLY from the context; if it’s not there, say you
don’t know”) plus temperature 0.1 took Faithfulness from 0.62 → 1.0. Most
“hallucination” is really a prompt + retrieval problem. - Putting Ollama on the GPU cut end-to-end p50 from 14s → 6.5s.
Results (validated at 3 scales)
To rule out “it only works because the corpus is tiny”, I ran it on 42, 124, and
274 questions with chunk-level ground truth. Scores stayed flat-to-rising as the
corpus grew 16×:
| Metric | 42Q | 124Q | 274Q |
|---|---|---|---|
| Hit Rate@5 | 1.00 | 1.00 | 1.00 |
| MRR | 0.95 | 0.98 | 0.98 |
| Context Precision@3 | 0.89 | 0.92 | 0.93 |
| Faithfulness | 1.00 | 0.99 | 0.97 |
| Answer Relevancy | 0.88 | 0.90 | 0.92 |
9/9 at every scale.
Lessons
-
Measure first. Without an eval harness, you optimize blind. The retrieval
metrics alone (no LLM) run in seconds and catch most regressions. -
“Hallucination” is usually retrieval. If faithfulness is fine but relevancy
is low, your problem is upstream in retrieval, not the model. -
Local is a feature, not a compromise. For sensitive data it’s the only
option, and a small local stack hits production-grade numbers in 2026.
Want the whole thing done for you?
I packaged the full pipeline — code, the eval suite, 13 input formats, metadata
filters, a CLI and a Streamlit UI, 60+ tests, docs — as a one-time download so
you can skip the weeks of tuning: https://buy.polar.sh/polar_cl_XV4ksHBnFjkEGMnKLzFc2HFB16agYFEORQ0Ov3oo7HK
Either way, happy to answer questions about the stack or the eval methodology in
the comments.