Two Pre-Registered Benchmarks for Audit-Native RAG: RAB (EU AI Act 10/12/19) + LRB (Time-Travel Retrieval)

June 14, 2026

Most RAG demos answer “what’s the right chunk?” Very few can answer the
two questions a regulator or an auditor will actually ask:

1. Replay this decision: show me the exact, complete record of how
   this answer was produced.
2. Reconstruct the past: what did your system know at the moment it
   answered, not what it knows now?

I got tired of hand-waving at both, so I shipped two pre-registered,
deterministic benchmarks alongside JAMES,
my local-first, audit-native Graph-RAG. Pre-registered means the metrics,
scenarios, and decision rules were locked before the numbers came in —
no post-hoc story-fitting.

RAB — Replayable-Audit Benchmark

RAB measures whether your audit trail is good enough to replay a
decision, with three deterministic metrics:

Metric                      What it checks                                     EU AI Act
AC (Audit Completeness)     Is every decision-relevant event logged?           Art. 10
RF (Replay Fidelity)        Can you re-derive the answer from the log alone?   Art. 12
PC (Provenance Coverage)    Does every claim trace to a source?                Art. 19

The three metrics map one-to-one to EU AI Act Articles 10 (data and data
governance), 12 (record-keeping), and 19 (automatically generated logs),
obligations that apply from 2026-08-02 (per Article 113).
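
To make the three definitions concrete, here is a minimal Python sketch of how AC, RF, and PC could be scored for a single decision. It is not the RAB implementation: the `DecisionRecord` shape, its field names, and the exact-match rule for replay are assumptions for illustration only, and the preprint holds the real definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class DecisionRecord:
    """Hypothetical record of one answered query (not the RAB schema)."""
    required_events: Set[str]        # events the scenario says must be logged
    logged_events: Set[str]          # events actually found in the audit log
    original_answer: str             # answer given at decision time
    replayed_answer: Optional[str]   # answer re-derived from the log alone, if any
    claims: Dict[str, Optional[str]] = field(default_factory=dict)  # claim -> source id


def audit_completeness(rec: DecisionRecord) -> float:
    """AC: fraction of required events that appear in the log."""
    if not rec.required_events:
        return 1.0
    return len(rec.required_events & rec.logged_events) / len(rec.required_events)


def replay_fidelity(rec: DecisionRecord) -> float:
    """RF: 1.0 if the replayed answer equals the original exactly, else 0.0."""
    return 1.0 if rec.replayed_answer == rec.original_answer else 0.0


def provenance_coverage(rec: DecisionRecord) -> float:
    """PC: fraction of claims that carry a source id."""
    if not rec.claims:
        return 1.0
    traced = sum(1 for source in rec.claims.values() if source is not None)
    return traced / len(rec.claims)


# Illustrative data only: a system that logs the query and nothing else.
default_logging = DecisionRecord(
    required_events={"query", "retrieval", "prompt", "answer"},
    logged_events={"query"},
    original_answer="Refunds are accepted within 30 days.",
    replayed_answer=None,            # nothing to replay from
    claims={"refund window is 30 days": None},
)
print(audit_completeness(default_logging))   # 0.25
print(replay_fidelity(default_logging))      # 0.0
print(provenance_coverage(default_logging))  # 0.0
```

The toy record shows the same shape of failure as a default-logging baseline: a partial event trail yields a non-zero AC while RF and PC stay at zero.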

Scenario S1 result:

                 AC      RF      PC
JAMES          1.000   1.000   1.000
Baseline-0     0.275   0.000   0.000   (vanilla default-logging)

The gap is the whole point. “We have logs” (AC 0.275) is not the same as
“we can replay the decision” (RF 0). Default application logging gets you
a partial event trail and zero replay/provenance — which is exactly the
failure mode an Article 12 audit would surface.

LRB — Lifecycle Retrieval Benchmark

RAG facts go stale. A policy is superseded, a price changes, a spec is
revised. LRB asks: when you query as of a point in time, do you
retrieve the fact that was valid then, or whatever overwrote it?

Three systems compared:

V — Vanilla: no time handling.
N — Naive-supersede: newest fact wins.
J — JAMES: validity-window retrieval (reconstruct_graph_at(t)).
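
The three strategies above can be sketched on a toy fact store. This is an illustration under stated assumptions, not the JAMES code: the `Fact` class, the half-open validity window `[valid_from, valid_to)`, and the "first stored match" stand-in for vanilla retrieval are all hypothetical, and `as_of` only mimics the idea behind `reconstruct_graph_at(t)`, not its actual behavior.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """Hypothetical fact with a validity window (not the JAMES schema)."""
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still valid


def vanilla(facts: List[Fact], key: str) -> Optional[str]:
    """V: no time handling; returns the first stored match, valid or not."""
    return next((f.value for f in facts if f.key == key), None)


def naive_supersede(facts: List[Fact], key: str) -> Optional[str]:
    """N: the newest fact wins, regardless of the time being asked about."""
    matches = [f for f in facts if f.key == key]
    return max(matches, key=lambda f: f.valid_from).value if matches else None


def as_of(facts: List[Fact], key: str, t: date) -> Optional[str]:
    """J-style: return the fact whose window [valid_from, valid_to) contains t."""
    for f in facts:
        if f.key == key and f.valid_from <= t and (f.valid_to is None or t < f.valid_to):
            return f.value
    return None


# Illustrative data only: a refund policy that was superseded on 2025-01-01.
facts = [
    Fact("refund_window", "30 days", date(2024, 1, 1), date(2025, 1, 1)),
    Fact("refund_window", "14 days", date(2025, 1, 1)),
]
print(vanilla(facts, "refund_window"))                    # 30 days
print(naive_supersede(facts, "refund_window"))            # 14 days
print(as_of(facts, "refund_window", date(2024, 6, 1)))    # 30 days
print(as_of(facts, "refund_window", date(2025, 6, 1)))    # 14 days
```

In this toy store, V and N each return a single fixed answer, so each is right for only one of the two query dates, while the validity-window lookup follows the query time.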

The R@1 ordering V < N < J holds across 4 model families × 4 scale
points (a 12.5× scale span) — time-aware retrieval beats both naive
overwrite and no time-handling at every scale, not just one lucky cell.

At publication scale (S3):

        R@1
V       0.502
N       0.721
J       0.845
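
For readers unfamiliar with the metric, R@1 (recall at rank 1) is the share of queries whose top-ranked result is the correct one. The sketch below assumes one gold item per query and exact string matching; the function name and input shapes are illustrative, and LRB's precise scoring rule is the one given in the preprint.

```python
from typing import Sequence


def recall_at_1(ranked_results: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Fraction of queries whose top-ranked result equals the gold item."""
    if len(ranked_results) != len(gold):
        raise ValueError("one ranked list per gold label is required")
    if not gold:
        return 0.0
    hits = sum(
        1 for ranked, expected in zip(ranked_results, gold)
        if ranked and ranked[0] == expected   # an empty ranking counts as a miss
    )
    return hits / len(gold)


# Illustrative data only: 4 queries, the top hit is correct for 3 of them.
ranked = [["f1", "f2"], ["f7", "f3"], ["f4"], ["f9", "f5"]]
gold = ["f1", "f3", "f4", "f9"]
print(recall_at_1(ranked, gold))  # 0.75
```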

How to run it yourself

Everything is local — Ollama (gemma4:e4b default) + BAAI/bge-m3
embeddings + ChromaDB. No cloud LLM account.

git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
cp .env.example .env
pip install -r requirements.txt
ollama pull gemma4:e4b
# benchmark runners live in scripts/research/ (lrb_run*.py, rab_*)

Honest framing

These are benchmarks, not a victory lap. JAMES hitting 1.0/1.0/1.0 on a
scenario I designed is a starting line, not proof of general
superiority — the value is that the scenarios, metrics, and baselines are
public and deterministic, so you can run them, disagree, and beat the
numbers.

📄 Preprints (RAB 10pg, LRB 11pg) + Zenodo DOI: 10.5281/zenodo.20652679
💻 Code (MIT, OpenSSF Best Practices passing):
https://github.com/Hashevolution/James-RAG-Evol

Feedback I’d value most: (a) does the AC/RF/PC ↔ Art. 10/12/19 mapping
hold up under your reading of the text? (b) is “newest wins” the right
Naive-supersede baseline for LRB, or is there a stronger one I should add?
