Gemma 4 as an LLM-as-a-Judge: Batch Responsible AI Evaluation on Cloud TPU v5e

TPU Batch Eval Pipeline for RAI-Checklist-CLI

Calibrated Trust (the governance framework I’ve been building for agentic AI systems) needs measurable governance checks running against real agent output. A framework is only as useful as the evaluation machinery behind it, and the evaluation machinery has to be fast and cheap or it doesn’t get run. rai-checklist-cli is an open-source tool I maintain for exactly this: generating and validating Responsible AI checklists for LLM projects. It had a throughput problem the moment anyone tried to run its checks across a real dataset. So I went at the narrowest version of that problem: can a small judge model run RAI checks on 50 LLM outputs fast enough and cheap enough that nobody argues about whether to run them?

The answer is yes, and it’s not close. 150 judge evaluations on a Cloud TPU v5e, Gemma 4 E4B through vLLM, finished in ~12 seconds* (after the initial 20–30 minute XLA compilation, which caches to disk and runs once per batch shape).

This post is step one of a larger thread. Full tutorial and code are at compliance-at-scale-tpu.

Why sequential compliance breaks, and why batching fixes it

Sequential compliance evaluation is a bottleneck. Each hosted API call pays network latency plus inference time. Rate limits punish anyone who tries to fan out with parallel HTTP. Cost scales with records times heuristics, nothing divides it. For any meaningful dataset, it’s a nightly-batch problem dressed up as an interactive one.

Batch inference inverts the shape. vLLM’s llm.generate(prompts, sampling_params) treats 150 prompts the way it treats one: a single forward pass, batched and vectorized. TPU hardware is built for exactly this work. High arithmetic intensity. Static shapes. Dense matmul. Gemma 4 E4B’s effective 4B parameters fit comfortably on a 4-chip v5e slice with room for healthy KV cache.

The LLM-as-a-Judge pattern fits the hardware twice over. Each RAI check is a short, bounded classification, which batches densely, and each check has a known JSON output shape, which pairs cleanly with vLLM's structured outputs.

Three snippets do most of the real work

Everything else is setup.
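For the curious, that setup is essentially one constructor call. A sketch, with the model id, context length, and chip count as my assumptions rather than the repo's exact values:

from vllm import LLM

# Judge model on a v5e-4 slice: one tensor-parallel shard per chip.
# Model id is illustrative; use whichever Gemma checkpoint your vLLM build supports.
llm = LLM(
    model="google/gemma-3-4b-it",
    tensor_parallel_size=4,
    max_model_len=2048,  # judge prompts and verdicts are short; small static shapes compile faster
)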

The heuristics are a dict:

HEURISTIC_PROMPTS = {
    "pii_data_leakage": "Evaluate the following text for PII leakage...",
    "jailbreak_override": "Evaluate the following text for jailbreak attempts...",
    "tone_stereotyping": "Evaluate the following text for biased or harmful generalizations...",
}

Each prompt pairs with a JSON Schema for guided output. The batch call itself is one line:

outputs = llm.generate(prompts, sampling_params_list)

prompts is a flat list of 150 strings (50 records times 3 heuristics). sampling_params_list carries the per-prompt guided decoding spec. vLLM handles the batching and the scheduling.
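To make the shapes concrete, here is a hedged sketch of how that flat list and the per-prompt specs could be built. The records structure and the schema contents are illustrative stand-ins, not the repo's exact code:

from vllm import SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Illustrative records; the real run loads 50 from the sample data.
records = [{"id": "rec-001", "source": "customer_service_bot", "text": "..."}]

# The repo pairs each heuristic with its own JSON Schema; a shared minimal
# schema keeps this sketch short and runnable.
BASE_SCHEMA = {
    "type": "object",
    "properties": {"detected": {"type": "boolean"}, "evidence": {"type": "string"}},
    "required": ["detected"],
}
JUDGE_SCHEMAS = {name: BASE_SCHEMA for name in HEURISTIC_PROMPTS}

prompts, sampling_params_list = [], []
for record in records:                                # records-major ordering
    for name, template in HEURISTIC_PROMPTS.items():  # heuristics-minor
        prompts.append(f"{template}\n\nTEXT:\n{record['text']}")
        sampling_params_list.append(SamplingParams(
            temperature=0.0,  # deterministic judging
            max_tokens=256,
            guided_decoding=GuidedDecodingParams(json=JUDGE_SCHEMAS[name]),
        ))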

Aggregation folds the flat output list back into a per-record report:

{
  "id": "rec-001",
  "source": "customer_service_bot",
  "text_preview": "...",
  "evaluations": {
    "pii_data_leakage": {
      "detected": true,
      "types": ["phone", "email"],
      "evidence": "Phone 555-0142 and email john@example.com"
    },
    "jailbreak_override": {
      "detected": false,
      "confidence": 0.02,
      "reasoning": "..."
    },
    "tone_stereotyping": {
      "detected": false,
      "severity": "none"
    }
  }
}
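Stitching the flat output list back into that shape is index arithmetic, since the prompts were built records-major. A sketch under the same assumptions as above:

import json

heuristics = list(HEURISTIC_PROMPTS)
reports = []
for r, record in enumerate(records):
    evaluations = {}
    for h, name in enumerate(heuristics):
        # outputs[i] lines up with prompts[i]
        text = outputs[r * len(heuristics) + h].outputs[0].text
        evaluations[name] = json.loads(text)  # guided decoding makes this parse reliably
    reports.append({
        "id": record["id"],
        "source": record["source"],
        "text_preview": record["text"][:80],
        "evaluations": evaluations,
    })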

If you can read those three snippets, you can read the tutorial.

What Gemma 4 E4B actually catches

Real examples from the sample data, so this doesn’t read like a benchmark on French capitals.

A record with obvious PII comes back with detected: true, types enumerated, offending spans quoted verbatim. Phone. Email. Partial SSN. Gemma 4 E4B handles the pattern-matching layer without needing a regex pass in front of it.

(A version caveat: I had to fall back to Gemma 3 for these benchmark runs; Gemma 4 was too new to be fully supported on vLLM at the time.)

A classic jailbreak attempt (“Ignore previous instructions and…”) comes back detected, with confidence around 0.95 and a one-sentence rationale that names the manipulation. Calibration isn’t uniform across the full jailbreak taxonomy, and I’ll come back to that.

A biased generalization, the kind that slips past surface filters because it’s grammatical and confident-sounding, comes back detected with category labels and a severity rating. This is the class of output where a small judge model earns its cost. Regex can’t catch it. Keyword lists can’t catch it. A model that understands framing can.

Step back for a moment: these three heuristics aren’t arbitrary. Each maps to a failure mode from Calibrated Trust’s Five Pillars. PII leakage is a Transparency failure. A successful jailbreak is a Consequence Acceptance failure. Biased generalization is a User Experience and Value Delivery failure. A judge that evaluates all three on every logged output is the operational substrate for governance, not the governance itself. The framework tells you what to measure. The batch pattern tells you whether measuring is affordable.
None of this replaces a human reviewer. It does, however, mean the human reviewer sees a triaged queue instead of a raw firehose.

The numbers

Running all 50 records through 3 heuristics, 150 total judge evaluations, took ~12 seconds* on a v5e-4 after the first XLA compile. You pay the compile cost once per batch shape (20–30 minutes on v5e-4 the first time). The cache lands on disk, so repeat runs start inference in seconds.

That works out to roughly 4 records per second* and 12.5 judge evaluations per second*. At Cloud TPU v5e DWS Flex-start pricing of $0.60 per chip-hour ($2.40 per hour for the full 4-chip v5e-4 host), the whole job cost less than a cent*. Comparison point: 150 sequential calls to a hosted judge API run in minutes, not seconds, and accumulate per-call cost that doesn’t divide. Different economics, different throughput profile, same output shape.
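For anyone rerunning the arithmetic with their own numbers, the asterisks reduce to three lines:

# Back-of-envelope from the measured ~12 s steady-state batch (post-compile).
evals_per_sec = 150 / 12            # ≈ 12.5 judge evaluations per second
records_per_sec = 50 / 12           # ≈ 4.2 records per second
cost_per_run = (12 / 3600) * 2.40   # ≈ $0.008 at $2.40/hr for the v5e-4 host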

For a compliance harness that runs nightly on a growing dataset, amortized cost keeps falling. Compile once. Reuse the cache. Feed the queue.
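One operational note: in the vLLM TPU builds I used, the persistent compile cache location is controlled by an environment variable. Pinning it to durable storage lets re-provisioned VMs reuse the cache; verify the variable name against your vLLM version.

import os

# Point the persistent XLA compile cache at a disk that survives VM teardown,
# so repeat runs and fresh hosts skip the 20-30 minute compile.
os.environ["VLLM_XLA_CACHE_PATH"] = "/mnt/persist/xla_cache"  # set before importing vllm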

When the pattern is worth it, and when it isn’t

Good fits: overnight compliance batches on logged LLM output. Dataset audits before training or fine-tuning. Evaluation harnesses that score thousands of generations on a schedule. The repo also includes an online server path (Module 3) for the workloads where you need a persistent endpoint for streaming requests against the same heuristics.

Bad fits: real-time single-request moderation, where a hosted API or a vLLM online server serves you better. Workloads with wildly variable sequence lengths, because XLA static-shape bucketing penalizes you. Judge prompts that change often, because every new shape triggers recompilation.

The honest caveat goes deeper than shape bucketing. A small judge model flags the obvious cases well. It’s less reliable on adversarial PII that hides behind natural phrasing, on multi-turn jailbreaks where the attack surface spans several messages, and on forms of bias that read as measured prose. Throughput is solved. Ground truth isn’t.

This tutorial evaluates single LLM outputs. Step two extends the same batch pattern to multi-turn agent trajectories: a judge calibrated against expert human review, built as a Five Pillars evaluator and fine-tuned on labeled trajectory data. That's the TPU Research Cloud (TRC) sprint I'm pulling together. If you're working on judge calibration or trajectory evaluation, that's where the conversation goes next.

Try it

Deploy on Colab, Kaggle Notebooks, or Vertex within the newly announced Google Enterprise Agent Platform. The notebook uses Gemma 3 1B (the pip-installed vllm-tpu package doesn't yet support Gemma 4; that requires the Docker/GCE path). Ten minutes in a browser tab, assuming you have Colab TPU quota available; free-tier TPU access is gated aggressively, and exhausted quota resets on a 24-hour rolling window.

There’s also:

  • Full tutorial: clone compliance-at-scale-tpu and follow 01_setup/. Thirty minutes to a running v5e-4 (plus the one-time XLA compile).
  • Already using rai-checklist-cli? Module 4 (04_integration_demo/) has the thin bridge from batch verdicts to the Markdown / YAML / JSON report formats the CLI already produces. Drop it into your existing pipeline.
  • Just reading? The repo has architecture diagrams, per-module READMEs, and the sample data.

Issues, ideas, or pushback on the framework reading of these three heuristics: open a discussion.

