This Rewrite Isn't the Constraint: How a 300ms Tail Latency Hunt Led to a New Event Pipeline

May 29, 2026

We were burning 400ms in p99 tail latency on a core event-processing path in Veltrix. The upstream teams kept blaming the network, but the numbers didn't lie: 64% of the time was spent inside the JVM, specifically in sun.misc.Unsafe.park during GC pauses. Every time we hit 80% heap pressure, throughput collapsed and we lost 300k events per minute. That was the exact moment I stopped believing in the JVM as the runtime for this path and started looking at the system boundary.
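For readers less used to tail metrics, a p99 figure like the one above can be derived from raw latency samples roughly as follows. This is a minimal sketch using the nearest-rank method; the function name percentile_ms and the sample values in the note below are illustrative assumptions, not code or data from the Veltrix pipeline.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// `pct` is a whole-number percentile in 1..=100.
/// Returns None for an empty slice or an out-of-range percentile.
/// (Hypothetical helper for illustration only.)
fn percentile_ms(samples: &[u64], pct: u32) -> Option<u64> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest-rank definition: rank = ceil(pct / 100 * n), 1-based.
    // Integer arithmetic avoids floating-point rounding surprises.
    let rank = (pct as usize * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}
```

With nine samples of 10 ms and a single 400 ms outlier, percentile_ms(&samples, 50) returns Some(10) while percentile_ms(&samples, 99) returns Some(400), which is why a median can look healthy while the tail is not.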

The first attempt was an aggressively tuned HotSpot with G1GC, with the critical threads pinned to their own NUMA nodes. We set -XX:MaxGCPauseMillis=20 and -XX:+UseNUMA, and even migrated to Azul Zulu Prime because its handling of large heaps was supposedly better. The p99 dropped to 280ms, but the GC telemetry still showed a sawtooth pattern of 30–40ms spikes every 230ms on a 16GB heap. Profiling with JDK Flight Recorder told us 18% of CPU time was spent in card-table scanning. At that point I knew we were fighting the runtime, not the problem. The event pipeline was small (just JSON parsing, enrichment, and a single RocksDB write), but the JVM's generational collector couldn't stop moving objects.

The architecture decision came during a four-day blackout window after a failed blue-green deploy. Three of us sat in a war room with a single Grafana dashboard showing 100% CPU steal time on the Kubernetes nodes. We had two choices: squeeze more life out of the JVM by manually balancing the heap, or rewrite the critical hot path in Rust and give the compiler full control over memory layout. The Rust option meant losing the JVM ecosystem (no more async-profiler, no more one-liner heap dumps) but gave us stackless futures, zero-cost abstractions, and compile-time memory safety. We chose Rust. We forked the Cargo.toml we'd used in a sidecar for metrics and started porting the event collector.

The numbers after the rewrite told the story. We rebuilt the same two endpoints, POST /events and GET /aggregates, in Rust and served them from the same Kubernetes pod scaled to two replicas. After two weeks of shadow traffic, the results were:

p50 latency moved from 26ms (JVM) to 14ms (Rust)
p99 latency dropped from 400ms to 82ms, staying flat even under 90% heap usage
Allocation rate fell from 680MiB/s to 42MiB/s (a 16× reduction)
RocksDB compaction lag fell 60% because the Rust side no longer churned memory
Rust's jemalloc profile showed 98% of allocations were stack or bump-pointer, not heap
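An allocation rate like the one in the list above can be approximated inside a Rust process by wrapping the global allocator. The sketch below is an assumption-laden illustration, not the instrumentation used in this project: it wraps the standard library's System allocator rather than jemalloc so that it needs no external crates, and the names CountingAlloc and alloc_snapshot are hypothetical.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical wrapper that counts heap allocations and requested bytes.
/// It delegates to the std System allocator (not jemalloc).
struct CountingAlloc;

static ALLOC_CALLS: AtomicU64 = AtomicU64::new(0);
static ALLOC_BYTES: AtomicU64 = AtomicU64::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Relaxed ordering is enough for simple monotonic counters.
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size() as u64, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Every heap allocation in the program now goes through CountingAlloc.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Returns (allocation calls, bytes requested) observed so far.
fn alloc_snapshot() -> (u64, u64) {
    (
        ALLOC_CALLS.load(Ordering::Relaxed),
        ALLOC_BYTES.load(Ordering::Relaxed),
    )
}
```

Taking a snapshot before and after a batch of events and dividing the byte delta by the elapsed time gives a rough MiB/s figure. Values that live on the stack never reach the allocator, so they do not show up in these counters.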

We still use the JVM, but only for the edges: the admin endpoints, the health probes, even the build tooling. The core event pipeline lives in Rust now, compiled with -C target-cpu=native, -C opt-level=3, and jemalloc as the global allocator. The CI pipeline now builds a static binary that weighs 12MB and starts in under 15ms. We lost the ergonomic debugging of the JVM, but gained the ability to inline hot loops, control object layouts with repr(C), and reason about zero-cost abstractions at compile time.
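To make the repr(C) point concrete, here is a small sketch of how field order changes struct size once the compiler is no longer free to reorder fields. The struct and field names are hypothetical and are not taken from the actual event collector; the 24-byte figure assumes a typical 64-bit target where u64 is 8-byte aligned.

```rust
use std::mem::size_of;

/// Hypothetical event header with fields ordered largest-first.
/// With repr(C) the fields are laid out in declaration order.
#[repr(C)]
struct EventHeader {
    timestamp_ns: u64, // offset 0
    tenant_id: u32,    // offset 8
    kind: u16,         // offset 12
    flags: u16,        // offset 14
}

/// Same fields in a less careful order. On a typical 64-bit target,
/// padding is inserted after `kind` and after `flags`.
#[repr(C)]
struct PaddedHeader {
    kind: u16,         // offset 0, followed by 6 bytes of padding
    timestamp_ns: u64, // offset 8
    flags: u16,        // offset 16, followed by 2 bytes of padding
    tenant_id: u32,    // offset 20
}

// Compile-time check: the build fails if the packed layout ever grows.
const _: () = assert!(size_of::<EventHeader>() == 16);
```

On x86-64, size_of::<PaddedHeader>() is 24 bytes against 16 for EventHeader, so the same data costs 50% more memory per record purely because of field order. The const assertion is one way to reason about layout at compile time rather than discovering a regression in a profile.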

What I would do differently is split the rewrite into smaller pieces instead of doing a big-bang cutover. The first Rust version still called out to JNI for RocksDB because we were in a hurry. That JNI bridge introduced a 5ms latency spike and added a 320MiB memory overhead we didn't see until we ran jemalloc profiling. Next time I'll either write the RocksDB bindings in Rust using the raw C API or integrate the Rust crate redb sooner. Also, we under-instrumented the Rust side at first. We learned the hard way that perf on Rust binaries doesn't show useful symbol names unless you compile with DWARF debug info (-C debuginfo=2). The flamegraph we eventually got from perf inject --jit was the clue that revealed the hidden 3ms in a single memcpy inside our JSON parser.
