We were burning 400ms in p99 tail latency on a core event-processing path in Veltrix. The upstream teams kept blaming the network, but the numbers didn't lie: 64% of the time was spent inside the JVM, specifically in threads sitting in sun.misc.Unsafe.park during GC pauses. Every time we hit 80% heap pressure, throughput collapsed and we lost 300k events per minute. That was the moment I stopped believing the JVM was the right runtime for this path and started looking at the system boundary.
The first attempt was an aggressively tuned HotSpot with G1GC, with the critical threads pinned to their own NUMA nodes. We set -XX:MaxGCPauseMillis=20 (a soft target, not a guarantee) and -XX:+UseNUMA, and even migrated to Azul Zulu Prime because its handling of large heaps was supposedly better. The p99 dropped to 280ms, but the GC telemetry still showed a sawtooth pattern of 30–40ms spikes every 230ms on a 16GB heap. Profiling with JDK Flight Recorder told us 18% of CPU time was spent in card-table scanning. At that point I knew we were fighting the runtime, not the problem. The event pipeline was small (just JSON parsing, enrichment, and a single RocksDB write), but the JVM's generational collector couldn't stop moving objects.
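A back-of-the-envelope check on that sawtooth, as a sketch only. It assumes each spike is a full stop-the-world pause and that requests arrive uniformly at random, neither of which the telemetry above confirms. The function names and the 20ms service time used afterwards are made up for illustration.

```rust
/// Fraction of wall-clock time spent paused when a pause of `pause_ms`
/// recurs once every `period_ms`.
pub fn pause_fraction(pause_ms: f64, period_ms: f64) -> f64 {
    pause_ms / period_ms
}

/// Probability that a request needing `service_ms` of work overlaps a pause,
/// assuming its arrival time is uniformly distributed over the period.
/// A request overlaps if it arrives during the pause or less than
/// `service_ms` before it, so the "bad" window is pause + service time.
pub fn overlap_probability(pause_ms: f64, period_ms: f64, service_ms: f64) -> f64 {
    ((pause_ms + service_ms) / period_ms).min(1.0)
}
```

With the figures above, pause_fraction(30.0, 230.0) is about 0.13 and pause_fraction(40.0, 230.0) is about 0.17. For a hypothetical request needing 20ms of work, overlap_probability(35.0, 230.0, 20.0) is about 0.24, so under these assumptions roughly one request in four runs into a pause, which is far too many for a p99 to escape.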
The architecture decision came during a four-day blackout window after a failed blue-green deploy. Three of us sat in a war room with a single Grafana dashboard showing 100% CPU steal time on the Kubernetes nodes. We had two choices: squeeze more life out of the JVM by manually balancing the heap, or rewrite the critical hot path in Rust and give the compiler full control over memory layout. The Rust option meant losing the JVM ecosystem (no more async-profiler, no more one-liner heap dumps) but gave us stackless futures, zero-cost abstractions, and compile-time memory safety. We chose Rust. We forked the Cargo.toml we'd used in a sidecar for metrics and started porting the event collector.
The numbers after the rewrite told the story. We reimplemented the same two endpoints, POST /events and GET /aggregates, and served them from the same Kubernetes pod spec, scaled to two replicas. After two weeks of shadow traffic the results were:
p50 latency moved from 26ms (JVM) to 14ms (Rust)
p99 latency dropped from 400ms to 82ms, staying flat even under 90% heap usage
Allocation rate fell from 680MiB/s to 42MiB/s (a 16× reduction)
RocksDB compaction lag fell 60% because the Rust side no longer churned memory
Rust's jemalloc profile showed 98% of allocations were stack or bump-pointer, not heap
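Allocation pressure of the kind listed above can also be measured in-process by wrapping the global allocator. The following is a minimal sketch using only the standard library. CountingAlloc, ALLOCATED_BYTES, and bytes_allocated_during are hypothetical names, and the wrapper sits on top of std::alloc::System rather than jemalloc purely to stay dependency-free.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Running total of bytes requested from the heap, process-wide.
static ALLOCATED_BYTES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper: forwards to the system allocator and counts bytes.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
    // alloc_zeroed and realloc use the trait's default implementations,
    // which route through `alloc` above and are therefore counted too.
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Bytes requested from the heap (by any thread) while `f` runs.
pub fn bytes_allocated_during<F: FnOnce()>(f: F) -> usize {
    let before = ALLOCATED_BYTES.load(Ordering::Relaxed);
    f();
    ALLOCATED_BYTES.load(Ordering::Relaxed) - before
}
```

Wrapping a hot-path call in bytes_allocated_during gives a quick read on whether it touches the heap at all. The counter is global, so allocations made by other threads during the call are included; treat the result as an upper bound for the closure itself.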
We still use the JVM, but only for the edges: the admin endpoints, the health probes, even the build tooling. The core event pipeline lives in Rust now, compiled with -C target-cpu=native and -C opt-level=3, with jemalloc as the global allocator. The CI pipeline now builds a static binary that weighs 12MB and starts in under 15ms. We lost the ergonomic debugging of the JVM, but gained the ability to inline hot loops, control object layouts with #[repr(C)], and reason about zero-cost abstractions at compile time.
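To make the layout point concrete, here is a small sketch of what #[repr(C)] pins down. EventHeader and ReorderedHeader are hypothetical types, not the ones in our pipeline, and the sizes in the comments assume a typical 64-bit target where u64 is 8-byte aligned.

```rust
use std::mem::size_of;

/// Fields are laid out in declaration order, with C padding rules.
#[repr(C)]
pub struct EventHeader {
    pub kind: u8,       // offset 0, followed by 7 bytes of padding
    pub timestamp: u64, // offset 8
    pub flags: u8,      // offset 16, followed by 7 bytes of tail padding
}

/// Same fields, widest first, so the two u8 fields share one 8-byte slot.
#[repr(C)]
pub struct ReorderedHeader {
    pub timestamp: u64, // offset 0
    pub kind: u8,       // offset 8
    pub flags: u8,      // offset 9, followed by 6 bytes of tail padding
}

/// Returns (size of EventHeader, size of ReorderedHeader) in bytes.
pub fn header_sizes() -> (usize, usize) {
    (size_of::<EventHeader>(), size_of::<ReorderedHeader>())
}
```

On such a target header_sizes() gives 24 and 16 bytes. Without #[repr(C)] the compiler is free to reorder fields on its own, which usually yields the compact layout but makes no promise about field order, so repr(C) is the tool when the exact layout matters (FFI, on-disk formats, cache-line planning).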
What I would do differently is split the rewrite into smaller pieces instead of doing a big-bang cutover. The first Rust version still called out to JNI for RocksDB because we were in a hurry. That JNI bridge introduced a 5ms latency spike and added a 320MiB memory overhead we didn't see until we ran jemalloc profiling. Next time I'll either write the RocksDB bindings in Rust using the raw C API or integrate the Rust crate redb sooner. Also, we under-instrumented the Rust side at first. We learned the hard way that perf gives you little more than raw addresses and truncated stacks on a Rust release binary unless you keep symbols and debug info (-C debuginfo=2, or debug = true in the release profile) and give perf a way to unwind (-C force-frame-pointers=yes, or perf record --call-graph dwarf). The flamegraph we eventually got from perf was the clue that revealed the hidden 3ms in a single memcpy inside our JSON parser.