Goroutine leaks rarely announce themselves with a dramatic outage. They show up as “slowly getting worse”:
- p95/p99 creeps up over an hour or two
- memory trends upward even though traffic is flat
- goroutine count keeps climbing and doesn’t return to baseline
If you’ve been on-call long enough, you’ve seen the trap: people debate why it’s happening before they’ve proven what is accumulating.
This post is a compact, production-first triage that I use to confirm a goroutine leak fast, identify the dominant stuck pattern, and ship a fix that holds.
If you want the full long-form runbook with a root-cause catalog, hardening defaults, and a production checklist, I published it here:
https://compile.guru/goroutine-leaks-production-pprof-tracing/
What a goroutine leak is (the only definition that matters in production)
In production I don’t define a leak as “goroutines are high.”
A goroutine is leaked when it outlives the request/job that created it and it has no bounded lifetime (no reachable exit path tied to cancellation, timeout budget, or shutdown).
That framing matters because it turns debugging into lifecycle accounting:
- What started this goroutine?
- What is its exit condition?
- Why is the exit unreachable?
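For a concrete picture, here's a minimal (hypothetical) handler that leaks exactly this way; doSlowWork is a stand-in for any slow call:

import (
    "fmt"
    "net/http"
)

func handle(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    results := make(chan string) // unbuffered

    go func() {
        // doSlowWork is hypothetical: any slow call (DB, downstream API).
        results <- doSlowWork(r) // blocks forever if nobody ever receives
    }()

    select {
    case res := <-results:
        fmt.Fprint(w, res)
    case <-ctx.Done():
        // The handler returns, but the goroutine above is still parked on
        // the send: it has outlived the request with no reachable exit path.
        http.Error(w, "timeout", http.StatusGatewayTimeout)
    }
}

Its exit condition is "someone receives from results"; once the timeout branch runs, that condition is unreachable. A buffer of 1, or the ctx-aware send shown later, bounds the lifetime.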
Minute 0–3: confirm the signature (don’t skip this)
Before you touch profiling, answer one question:
Is the system accumulating concurrency footprint without a matching increase in work?
What I look at together:
- QPS / job intake (flat or stable-ish)
- goroutines (upward slope)
- inuse heap / RSS (upward slope)
- tail latency (upward slope)
If goroutines spike during a burst and then gradually return: that’s not a leak, that’s load.
If goroutines rise linearly (or step-up repeatedly) while work is stable: treat it as a leak until proven otherwise.
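If you don't already export a goroutine gauge, a low-effort sketch using the standard library's expvar looks like this (most Prometheus setups already get go_goroutines from the Go collector, so check before adding anything):

import (
    "expvar"
    "runtime"
)

func init() {
    // Importing expvar registers /debug/vars on the default mux; this
    // publishes the live goroutine count there so dashboards can plot
    // the slope next to QPS and RSS.
    expvar.Publish("goroutines", expvar.Func(func() any {
        return runtime.NumGoroutine()
    }))
}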
Minute 3–10: capture two goroutine profiles and diff them
The key move is comparison. A single goroutine dump is noisy. Two dumps tell you what’s growing.
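Both options below assume the binary already exposes net/http/pprof. If yours doesn't, a minimal sketch is a separate, internal-only listener (the address is a placeholder):

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func startDebugListener() {
    // Call once at startup. Keep this listener internal-only; pprof should
    // never be reachable from the public port.
    go func() {
        log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
    }()
}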
Option A (best for diffing): capture the binary pprof format and compare with go tool pprof
Capture twice, separated by 10–15 minutes.
curl -sS "http://$HOST/debug/pprof/goroutine" > goroutine.1.pb.gz
sleep 900
curl -sS "http://$HOST/debug/pprof/goroutine" > goroutine.2.pb.gz
Now diff them.
go tool pprof -top -diff_base=goroutine.1.pb.gz ./service-binary goroutine.2.pb.gz
What you want:
- one (or a few) stacks that grow a lot
- a clear wait reason: channel send/recv, network poll, lock wait, select, etc.
Option B (fastest human scan): debug=2 text dumps
curl -sS "http://$HOST/debug/pprof/goroutine?debug=2" > goroutines.1.txt
sleep 900
curl -sS "http://$HOST/debug/pprof/goroutine?debug=2" > goroutines.2.txt
Then do a “poor man’s diff”:
- search for repeated top frames
- count occurrences (even roughly)
- focus on the stacks with the biggest growth
Minute 10–15: map the dominant stack to the first fix you should try
Once you have “the stack that grows,” the fix is usually not mysterious. Here’s the mapping I use to choose the first patch.
1) Many goroutines blocked on chan send / chan receive
Interpretation: backpressure/coordination bug. Producers outpace consumers, or receivers are missing, or close ownership is unclear.
First fix:
- add a cancellation edge to send/receive paths (select { case <-ctx.Done(): ... })
- bound the queue/channel (and decide policy: block with timeout vs reject)
Example helper:
func sendWithContext[T any](ctx context.Context, ch chan<- T, v T) error {
    select {
    case ch <- v:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}
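If the policy is "reject instead of wait", a non-blocking variant makes the backpressure decision explicit (ErrQueueFull is an invented sentinel):

import "errors"

var ErrQueueFull = errors.New("queue full")

// trySend fails fast when the bounded channel has no free capacity,
// instead of parking a goroutine on the send.
func trySend[T any](ch chan<- T, v T) error {
    select {
    case ch <- v:
        return nil
    default:
        return ErrQueueFull
    }
}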
2) Many goroutines stuck in net/http.(*Transport).RoundTrip / netpoll waits
Interpretation: outbound I/O without a real deadline, or missing request-context wiring. A slow downstream causes your service to "hold on" to goroutines for as long as the downstream takes.
First fix:
- enforce timeouts at the client level (transport + overall cap)
- always use http.NewRequestWithContext (or req = req.WithContext(ctx))
- always close bodies and bound reads
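A sketch of those defaults in one place (the numbers are illustrative, not recommendations):

import (
    "context"
    "io"
    "net/http"
    "time"
)

var client = &http.Client{
    Timeout: 2 * time.Second, // overall cap: connect + TLS + headers + body
}

func callDownstream(ctx context.Context, url string) ([]byte, error) {
    // Per-call budget, derived from the request's context.
    ctx, cancel := context.WithTimeout(ctx, 800*time.Millisecond)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    resp, err := client.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    // Bound the read so a misbehaving downstream can't stream forever.
    return io.ReadAll(io.LimitReader(resp.Body, 1<<20))
}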
3) Many goroutines waiting on WaitGroup.Wait, semaphores, or errgroup
Interpretation: join/cancellation bug or unbounded fan-out. Work starts faster than it completes; cancellation doesn’t propagate; someone forgot to wait.
First fix:
- ensure there is exactly one "owner" that always calls Wait()
- use errgroup.WithContext so cancellation is wired
- bound concurrency explicitly (e.g., SetLimit)
g, ctx := errgroup.WithContext(ctx)
g.SetLimit(16)
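Put together, a bounded fan-out with a single owner that always joins looks roughly like this (Item and process are placeholders for your work type and worker):

import (
    "context"

    "golang.org/x/sync/errgroup"
)

func processAll(ctx context.Context, items []Item) error {
    g, ctx := errgroup.WithContext(ctx)
    g.SetLimit(16) // bounded fan-out

    for _, it := range items {
        it := it // loop-variable capture; unnecessary on Go 1.22+
        g.Go(func() error {
            return process(ctx, it) // must honor ctx internally
        })
    }

    // Single owner: this function always joins, and the first error
    // cancels ctx for everyone else.
    return g.Wait()
}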
4) Many goroutines in timers/tickers / periodic loops
Interpretation: time-based resources not stopped, or loops not tied to cancellation/shutdown.
First fix:
- stop tickers
- stop + drain timers when appropriate
- ensure the loop has a ctx.Done() exit
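A periodic loop with a bounded lifetime looks like this (flushMetrics is a stand-in for whatever the loop does):

import (
    "context"
    "time"
)

func runFlusher(ctx context.Context, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop() // otherwise the ticker outlives the loop

    for {
        select {
        case <-ticker.C:
            flushMetrics()
        case <-ctx.Done():
            return // the exit path that bounds this goroutine's lifetime
        }
    }
}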
Where tracing fits (and why it’s worth it even if pprof "already shows the stack")
pprof tells you what is stuck. Tracing tells you:
- which request/job spawned it
- what deadline/budget it had
- which downstream call/queue wait never returned
If you already have OpenTelemetry (or any tracing), the fastest win is:
- put spans around anything that can block: outbound HTTP/gRPC, DB calls, queue publish/consume, semaphore acquire, worker enqueue
- tag spans with route/operation, downstream name, and timeout budget
That way, when profiling says "these goroutines are stuck in RoundTrip," tracing tells you "95% of them are from /enrich, tenant X, calling payments-api, timing out at 800ms."
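With OpenTelemetry's Go API, that wrapping is small; a sketch (attribute keys are ad hoc, doCall stands in for the blocking work):

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func callPayments(ctx context.Context, tenant string) error {
    ctx, span := otel.Tracer("enrich").Start(ctx, "payments-api.charge")
    defer span.End()

    span.SetAttributes(
        attribute.String("downstream", "payments-api"),
        attribute.String("tenant", tenant),
        attribute.String("timeout_budget", "800ms"),
    )

    return doCall(ctx) // the blocking call, honoring ctx's deadline
}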
The patch that actually holds: ship "hardening defaults," not one-off fixes
If you only patch the one stack you saw today, the next incident will be a different stack.
The fixes that keep paying dividends are defaults:
- timeout budgets at boundaries
- bounded concurrency for any fan-out
- bounded queues + explicit backpressure policy
- explicit channel ownership rules
- structured shutdown (stop admission → cancel context → wait with a shutdown budget)
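That last item is the one I see left half-done most often; a sketch of the shape (the 10s budget and names are illustrative):

import (
    "context"
    "net/http"
    "sync"
    "time"
)

func shutdown(srv *http.Server, cancelWorkers context.CancelFunc, workers *sync.WaitGroup) {
    // One overall shutdown budget shared by the steps below.
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // 1) Stop admission: no new requests; in-flight ones drain within the budget.
    _ = srv.Shutdown(ctx)

    // 2) Cancel the context every background goroutine was started with.
    cancelWorkers()

    // 3) Wait for them, but never longer than the budget.
    done := make(chan struct{})
    go func() {
        workers.Wait()
        close(done)
    }()
    select {
    case <-done:
    case <-ctx.Done():
        // Anything still running here is exactly the kind of goroutine
        // this post is about; log it before the process exits.
    }
}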
I keep the complete hardening patterns + production checklist in the full post:
https://compile.guru/goroutine-leaks-production-pprof-tracing/
Prove it’s fixed (don’t accept vibes)
A real fix has artifacts:
- goroutine slope stabilizes under the same traffic/load pattern
- the dominant growing stack is gone (or bounded) in comparable snapshots
- tail latency and timeout rate improve (or at least stop worsening)
Also watch out for "false confidence":
- restarts and autoscaling can hide leaks without removing the bug
- short tests miss slow leaks (especially timer/ticker issues)
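One complement that catches regressions early: a leak check in tests (a sketch assuming you can take the go.uber.org/goleak dependency; newPool is a hypothetical constructor for the code under test). It won't find slow, production-only leaks, but it stops the obvious ones from coming back:

import (
    "testing"

    "go.uber.org/goleak"
)

func TestPoolShutsDownCleanly(t *testing.T) {
    defer goleak.VerifyNone(t) // fails the test if goroutines outlive it

    pool := newPool(4) // hypothetical code under test
    pool.Start()
    pool.Stop()
}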
Wrap-up
The fastest way to win against goroutine leaks is to stop guessing:
1) confirm the signature (slope + correlation)
2) take two goroutine captures and diff them
3) fix the dominant stack with lifecycle bounds (timeout/cancel/join/backpressure)
4) prove the fix with before/after slope and comparable snapshots
If you want the deeper catalog of leak patterns and the production checklist I use in reviews and incident response, here’s the complete runbook:
https://compile.guru/goroutine-leaks-production-pprof-tracing/