When you call an external API, things go fine until they don’t. A network blip, a server restart, a rate limit. So you add a retry, and most of the time it helps. The problem is that the obvious retry, the one we all write first, can quietly make things worse: it can resend a payment twice, or keep a struggling server down longer than the original failure would have.
In this post we’ll build a retry client in Go from the naive version up, fix each thing that bites, and end somewhere a little uncomfortable: telling you to use a library instead. By the end you’ll understand exactly what that library is doing, which is the real reason to read this.
The retry everyone writes first
Three tries, wait a second, give up. It looks reasonable, and there are three problems hiding in it.
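For reference, that first version usually looks something like the sketch below. The names naiveRetry, stubTransport and demo are mine, and the stub transport only exists so the example runs without a network. It answers every call with a 404, which makes problem three easy to see.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// stubTransport lets the sketch run offline: it counts calls and answers
// each one with the same status code. It is illustrative only.
type stubTransport struct {
	status int
	calls  int
}

func (s *stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	s.calls++
	return &http.Response{
		StatusCode: s.status,
		Header:     make(http.Header),
		Body:       http.NoBody,
		Request:    req,
	}, nil
}

// naiveRetry is the "obvious" retry: a fixed number of attempts with a
// fixed wait. It is shown as the starting point, not as something to copy.
func naiveRetry(client *http.Client, req *http.Request, attempts int, wait time.Duration) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Do(req) // same req every time; a body would already be consumed
		if err == nil && resp.StatusCode < 400 {
			return resp, nil
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("attempt %d: status %d", i+1, resp.StatusCode)
		}
		time.Sleep(wait) // identical wait for every client, and not cancellable
	}
	return nil, lastErr
}

// demo runs naiveRetry against a stub that always answers 404 and reports
// how many times the "server" was hit.
func demo() (calls int, err error) {
	stub := &stubTransport{status: http.StatusNotFound}
	client := &http.Client{Transport: stub}
	req, err := http.NewRequest(http.MethodGet, "http://example.invalid/thing", nil)
	if err != nil {
		return 0, err
	}
	_, err = naiveRetry(client, req, 3, time.Millisecond)
	return stub.calls, err
}

func main() {
	calls, err := demo()
	fmt.Println(calls, err)
}
```

The stub gets hit three times for a 404 that was never going to change.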
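First, the body. An HTTP client reads the request body to the end when it sends it, and it doesn’t rewind. So on the second attempt there’s nothing left to send. Depending on how the request was built, Go’s client either fails it locally because the body is shorter than the declared Content-Length, or the server gets an empty POST and returns 400. Either way, you’re now retrying a failure that you caused.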
Second, the wait is the same for everyone. When a server starts returning 503s, every client retries on the same one-second tick, together, over and over. You haven’t given it room to recover. You’ve lined everyone up to hit it again at the same moment. This is the thundering herd.
Third, you’re retrying everything. A 400 means your request is wrong. A 404 means the thing isn’t there. No amount of retrying changes either, but this loop treats every failure the same.
Three problems, and they don’t get fixed in the same place. Let’s take them one at a time.
Backoff: don’t retry in lockstep
The wait between attempts should grow, and it shouldn’t be identical for every client. Growing is exponential backoff. Not-identical is jitter. You want both.
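Here is a minimal sketch of both pieces, under my own naming: a Backoff interface, an Exponential strategy that shifts and caps, and a FullJitter wrapper that works over any Backoff. The Rand field is a hypothetical hook for tests; when it is nil the wrapper falls back to math/rand.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Backoff returns the wait before the given retry attempt (0-based).
type Backoff interface {
	Delay(attempt int) time.Duration
}

// Exponential doubles Base on every attempt and never exceeds Max.
type Exponential struct {
	Base, Max time.Duration
}

func (e Exponential) Delay(attempt int) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	// Cap before shifting: if Base<<attempt would pass Max (or overflow
	// int64), return Max instead of computing the shifted value.
	if attempt >= 63 || e.Base > e.Max>>uint(attempt) {
		return e.Max
	}
	return e.Base << uint(attempt)
}

// FullJitter wraps any Backoff and picks uniformly from [0, d).
type FullJitter struct {
	Inner Backoff
	Rand  func(n int64) int64 // must return a value in [0, n); nil means rand.Int63n
}

func (j FullJitter) Delay(attempt int) time.Duration {
	d := j.Inner.Delay(attempt)
	if d <= 0 {
		return 0 // rand.Int63n panics on n <= 0, so bail out first
	}
	pick := j.Rand
	if pick == nil {
		pick = rand.Int63n
	}
	return time.Duration(pick(int64(d)))
}

func main() {
	exp := Exponential{Base: 100 * time.Millisecond, Max: 2 * time.Second}
	jittered := FullJitter{Inner: exp}
	for attempt := 0; attempt < 7; attempt++ {
		fmt.Println(attempt, exp.Delay(attempt), jittered.Delay(attempt))
	}
}
```

Because FullJitter only depends on the interface, a constant or linear strategy gets jitter by being wrapped the same way.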
Two details here are worth slowing down on, because they’re the ones people get wrong.
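I used a left shift instead of math.Pow(2, attempt). math.Pow works in float64, so you have to convert the result back to a Duration, and Go leaves the result of converting an out-of-range float to an integer implementation-dependent. The integer product has its own trap: Base * 2^attempt overflows an int64 of nanoseconds once the attempt count gets large enough (a 100ms base wraps at attempt 37), and the wait comes out negative. Shifting integers and capping before the value can overflow avoids all of it.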
And jitter is a wrapper, not a second copy of the exponential math. The tempting move is to write a separate ExponentialBackoffWithJitter type and duplicate the doubling inside it. Then your linear and constant strategies don’t get jitter, and you’ve got the same logic in two places. Wrapping any Backoff keeps it in one place.
One more thing: this is full jitter, random(0, d), not the wait give or take 25 percent, and the difference is the whole point. If every client jitters in a narrow band around the same target, they stay bunched together, just a slightly wider bunch. Picking uniformly across the entire [0, d) window is what actually scatters them. If you want the simulations behind that, AWS’s Exponential Backoff and Jitter post has the graphs.
Make sleep mockable, and make it respect context
Here’s a practical problem you hit the moment you write a test. If the test really sleeps through exponential backoff, each case takes seconds, and a suite full of them crawls.
So don’t call time.Sleep directly. Put it behind an interface, and while you’re there, make it cancellable.
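One way to shape that interface is sketched below. Sleeper, realSleeper and demo are names I picked for illustration. The production implementation races a timer against the context, so a cancelled request returns immediately instead of waiting out the timer.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Sleeper is the seam between the retry loop and the clock.
type Sleeper interface {
	Sleep(ctx context.Context, d time.Duration) error
}

// realSleeper waits for d or until ctx is done, whichever comes first.
type realSleeper struct{}

func (realSleeper) Sleep(ctx context.Context, d time.Duration) error {
	if d <= 0 {
		return ctx.Err()
	}
	t := time.NewTimer(d)
	defer t.Stop() // release the timer if the context wins the race
	select {
	case <-t.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo asks for a ten-second sleep on an already-cancelled context and
// reports whether the call came back quickly.
func demo() (fast bool, err error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	start := time.Now()
	err = realSleeper{}.Sleep(ctx, 10*time.Second)
	return time.Since(start) < time.Second, err
}

func main() {
	var s Sleeper = realSleeper{}
	fmt.Println(s.Sleep(context.Background(), 5*time.Millisecond))
	fmt.Println(demo())
}
```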
That select on ctx.Done() is the part most hand-rolled retry loops miss. They check the context once at the top of the loop, then block on a plain time.Sleep for ten seconds. If the request is cancelled during that wait, nothing happens until the sleep ends. With the timer-and-context version, a cancelled request stops right away. On a deploy, that’s the difference between a clean shutdown and one that hangs until the longest sleep runs out.
The client, where everything comes together
Let’s walk through the decisions, because they’re the whole point.
We buffer the body once, up front, and only when it’s a raw stream. Requests built with http.NewRequest from a *bytes.Buffer, *bytes.Reader, or *strings.Reader already carry GetBody, so we reuse it instead of reading anything. We also don’t mutate the request you passed in: each attempt runs on a clone. (Buffering does read the whole body into memory, so if you’re streaming a very large upload, know that retries and streaming pull in opposite directions, and pick one on purpose.)
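A rough sketch of that buffering decision, with helper names (bodySource, newAttempt, demo) that are mine rather than anything from a library. The demo wraps the payload in io.MultiReader purely to hide its concrete type, so http.NewRequest cannot set GetBody and the buffering path is exercised.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// bodySource returns a function that yields a fresh copy of the request
// body for every attempt, buffering only when GetBody is unavailable.
func bodySource(req *http.Request) (func() (io.ReadCloser, error), error) {
	if req.Body == nil || req.Body == http.NoBody {
		return func() (io.ReadCloser, error) { return http.NoBody, nil }, nil
	}
	if req.GetBody != nil {
		return req.GetBody, nil // already rewindable, nothing to read
	}
	buf, err := io.ReadAll(req.Body) // raw stream: read it once into memory
	req.Body.Close()
	if err != nil {
		return nil, err
	}
	return func() (io.ReadCloser, error) {
		return io.NopCloser(bytes.NewReader(buf)), nil
	}, nil
}

// newAttempt clones req and attaches a fresh body, so the caller's
// request struct is not modified between attempts.
func newAttempt(req *http.Request, body func() (io.ReadCloser, error)) (*http.Request, error) {
	b, err := body()
	if err != nil {
		return nil, err
	}
	clone := req.Clone(req.Context())
	clone.Body = b
	return clone, nil
}

// demo builds two attempts from one streamed body and returns what each
// attempt would send.
func demo() ([]string, error) {
	stream := io.MultiReader(strings.NewReader(`{"amount":42}`))
	req, err := http.NewRequest(http.MethodPost, "http://example.invalid/pay", stream)
	if err != nil {
		return nil, err
	}
	src, err := bodySource(req)
	if err != nil {
		return nil, err
	}
	var seen []string
	for i := 0; i < 2; i++ {
		attempt, err := newAttempt(req, src)
		if err != nil {
			return nil, err
		}
		data, err := io.ReadAll(attempt.Body)
		attempt.Body.Close()
		if err != nil {
			return nil, err
		}
		seen = append(seen, string(data))
	}
	return seen, nil
}

func main() {
	fmt.Println(demo())
}
```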
The default policy won’t retry your POST, and this is the one I’d most want you to take away. A POST that timed out may have already succeeded on the server before the response got lost on the way back. Retry it blindly and you’ve charged the customer twice. So GET, HEAD, PUT, DELETE and the other idempotent methods retry by default, and POST doesn’t unless you opt in with an idempotency key. The policy gets the request so it can actually make that distinction.
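As a sketch of what such a policy might look like: the Policy signature, the choice of retryable status codes, and the "Idempotency-Key" header name are all assumptions on my part. The header is a common convention, but it only helps if the server you’re calling actually deduplicates on it.

```go
package main

import (
	"fmt"
	"net/http"
)

// Policy decides whether a finished attempt should be retried.
type Policy func(req *http.Request, resp *http.Response, err error) bool

// isIdempotent reports whether repeating the request is expected to be safe.
func isIdempotent(req *http.Request) bool {
	switch req.Method {
	case http.MethodGet, http.MethodHead, http.MethodPut,
		http.MethodDelete, http.MethodOptions, http.MethodTrace:
		return true
	}
	// Assumption: the server deduplicates on this header. Without that
	// guarantee, a key on the request changes nothing.
	return req.Header.Get("Idempotency-Key") != ""
}

// DefaultPolicy retries idempotent requests on transport errors and on a
// short list of "try again later" statuses. Everything else is final.
func DefaultPolicy(req *http.Request, resp *http.Response, err error) bool {
	if !isIdempotent(req) {
		return false
	}
	if err != nil {
		return req.Context().Err() == nil // don't retry once the caller gave up
	}
	switch resp.StatusCode {
	case http.StatusTooManyRequests, http.StatusBadGateway,
		http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		return true
	}
	return false
}

// wouldRetry is a small helper for trying the policy out by hand.
func wouldRetry(method string, status int, idempotencyKey string) bool {
	req, err := http.NewRequest(method, "http://example.invalid/", nil)
	if err != nil {
		return false
	}
	if idempotencyKey != "" {
		req.Header.Set("Idempotency-Key", idempotencyKey)
	}
	return DefaultPolicy(req, &http.Response{StatusCode: status}, nil)
}

func main() {
	var p Policy = DefaultPolicy
	_ = p
	fmt.Println("GET 503:", wouldRetry("GET", 503, ""))
	fmt.Println("GET 404:", wouldRetry("GET", 404, ""))
	fmt.Println("POST 503:", wouldRetry("POST", 503, ""))
	fmt.Println("POST 503 with key:", wouldRetry("POST", 503, "order-123"))
}
```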
Retry-After is respected and capped. If a server says wait two seconds, we wait two seconds. If a misconfigured server says wait 86400 seconds, we don’t park the goroutine for a day. We also handle the HTTP-date form of the header, not just delta-seconds, since the spec allows both and the date form is the one that usually gets forgotten.
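Parsing both forms with a cap could look roughly like this. The function name retryAfter, the 30-second cap in the demo, and the fixed "now" are illustrative choices of mine; http.ParseTime is the standard library helper for HTTP dates.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// retryAfter reads the Retry-After header in either form and returns a
// wait clamped to [0, max]. ok is false when the header is absent or
// unparseable, in which case the caller falls back to its own backoff.
func retryAfter(h http.Header, now time.Time, max time.Duration) (wait time.Duration, ok bool) {
	v := strings.TrimSpace(h.Get("Retry-After"))
	if v == "" {
		return 0, false
	}
	// Form 1: delta-seconds, e.g. "120".
	if secs, err := strconv.ParseInt(v, 10, 64); err == nil {
		if secs < 0 {
			return 0, false
		}
		if secs > int64(max/time.Second) {
			return max, true // cap before multiplying so nothing overflows
		}
		return time.Duration(secs) * time.Second, true
	}
	// Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
	t, err := http.ParseTime(v)
	if err != nil {
		return 0, false
	}
	d := t.Sub(now)
	if d < 0 {
		d = 0 // the date is already in the past
	}
	if d > max {
		d = max
	}
	return d, true
}

// demo evaluates one header value against a fixed clock and a 30s cap.
func demo(value string) (ms int64, ok bool) {
	now := time.Date(2015, time.October, 21, 7, 27, 50, 0, time.UTC)
	h := http.Header{}
	h.Set("Retry-After", value)
	d, ok := retryAfter(h, now, 30*time.Second)
	return d.Milliseconds(), ok
}

func main() {
	for _, v := range []string{"2", "86400", "Wed, 21 Oct 2015 07:28:00 GMT", "soon"} {
		ms, ok := demo(v)
		fmt.Println(v, ms, ok)
	}
}
```

Passing now in as a parameter keeps the date branch testable without touching the real clock.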
And when we finally give up, we hand back the last response with its body still open, plus an error. An earlier version of mine closed that body before returning it, which left the caller holding a response it couldn’t read. Close it yourself and check the status if you want to know why we stopped.
Now the tests run in microseconds
Because the sleeper is mocked, we can assert the exact backoff schedule, prove Retry-After overrides the backoff, prove a cancelled context stops immediately, and prove a POST is left alone. None of it waits on a real clock. That last test, the one that pins the POST behaviour, is the one I’d be most nervous shipping without.
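To make the idea concrete, here is a stripped-down version of the pattern written as a runnable program rather than a _test.go file. The fakeSleeper type and the tiny retry function are stand-ins I wrote for this example, not the client described above. The fake records every requested wait and returns at once, so the schedule can be checked without a clock.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Sleeper matches the interface the retry loop depends on.
type Sleeper interface {
	Sleep(ctx context.Context, d time.Duration) error
}

// fakeSleeper records requested waits and returns immediately.
type fakeSleeper struct {
	slept []time.Duration
}

func (f *fakeSleeper) Sleep(ctx context.Context, d time.Duration) error {
	if err := ctx.Err(); err != nil {
		return err // behave like the real sleeper on a cancelled context
	}
	f.slept = append(f.slept, d)
	return nil
}

// retry is a stand-in loop: call op up to attempts times, doubling the
// wait from base between failures, and never sleeping after the last one.
func retry(ctx context.Context, s Sleeper, attempts int, base time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break
		}
		if serr := s.Sleep(ctx, base<<uint(i)); serr != nil {
			return serr
		}
	}
	return err
}

// demo runs four failing attempts and returns the call count plus the
// waits the fake sleeper was asked for.
func demo() (calls int, slept []time.Duration) {
	s := &fakeSleeper{}
	_ = retry(context.Background(), s, 4, 100*time.Millisecond, func() error {
		calls++
		return errors.New("503")
	})
	return calls, s.slept
}

func main() {
	calls, slept := demo()
	fmt.Println(calls, slept)
}
```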
You probably shouldn’t use any of this
Here’s the uncomfortable turn. For real work, reach for a library. go-retryablehttp buffers bodies, respects Retry-After, does exponential backoff with jitter, and is wired into Terraform and Vault, so it has survived far more abuse than anything you or I will write this week. If you want a full HTTP client with a chainable API and JSON handling on top, resty has retries built in.
So why build it at all? Because now you can open that library’s source and read it as a set of decisions instead of magic: why it buffers, why it clones the request, why its default policy is careful about methods. The day it does something surprising, you’re debugging a thing you understand. That was the deliverable all along. Not the code, the understanding.
What actually bites people
A few things that aren’t obvious until they catch you.
Idempotency is the big one. Retrying is safe right up until the operation has a side effect, and then it isn’t. Make the operation idempotent, send an idempotency key, or don’t retry it. There’s no fourth choice that ends well.
Logging is the quiet one. The instinct is to log the full request and response on every retry, and then your logs fill with duplicated payloads, some of them carrying tokens. Log the attempt number, the status, and the endpoint. Not the body.
Timeouts catch people who think retries replace them. They don’t. The HTTP client timeout is per attempt, and your overall deadline belongs in the context. Those are two different clocks, and you want both, or one slow call holds your whole retry loop hostage.
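The two clocks can be sketched like this. The hangingTransport stub is hypothetical and simply blocks until the request’s context ends, standing in for a server that never answers; the 20ms and 100ms values are arbitrary and only chosen to keep the example fast.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// hangingTransport simulates a server that never responds: it blocks
// until the request's context is done, then returns that error.
type hangingTransport struct{}

func (hangingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	<-req.Context().Done()
	return nil, req.Context().Err()
}

// demo retries a hanging call until the overall deadline runs out and
// reports how many attempts fit inside it.
func demo() (attempts int, err error) {
	// Clock one: bounds a single attempt.
	client := &http.Client{
		Transport: hangingTransport{},
		Timeout:   20 * time.Millisecond,
	}
	// Clock two: bounds the whole operation, retries included.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	for ctx.Err() == nil {
		req, rerr := http.NewRequestWithContext(ctx, http.MethodGet, "http://example.invalid/slow", nil)
		if rerr != nil {
			return attempts, rerr
		}
		attempts++
		resp, derr := client.Do(req)
		if derr == nil {
			resp.Body.Close()
			return attempts, nil
		}
		err = derr // remember the last failure and go around again
	}
	return attempts, err
}

func main() {
	attempts, err := demo()
	fmt.Println(attempts, err)
}
```

Without the per-attempt timeout, the first hanging call would use up the entire deadline by itself.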
And the one worth saying out loud: retries are most dangerous in a chain. If service A makes up to three attempts against B, and B makes up to three attempts against C for each of those, a small hiccup in C can come back to C as nine times the load. Stack a few layers and you get a retry storm that keeps a system down well after the original cause is gone. The fix is a retry budget: cap retries to a small fraction of your request rate so a bad minute can’t multiply into an outage. Jitter handles the herd. A budget handles the chain. Real systems need both.
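One simple way to express a budget is a token account: every first attempt earns a fraction of a retry, and every retry spends a whole one. The Budget type below is my own minimal sketch of that idea, not a specific library’s API. Amounts are kept in thousandths of a retry so the arithmetic stays in integers.

```go
package main

import (
	"fmt"
	"sync"
)

// Budget limits retries to a fraction of first-attempt traffic.
// All amounts are in thousandths of a retry.
type Budget struct {
	mu         sync.Mutex
	balance    int // currently available
	perRequest int // earned by each first attempt (100 = 10%)
	max        int // cap on how much can be banked
}

func NewBudget(perRequest, max int) *Budget {
	return &Budget{perRequest: perRequest, max: max}
}

// RecordRequest credits the budget for one first attempt.
func (b *Budget) RecordRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.balance += b.perRequest
	if b.balance > b.max {
		b.balance = b.max
	}
}

// AllowRetry spends one retry if the budget can afford it.
func (b *Budget) AllowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.balance < 1000 {
		return false
	}
	b.balance -= 1000
	return true
}

func main() {
	b := NewBudget(100, 10_000) // retries limited to 10% of traffic, at most 10 banked
	for i := 0; i < 50; i++ {
		b.RecordRequest()
	}
	allowed := 0
	for i := 0; i < 20; i++ {
		if b.AllowRetry() {
			allowed++
		}
	}
	fmt.Println("retries allowed:", allowed)
}
```

Fifty requests at 10 percent earn five retries, so only five of the twenty retry requests get through and the rest fail fast.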
The full implementation is in the gist above. Clone it, run the tests, then break something and watch what the tests tell you. That’s how it stuck for me, and how it’ll stick for you. Happy coding.