A team ships their first email agent on a Thursday. Demo went great, handler’s deployed, webhook’s registered. Friday morning the on-call wakes up to an inbox where the agent has been enthusiastically replying to its own replies all night, a customer who received the same answer three times, and a thread in Gmail that’s shattered into five separate conversations. None of it was an exotic failure — every one of these is a known pitfall with a known fix, documented in the Nylas Agent Accounts cookbook (the product’s in beta; the mistakes are timeless).
Here are the nine I’d check before any launch.
1. The agent replies to itself
The message.created webhook fires for outbound messages too: when your agent sends a reply via the API, that sent message triggers the same event as inbound mail. Without a guard, you’ve built a perpetual motion machine: reply, webhook, reply.
Fix: filter the agent’s own address at the very top of the handler, before any other logic.
// Ignore events for mail the agent itself sent, before any other logic runs.
const sender = msg.from?.[0]?.email;
if (sender === AGENT_EMAIL) return;
2. No webhook deduplication
Delivery is at-least-once. If your endpoint doesn’t return 200 fast enough, or the network hiccups, the same message.created notification arrives again — and a naive handler replies twice. The dedup recipe calls this the most common source of duplicates.
Fix: an atomic check-and-set on the message ID before processing — INSERT ... ON CONFLICT DO NOTHING in Postgres, SET id 1 NX EX 86400 in Redis. Give records a TTL of 24–48 hours so late redeliveries still get caught without the table growing forever.
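Here is a minimal sketch of the check-and-set idea, using an in-memory map as a stand-in for Redis or Postgres. The names `SeenMessages`, `claim`, and `sweep` are hypothetical and not part of any SDK. In a real multi-instance deployment the store has to be shared, and the claim has to be a single atomic command such as the two mentioned above.

```typescript
// In-memory stand-in for an atomic "set if absent, with TTL" store.
// Assumption: production code backs this with Redis (SET ... NX EX) or
// Postgres (INSERT ... ON CONFLICT DO NOTHING); this class only shows the logic.
class SeenMessages {
  private expiresAt = new Map<string, number>();
  private ttlMs: number;

  constructor(ttlMs: number = 24 * 60 * 60 * 1000) {
    this.ttlMs = ttlMs;
  }

  // Returns true only for the first caller to claim this ID within the TTL.
  claim(messageId: string, now: number = Date.now()): boolean {
    const expiry = this.expiresAt.get(messageId);
    if (expiry !== undefined && expiry > now) return false; // duplicate delivery
    this.expiresAt.set(messageId, now + this.ttlMs);
    return true;
  }

  // Drop expired records so the map does not grow forever.
  sweep(now: number = Date.now()): void {
    this.expiresAt.forEach((expiry, id) => {
      if (expiry <= now) this.expiresAt.delete(id);
    });
  }
}
```

In a handler, the call would sit right after the self-reply filter: if `claim(msg.id)` returns false, acknowledge the webhook and return without doing any work.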
3. Dedup without locking
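Dedup is keyed on the message ID, so it only stops the same event from being handled twice. Two concurrent workers (Lambda instances, ECS tasks) can still both generate a reply on one thread: either the check and the write are not truly atomic and both workers race past them in the same millisecond, or two different inbound messages on the same thread each pass dedup legitimately. Dedup catches the same event delivered twice; it can’t catch two workers acting on the same thread simultaneously.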
Fix: a per-thread lock with a 30-second TTL so a crashed worker self-releases — and a double-check inside the lock that inspects the thread’s latest message and bails if the agent already replied. You need dedup and locking; they cover different failure modes.
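The lock plus double-check can be sketched roughly as follows. This is an in-process illustration only: `ThreadLocks`, `replyOnce`, and the callback parameters are hypothetical names, and a real deployment would keep the lock in shared storage (for example a Redis key set with NX and a 30-second expiry), because separate workers do not share memory.

```typescript
type Lock = { owner: string; expiresAt: number };

// Illustrative lock table; assumption: production uses a shared store instead.
class ThreadLocks {
  private locks = new Map<string, Lock>();
  private ttlMs: number;

  constructor(ttlMs: number = 30000) {
    this.ttlMs = ttlMs;
  }

  // Succeeds if the thread is unlocked or the previous lock has expired,
  // which is how a crashed worker's lock releases itself.
  acquire(threadId: string, owner: string, now: number = Date.now()): boolean {
    const held = this.locks.get(threadId);
    if (held !== undefined && held.expiresAt > now) return false;
    this.locks.set(threadId, { owner, expiresAt: now + this.ttlMs });
    return true;
  }

  // Only the current owner may release, so a slow worker cannot free
  // a lock that has since been taken over by someone else.
  release(threadId: string, owner: string): void {
    const held = this.locks.get(threadId);
    if (held !== undefined && held.owner === owner) this.locks.delete(threadId);
  }
}

// latestSender is a hypothetical callback returning the sender of the
// thread's newest message; send is whatever actually delivers the reply.
function replyOnce(
  locks: ThreadLocks,
  threadId: string,
  workerId: string,
  latestSender: () => string | undefined,
  agentEmail: string,
  send: () => void,
  now: number = Date.now(),
): "sent" | "locked" | "already-replied" {
  if (!locks.acquire(threadId, workerId, now)) return "locked";
  try {
    // Double-check inside the lock: bail if the agent already answered.
    const latest = latestSender();
    if (latest !== undefined && latest.toLowerCase() === agentEmail.toLowerCase()) {
      return "already-replied";
    }
    send();
    return "sent";
  } finally {
    locks.release(threadId, workerId);
  }
}
```

The sketch is synchronous to keep it short; with a real mailbox lookup and send, both callbacks would be async and the lock TTL would need to comfortably exceed the time a reply takes to generate.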
4. Trusting the webhook payload for the message body
The webhook carries summary fields — subject, from, snippet — not the full body. Worse, if a body exceeds roughly 1 MB, the event type becomes message.created.truncated and the body is omitted entirely. Agents that parse the payload directly work in testing and fail on real-world mail.
Fix: always fetch the full message from the API using the ID in the payload, as the reply-handling recipe does, and handle the truncated event type explicitly.
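A rough sketch of that handler shape, treating both event types the same way and never reading the body from the payload. The payload field names (`data.object.id`, `data.object.grant_id`) and the `FetchMessage` callback are assumptions for illustration; check them against the actual webhook schema. With the Node SDK the callback would presumably wrap a call like `nylas.messages.find`.

```typescript
// Minimal shapes for this sketch; everything beyond `type` is an assumption.
type WebhookEvent = {
  type: string;
  data: { object: { id: string; grant_id: string } };
};
type FullMessage = { id: string; body: string };
type FetchMessage = (grantId: string, messageId: string) => Promise<FullMessage>;

// Both event types carry a usable message ID, so both lead to a fetch.
const MESSAGE_EVENTS = ["message.created", "message.created.truncated"];

// Pull out only the identifiers needed to fetch the message.
function messageRef(event: WebhookEvent): { grantId: string; messageId: string } | null {
  if (MESSAGE_EVENTS.indexOf(event.type) === -1) return null;
  const { id, grant_id } = event.data.object;
  return { grantId: grant_id, messageId: id };
}

// Always load the body through the API, never from the webhook payload.
async function loadFullMessage(
  event: WebhookEvent,
  fetchMessage: FetchMessage,
): Promise<FullMessage | null> {
  const ref = messageRef(event);
  if (ref === null) return null; // not a message event; nothing to fetch
  return fetchMessage(ref.grantId, ref.messageId);
}
```

Routing the truncated event through the same path means large messages behave exactly like small ones, which is the point: the payload is only ever a pointer.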
5. Replies that don’t thread
Send a “reply” as a fresh message and it lands as a disconnected email in the recipient’s client — no quoted context, no conversation grouping. Multiply by a few turns and the customer is hunting through five fragments of one discussion.
Fix: pass reply_to_message_id on every reply. That makes the platform set the In-Reply-To and References headers so the message threads correctly in Gmail, Outlook, and the agent’s own mailbox. Match incoming replies by thread_id, never by subject line — subjects get edited, and two different threads can share one.
await nylas.messages.send({
  identifier: AGENT_GRANT_ID,
  requestBody: {
    replyToMessageId: msg.id,
    to: [{ email: sender }],
    body: replyBody,
  },
});
6. Replying instantly to every message
Humans send corrections. A recipient fires off a reply, spots a mistake, and sends a follow-up fifteen seconds later — and your agent has already answered the first message, so now it answers the second too, and the conversation forks.
Fix: a 30–60 second cooldown before responding in active threads, batching consecutive inbound messages into one considered reply.
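One way to sketch the cooldown is a small per-thread buffer that restarts its timer on every inbound message and only hands a thread back once it has been quiet for the full window. `ReplyCooldown`, `noteInbound`, and `takeDue` are hypothetical names, and the sketch assumes something else (a scheduled tick or a delayed queue job) calls `takeDue` periodically.

```typescript
type Pending = { messageIds: string[]; lastInboundAt: number };

class ReplyCooldown {
  private pending = new Map<string, Pending>();
  private cooldownMs: number;

  constructor(cooldownMs: number = 45000) {
    this.cooldownMs = cooldownMs;
  }

  // Record an inbound message and restart the thread's quiet-period timer.
  noteInbound(threadId: string, messageId: string, now: number = Date.now()): void {
    const entry = this.pending.get(threadId) ?? { messageIds: [], lastInboundAt: now };
    entry.messageIds.push(messageId);
    entry.lastInboundAt = now;
    this.pending.set(threadId, entry);
  }

  // Return every thread that has been quiet for the full cooldown, together
  // with all messages batched in the meantime, and forget those threads.
  takeDue(now: number = Date.now()): Array<{ threadId: string; messageIds: string[] }> {
    const due: Array<{ threadId: string; messageIds: string[] }> = [];
    this.pending.forEach((entry, threadId) => {
      if (now - entry.lastInboundAt >= this.cooldownMs) {
        due.push({ threadId, messageIds: entry.messageIds });
      }
    });
    for (const d of due) this.pending.delete(d.threadId);
    return due;
  }
}
```

Each batch returned by `takeDue` becomes one reply that considers every message in `messageIds`, so a correction sent fifteen seconds after the original is answered together with it.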
7. No outbound circuit breaker
Even with dedup, locking, and self-filtering, a logic bug can still produce a reply storm — and an autonomous sender fails at machine speed. This is the safety net the dedup recipe says not to ship without.
Fix: a per-thread send budget. If the agent has sent 3 or more messages on one thread within 5 minutes, stop sending and escalate to a human. A rate limit triggering is a page; a runaway agent is an apology tour.
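The budget above can be sketched as a sliding-window counter per thread. `SendBudget` and `tryConsume` are hypothetical names, and the in-memory map is an assumption for brevity; with more than one worker the counters would need to live in a shared store.

```typescript
class SendBudget {
  private sends = new Map<string, number[]>();
  private maxSends: number;
  private windowMs: number;

  constructor(maxSends: number = 3, windowMs: number = 5 * 60 * 1000) {
    this.maxSends = maxSends;
    this.windowMs = windowMs;
  }

  // Returns true and records the send if the thread is under budget.
  // Returns false when the caller should stop sending and escalate.
  tryConsume(threadId: string, now: number = Date.now()): boolean {
    // Keep only the timestamps that still fall inside the window.
    const recent = (this.sends.get(threadId) ?? []).filter(
      (sentAt) => now - sentAt < this.windowMs,
    );
    if (recent.length >= this.maxSends) {
      this.sends.set(threadId, recent);
      return false; // breaker open: hand the thread to a human
    }
    recent.push(now);
    this.sends.set(threadId, recent);
    return true;
  }
}
```

Call `tryConsume(threadId)` immediately before every send, regardless of which code path produced the reply. The breaker only works as a safety net if nothing can send around it.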
8. Letting junk wake the agent
Spam, bounce-backs, and out-of-office auto-replies all fire message.created. If every one of them reaches your LLM, you’re paying inference costs to reason about garbage — and risking the agent answering it.
Fix: push filtering below your application using rules. A block rule rejects known-bad senders at the SMTP level so your code never sees the message; assign_to_folder routes automated notifications away from the inbox so your handler can skip folders the agent shouldn’t answer. Rules run in priority order (0–1000, lower first), so put specific matches before broad contains rules — the first matching block is terminal.
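The handler-side half of this, skipping mail that a rule has already routed out of the inbox, might look like the sketch below. It assumes the fetched message exposes the IDs of the folders it sits in as a string array, and that you keep your own list of folder IDs the agent should never answer from; `inIgnoredFolder` and both parameter names are hypothetical.

```typescript
// Returns true if the message sits in any folder the agent should not answer.
// Assumption: `folders` comes from the fetched message, and `ignoredFolderIds`
// holds the IDs your assign_to_folder rules route automated mail into.
function inIgnoredFolder(folders: string[] | undefined, ignoredFolderIds: string[]): boolean {
  return (folders ?? []).some((id) => ignoredFolderIds.indexOf(id) !== -1);
}
```

Run this check after fetching the full message and before calling the LLM, so routed-away mail costs one API read and no inference.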
9. Treating a blocked send as a retryable error
If your workspace has outbound rules, a send matching a block rule returns 403 — and no retry will ever deliver it, because the rule rejected it before the provider was involved. An agent with generic retry logic will hammer that send forever and report a flaky network.
Fix: treat 403 on send as terminal. Log it, then query GET /v3/grants/{grant_id}/rule-evaluations to see exactly which rule matched and what data was evaluated — that endpoint is the fastest answer to “why didn’t this send?”
There’s one nuance worth encoding in your error handler. Rule evaluation fails closed: if a block rule can’t be evaluated because of a transient infrastructure problem (say, a list lookup failure during in_list matching), the send is blocked anyway — but it comes back as a 503, not a 403, and the audit record carries blocked_by_evaluation_error: true. So the rule is simple: retry 503, never retry 403. Conflating the two is how agents either give up on deliverable mail or hammer undeliverable mail.
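A minimal sketch of that rule as code. Only the 403 and 503 behavior comes from the text above; the names `classifySendStatus` and `sendWithPolicy` are hypothetical, and every other status is deliberately reported as "other" so the caller decides what to do with it.

```typescript
type SendOutcome = "ok" | "terminal" | "retry" | "other";

function classifySendStatus(status: number): SendOutcome {
  if (status >= 200 && status < 300) return "ok";
  if (status === 403) return "terminal"; // blocked by a rule: never retry
  if (status === 503) return "retry"; // evaluation failed closed: retry
  return "other"; // not covered by the rule above; left to the caller
}

// attemptSend is a hypothetical callback that performs one send and returns
// the HTTP status. It is synchronous here only to keep the sketch short;
// a real version would be async and back off between attempts.
function sendWithPolicy(
  attemptSend: () => number,
  maxAttempts: number = 3,
): { outcome: SendOutcome; attempts: number } {
  let outcome: SendOutcome = "retry";
  let attempts = 0;
  while (attempts < maxAttempts && outcome === "retry") {
    attempts += 1;
    outcome = classifySendStatus(attemptSend());
  }
  return { outcome, attempts };
}
```

A result of "terminal" is the cue to log and query the rule-evaluations endpoint; a result of "retry" after the last attempt means the retries ran out and the send should be escalated rather than dropped silently.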
The pattern across all nine: email agents fail at the seams between at-least-once infrastructure and autonomous action, not in the LLM prompt. The fixes are boring — a filter, a lock, a cap, a rule — and that’s the point. Boring is what you want standing between a language model and a real human’s inbox.
Turn these nine items into a pre-launch checklist, and make sure your load test specifically exercises #2 and #3 by firing duplicate webhooks from concurrent connections. Which of these has bitten you in production, and was the fix on this list?