US-East-1: When the Titanic Sinks

Lessons from the recent AWS outage.

When the Internet Paused

It started with confusion. At 7:40 AM BST on October 20, 2025, a Monday morning like any other, people around the world reached for their phones and found… nothing. Duolingo wouldn’t load—goodbye, 500-day streak. Snapchat refused to open. Ring doorbells went blind. Wordle players stared at blank screens, their morning ritual interrupted. Coinbase users couldn’t check their crypto portfolios. Even Amazon’s own shopping site was struggling.

On Twitter (somehow still working), the complaints began flooding in. “Is it just me or is everything down?” Thousands asked the same question simultaneously. Within minutes, Downdetector lit up like a Christmas tree—over 50,000 reports cascaded across services that seemingly had nothing in common. Banking apps, dating platforms, learning tools, gaming servers, university systems, airline websites—all frozen.

Then someone noticed the pattern. They all ran on AWS.

The realization spread like wildfire: Amazon Web Services, the invisible backbone supporting roughly 30% of the internet, had suffered a catastrophic failure in its US-EAST-1 region—the data center cluster in Northern Virginia that serves as the internet’s beating heart. This wasn’t just an outage. This was a demonstration of how fragile our hyper-connected world had become.

Anatomy of a Cascade: When DNS Forgot How to Speak

The technical autopsy reveals a failure that started small and metastasized into chaos through a perfect storm of dependencies. At 12:11 AM PDT (3:11 AM ET), AWS engineers first detected increased error rates and latencies across multiple services in US-EAST-1. What they didn’t yet know was that they were watching the opening act of a multi-hour digital catastrophe.

The First Domino: DynamoDB’s DNS Resolution Failure

At the center of the crisis sat DynamoDB, AWS’s managed NoSQL database service. DynamoDB isn’t just another database—it’s a foundational service upon which dozens of other AWS services depend. It stores configuration data, deployment states, and critical metadata that keeps the AWS ecosystem functioning. Think of it as the nervous system of the cloud.

The problem began with AWS’s internal Domain Name System (DNS)—the phone book of the internet that translates human-readable names like “dynamodb.us-east-1.amazonaws.com” into the IP addresses that computers actually use to find each other. According to AWS’s post-mortem, a subsystem failure in the network load balancer (NLB) caused health check updates to fail. When load balancers can’t properly track which servers are healthy, they make catastrophic decisions—marking perfectly functional servers as offline and corrupting DNS records in the process.

Suddenly, services across AWS couldn’t find DynamoDB’s API endpoint. DNS queries that should have returned valid IP addresses instead returned nothing—or worse, pointed to servers that were incorrectly marked as unavailable. Applications tried to connect, waited, timed out, and crashed.
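
To make the failure mode concrete, here is a minimal Python sketch of what dependent services were effectively experiencing: a hostname that suddenly stops resolving, so every connection attempt fails before a single byte reaches DynamoDB. The endpoint name is real; the retry loop is purely illustrative and not AWS's actual client code.

```python
import socket
import time

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def resolve_with_retry(hostname, attempts=3, backoff=1.0):
    """Try to resolve a hostname, backing off between attempts.

    During the outage, lookups like this returned no usable records,
    so clients timed out or errored long before reaching DynamoDB.
    """
    for attempt in range(1, attempts + 1):
        try:
            infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
            return [info[4][0] for info in infos]  # the resolved IP addresses
        except socket.gaierror as exc:
            print(f"attempt {attempt}: DNS resolution failed ({exc})")
            time.sleep(backoff * attempt)  # back off before retrying
    return []

if __name__ == "__main__":
    ips = resolve_with_retry(ENDPOINT)
    print(ips or "no addresses resolved -- the calling service is now blind")
```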

The Cascade Begins: When Everything Depends on Everything

Here’s where the architecture of modern cloud computing revealed its Achilles’ heel. DynamoDB’s failure didn’t stay contained—it couldn’t. AWS Lambda, the serverless computing platform that powers millions of applications, depends on DynamoDB to store function configurations and deployment metadata. When Lambda couldn’t reach DynamoDB, it couldn’t deploy new functions or update existing ones. Serverless became useless.

Amazon EC2 (Elastic Compute Cloud), the virtual server service that millions of companies rent, relies on DynamoDB for instance launch metadata. Suddenly, developers couldn’t spin up new servers. Applications couldn’t auto-scale. The cloud stopped being elastic.

Amazon S3, the object storage service that hosts everything from Netflix videos to corporate backups, uses DynamoDB for internal state management. S3 requests started failing. CloudFront, AWS’s content delivery network, couldn’t serve cached content properly. API Gateway stopped routing requests. Amazon SQS message queues backed up. Kinesis Data Streams froze. IAM Identity Center couldn’t authenticate users.

In total, 113 AWS services experienced degradation or outages—not because they each had individual failures, but because they all depended on the same broken foundation. It was a textbook example of cascading failure: one malfunction triggering a chain reaction through a tightly coupled system.

The Long Road to Recovery

By 1:26 AM PDT, AWS engineers had pinpointed DynamoDB as the epicenter. By 2:01 AM, they’d identified the DNS resolution issue as the root cause. But knowing the problem and fixing it are different challenges entirely.

AWS implemented a mitigation at 2:24 AM PDT—just over two hours after the initial detection. But the internet doesn’t heal instantly. Even after the DNS issues were resolved, AWS faced a massive backlog of failed requests, broken connections, and services that had crashed and needed restarting.

To prevent overwhelming the recovering infrastructure, AWS made the difficult decision to throttle EC2 instance launches—intentionally slowing down the rate at which new servers could start. This meant that even as some services recovered, others remained impaired. Universities couldn’t access Canvas. Airlines had booking system problems. Financial platforms kept users locked out.

It wasn’t until 3:01 PM PDT—nearly 15 hours after the initial incident—that AWS declared all services fully operational again. But the damage was done. Millions of users had lost access to critical services. Businesses had lost revenue. Trust had been shaken.

The Human Factor: When Expertise Walks Out the Door

But perhaps the most troubling aspect of this outage isn’t the technical failure itself—it’s what it reveals about the human infrastructure behind our digital infrastructure.

Industry observers like Corey Quinn, a prominent AWS critic and cloud economist, argue that this outage is a symptom of a deeper organizational problem: Amazon’s talent exodus.

Between 2022 and 2025, over 27,000 Amazon employees were impacted by layoffs. While it’s unclear exactly how many were AWS engineers, internal documents reportedly showed that Amazon suffers from 69% to 81% “regretted attrition”—departures of people the company would rather have kept.

The early engineers who built AWS’s foundational services understood the deep failure modes of these systems intimately. They knew where the single points of failure lurked. They understood the cascading dependencies. When these senior engineers leave—whether through layoffs, frustration with return-to-office mandates, or simply better opportunities elsewhere—they take that institutional knowledge with them.

The newer, leaner teams may be less expensive on paper, but they lack the hard-won wisdom that comes from building and babysitting these systems through previous crises. Detection took longer. Diagnosis took longer. The blast radius was larger than it needed to be.

Who Could Have Prevented This?

The uncomfortable truth is that this failure had multiple points where intervention was possible:

The Architects:
The original system designers could have built more resilient DNS infrastructure with true multi-path redundancy, ensuring that no single load balancer failure could corrupt critical DNS records. That a DynamoDB failure could cripple the entire region is an uncomfortable finding, and one that sits awkwardly alongside AWS's own advertised architecture principles.

The SREs (Site Reliability Engineers):
They could have implemented better circuit breakers and fail-safes to prevent cascading failures from propagating so completely. When DynamoDB goes down, dependent services should degrade gracefully, not catastrophically.

The Engineering Managers:
They should have pushed back against the talent exodus, arguing that experience and institutional knowledge aren’t luxuries—they’re necessities for operating internet-scale infrastructure.

The Executives:
They made the decision to prioritize cost-cutting over operational resilience. When you treat engineers as interchangeable resources rather than keepers of critical knowledge, this is the inevitable result.

The problem wasn’t a single bad deployment or a rookie mistake. The problem was systemic:

A centralized architecture maintained by an under-resourced team at an organization prioritizing short-term efficiency over long-term resilience

The Cloud Provider Reality Check

Before digging further, let’s address the elephant in the room: is AWS uniquely unreliable, or is this just the reality of cloud computing at scale?

The data tells a sobering story. According to a comprehensive study by Cherry Servers analyzing incidents between August 2024 and August 2025, AWS outages averaged 1.5 hours in duration—the shortest among the major cloud providers. Microsoft Azure, by contrast, averaged 14.6 hours per outage, nearly ten times longer. Google Cloud fell in between at 5.8 hours. Even more striking, an earlier analysis from 2018-2019 found that Azure reported 1,934 hours of total downtime compared to AWS’s 338 hours during the same period.

Azure has experienced its share of spectacular failures. In July 2024, a configuration update caused a nearly 8-hour global outage affecting Virtual Machines and Cosmos DB. Later that same month, Azure suffered another multi-hour disruption impacting Microsoft 365, Teams, and Outlook. Most dramatically, Azure’s China North 3 region experienced a 50-hour outage in late 2024—longer than this entire AWS incident. A 2024 Parametrix Cloud Outage Risk Report confirmed that AWS remained the most reliable of the “big three” cloud providers, though Azure showed improvement from the previous year while Google Cloud’s critical downtime increased by 57%.

The point isn’t to declare AWS superior—every major cloud provider has catastrophic failures. Rather, it’s to emphasize that no cloud provider is infallible. AWS’s market dominance (roughly 30% of global cloud infrastructure) means its outages impact more services and make bigger headlines, but Azure and Google Cloud users aren’t immune to similar disasters. The October 20 AWS outage was severe, but it’s part of a broader pattern affecting the entire industry: complex distributed systems operated by humans will occasionally fail in spectacular ways, regardless of the logo on the status page.

This reality reinforces the core lesson: relying on any single cloud provider—no matter how reliable their historical track record—is an architectural risk you cannot afford.

Lessons for the Rest of Us: Building for the Inevitable

If you’re running your infrastructure on AWS—or any cloud provider—this outage should be your wake-up call. Not because AWS is uniquely unreliable (it’s not), but because centralized infrastructure creates centralized risks.

The Multi-Region Imperative

The most critical lesson is this: single-region deployments are not production-ready architectures. They are prototypes waiting to fail.

Yes, multi-region deployments are complex. Yes, they cost more. Yes, they require sophisticated data replication strategies and intelligent load balancing. But you know what’s more expensive? Your entire service going down for 15 hours because US-EAST-1 had a bad day.

Here’s what multi-region resilience looks like in practice:

Active-Active Deployments:
Run your application simultaneously in multiple AWS regions (US-EAST-1, US-WEST-2, EU-WEST-1). Use Route 53 or a similar global load balancer to distribute traffic based on health checks and latency. When one region fails, traffic automatically flows to healthy regions.
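
As a rough sketch of that pattern (not AWS's own configuration), the boto3 snippet below creates a health check per regional endpoint and publishes latency-based records for each, so Route 53 only routes users to regions that are currently passing their checks. The hosted zone ID, domain, and IP addresses are placeholders.

```python
import uuid
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000EXAMPLE"   # hypothetical hosted zone for app.example.com
DOMAIN = "app.example.com"
REGION_ENDPOINTS = {                  # illustrative per-region public IPs
    "us-east-1": "203.0.113.10",
    "us-west-2": "198.51.100.20",
}

changes = []
for region, ip in REGION_ENDPOINTS.items():
    # One HTTPS health check per regional endpoint.
    health = route53.create_health_check(
        CallerReference=str(uuid.uuid4()),
        HealthCheckConfig={
            "IPAddress": ip,
            "Port": 443,
            "Type": "HTTPS",
            "ResourcePath": "/healthz",
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )
    # Latency-based record: Route 53 sends users to the closest *healthy* region.
    changes.append({
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": DOMAIN,
            "Type": "A",
            "SetIdentifier": f"{DOMAIN}-{region}",
            "Region": region,
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
            "HealthCheckId": health["HealthCheck"]["Id"],
        },
    })

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": changes},
)
```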

Cross-Region Data Replication:
Use DynamoDB Global Tables, RDS cross-region read replicas, or S3 cross-region replication to ensure your data exists in multiple regions. Yes, this introduces consistency challenges. Yes, you need to think carefully about eventual consistency models. But regional isolation means regional failures stay regional.
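
For DynamoDB specifically, adding a replica region is a single API call once the table meets Global Tables' prerequisites (streams enabled, the newer 2019.11.21 version). A minimal, illustrative boto3 sketch with a hypothetical table name:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

TABLE = "orders"  # hypothetical table that must survive a regional outage

# Ask DynamoDB to maintain a full replica of the table in us-west-2.
# Replication is asynchronous, so reads in the replica region are
# eventually consistent with writes made in us-east-1.
dynamodb.update_table(
    TableName=TABLE,
    ReplicaUpdates=[
        {"Create": {"RegionName": "us-west-2"}},
    ],
)

# Wait for the table to report ACTIVE again before relying on the replica.
waiter = dynamodb.get_waiter("table_exists")
waiter.wait(TableName=TABLE)

status = dynamodb.describe_table(TableName=TABLE)["Table"]
print(status["TableStatus"], [r["RegionName"] for r in status.get("Replicas", [])])
```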

Regular Failover Drills:
It’s not enough to build multi-region infrastructure. You need to actually test it. Schedule quarterly disaster recovery exercises where you deliberately kill an entire region and verify that your application stays up. If you can’t survive AWS forcibly terminating US-EAST-1, you won’t survive an actual outage.
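
One lightweight way to run such a drill, sketched below with a hypothetical health check ID and endpoint, is to temporarily invert the primary region's Route 53 health check so it reports unhealthy, let traffic drain to the secondary region, and confirm with your own synthetic probe that the application stays up. This is an illustration of the idea, not an official game-day procedure.

```python
import time
import urllib.request
import boto3

route53 = boto3.client("route53")

PRIMARY_HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"  # hypothetical
APP_URL = "https://app.example.com/healthz"                        # hypothetical

def app_is_up():
    """Synthetic probe from the operator's vantage point."""
    try:
        return urllib.request.urlopen(APP_URL, timeout=5).status == 200
    except Exception:
        return False

# 1. Mark the primary region unhealthy: Inverted=True flips the check result,
#    so Route 53 stops routing traffic there even though the region is fine.
route53.update_health_check(HealthCheckId=PRIMARY_HEALTH_CHECK_ID, Inverted=True)

try:
    # 2. Give DNS TTLs and resolvers time to converge, then verify the
    #    application still answers from the secondary region.
    time.sleep(120)
    assert app_is_up(), "Failover drill failed: application unreachable"
    print("Drill passed: traffic served without the primary region")
finally:
    # 3. Always restore the health check, even if the assertion fails.
    route53.update_health_check(HealthCheckId=PRIMARY_HEALTH_CHECK_ID, Inverted=False)
```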

Beyond AWS: Multi-Cloud Strategies

For truly critical systems, consider going further: multi-cloud architectures that span AWS, Google Cloud Platform, and Microsoft Azure. This is genuinely difficult—each cloud provider has different APIs, different networking models, different services. But it’s the only way to truly avoid single-provider risk.

Container orchestration platforms like Kubernetes can help here. When you package your application in containers and use cloud-agnostic storage abstractions, you can potentially run the same workload on any cloud provider. It won’t protect you from application-level bugs, but it will protect you from provider-level catastrophes.

Circuit Breakers and Graceful Degradation

Your application should be designed to function—even in a limited capacity—when dependencies fail. If your authentication service can’t reach its database, it should fail open (temporarily allowing access) or fail closed (temporarily blocking everyone) based on your security model, but it shouldn’t crash your entire application.

Implement circuit breakers using libraries like Hystrix or Resilience4j. When a dependency starts failing, the circuit breaker stops trying to call it, preventing cascading timeouts and allowing your application to serve cached data or degrade functionality instead of dying completely.
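
Hystrix and Resilience4j are Java libraries, but the pattern is language-agnostic. Here is a minimal hand-rolled Python sketch of the concept (not either library's API): after enough consecutive failures the breaker opens, calls are short-circuited to a fallback, and after a cooldown one trial call is allowed through.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after repeated failures, retry after a cooldown."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, fallback=None, **kwargs):
        # While open, skip the real call entirely until the cooldown expires.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback() if fallback else None
            self.opened_at = None        # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
            self.failures = 0            # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            return fallback() if fallback else None

# Usage sketch (function names are hypothetical):
#   breaker = CircuitBreaker()
#   profile = breaker.call(fetch_profile_from_dynamodb, user_id,
#                          fallback=lambda: cached_profile(user_id))
```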

Chaos Engineering: Breaking Things on Purpose

The Netflix engineering team pioneered this approach: deliberately inject failures into your production systems to verify that they’re resilient. Tools like Chaos Monkey randomly terminate instances. Chaos Kong simulates entire region failures. Latency Monkey introduces artificial delays.

This sounds terrifying, but it’s actually safer than the alternative. Better to discover your single points of failure during a controlled experiment than during a real crisis when your customers are screaming and your CEO is on the phone.
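
A chaos-monkey-style experiment can be surprisingly little code. The boto3 sketch below is an illustration, not Netflix's tooling: it terminates one random running instance from a fleet that has explicitly opted in via a hypothetical chaos-opt-in tag.

```python
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Only instances explicitly opted in to chaos experiments are eligible.
CHAOS_TAG = {"Name": "tag:chaos-opt-in", "Values": ["true"]}

def terminate_random_instance():
    """Pick one running, opted-in instance and terminate it."""
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[CHAOS_TAG, {"Name": "instance-state-name", "Values": ["running"]}]
    )
    instance_ids = [
        inst["InstanceId"]
        for page in pages
        for reservation in page["Reservations"]
        for inst in reservation["Instances"]
    ]
    if not instance_ids:
        return None
    victim = random.choice(instance_ids)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim

if __name__ == "__main__":
    victim = terminate_random_instance()
    print(f"terminated {victim}" if victim else "no opted-in instances found")
```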

Monitoring, Observability, and Runbooks

You need comprehensive visibility into your system’s health across all dependencies. Use distributed tracing (OpenTelemetry, Jaeger) to understand how requests flow through your services. Set up synthetic monitoring to continuously probe your system from multiple global locations. Create detailed runbooks that your on-call engineers can follow during outages—including specific escalation procedures for when the cloud provider itself is the problem.
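
As a starting point, here is a minimal OpenTelemetry sketch in Python that wraps a dependency call in its own span (the service and span names are hypothetical). A real deployment would export spans to a collector via OTLP rather than printing them to the console.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Print spans to stdout for the sketch; a real setup would use an OTLP exporter
# pointed at a collector (Jaeger, Tempo, or a vendor backend).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")   # hypothetical service name

def fetch_order(order_id):
    # Each dependency call gets its own span, so a slow or failing DynamoDB
    # lookup shows up as a clearly attributed segment in the trace.
    with tracer.start_as_current_span("dynamodb.get_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("aws.region", "us-east-1")
        # ... the real DynamoDB call would go here ...
        return {"order_id": order_id, "status": "stubbed"}

if __name__ == "__main__":
    fetch_order("o-12345")
```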

The Cultural Shift: Expecting Failure

Finally, and perhaps most importantly, you need to change how your organization thinks about failure. In a cloud-native, microservices world, failures aren’t anomalies—they’re constants. Servers die. Networks partition. Cloud regions go offline. Your architecture must assume that everything will fail eventually, and design accordingly.

This is the philosophy behind AWS’s own “Well-Architected Framework,” which emphasizes designing for failure. The irony of AWS itself experiencing a massive failure isn’t lost on anyone, but the principle remains sound: build your systems assuming that any component can fail at any time, and make sure your architecture survives anyway.

Don’t Panic: You’re In Good Company

And now, let’s take a breath and acknowledge something comforting: if your code fails in production, you’re not alone.

Amazon, with its thousands of brilliant engineers, its billions in infrastructure investment, and its decades of operational experience, just took down services across roughly 30% of the world's cloud infrastructure for the better part of a day because of a DNS failure. If the company that literally invented modern cloud computing can have a catastrophic outage, then yes, you’re allowed to feel less bad about that bug you pushed to production last week that made the login button disappear.

The smartest engineers at Google have accidentally deleted customer data. Facebook has gone offline globally. Microsoft Azure has had similar cascading failures. Apple’s iCloud has had multi-day outages. This isn’t about competence—it’s about the inherent complexity of distributed systems operating at planetary scale.

Every engineer who’s ever worked on production systems has a story. That time they accidentally DDoS’d their own database. That time they deployed a config change that disabled monitoring so they couldn’t see the disaster unfolding. That time they fat-fingered a command and deleted a table in production instead of staging. We’ve all been there.

The point isn’t to avoid failure—that’s impossible. The point is to fail gracefully, learn quickly, recover faster, and build systems that can withstand the next inevitable catastrophe. Document your postmortems. Share your lessons learned. Build those circuit breakers. Set up those health checks. And when things break anyway—because they will—remember that even AWS has bad days.

So the next time you get paged at 3 AM because something you deployed broke production, take comfort in knowing that somewhere, some AWS engineer is also getting paged because their load balancer forgot how to health check and took down half the internet. We’re all just trying to keep the digital world spinning, one incident at a time.

And hey, at least your outage probably didn’t make international news.

The AWS outage of October 20, 2025, serves as a stark reminder that no system is infallible and no cloud provider is immune to failure. The best we can do is learn from these incidents, build more resilient architectures, and support each other when things inevitably go wrong. Because in the end, we’re all in this together—engineers, operators, and users alike—navigating the beautiful, fragile complexity of the modern internet.
