AWS’s Thermal Event Is a Warning: Disaster Recovery Plans Must Account for Physical Cloud Failure

AWS’s thermal event in US-East-1 took down Coinbase, FanDuel, CME Group and more. Here’s what that means for your DR plan.

AWS’s latest “thermal event” in US-East-1 took down Coinbase, FanDuel, CME Group and more for nearly two days because a data center overheated. The bigger story, though, is not the outage itself but what it signals: the cloud is entering a new era of physical infrastructure failures. Most disaster recovery plans still aren’t built for it, and enterprises don’t appear to be treating it as a systemic risk. They’re waiting for their cloud provider to fix physical issues rather than being honest about what true resiliency takes.

What happened?

For nearly two days, servers in one AWS availability zone (AZ) in US-East-1, use1-az4, were shut down due to overheating. Although just one AZ overheated, the AWS Health Dashboard reported completely disrupted or degraded EC2 instances, EBS volumes, Redshift clusters and other services across multiple AZs in the Northern Virginia region. The outage hit crypto, compute, sports betting and derivatives exchange platforms, including Coinbase, Heroku, Modal, CME Group and FanDuel. Millions of users felt this, all due to an overheated rack in a data center.

Multi-AZ isn’t the safety net we assumed

When physical limitations raise availability concerns within a data center, the simple answer (and the one hyperscalers recommend most) has always been: just go multi-AZ. This works for small, contained physical hiccups in a particular AZ because your VMs fail over and, most of the time, your users barely notice anything. But a prolonged physical outage, like the one we had over the weekend, is different.

We saw this in the March AWS regional outage, when a drone hit an AZ in the Gulf Region. A large number of workloads were redirected from the impaired AZ to a remaining one. Dependency issues started to emerge and overload became the new problem. In this case, AWS was vague on why multiple AZs were affected, but we can read between the lines and conclude: Multi-AZ just isn’t reliable anymore.
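
The overload effect is simple arithmetic. The sketch below is only an illustration: it assumes load is spread evenly across AZs and that survivors absorb the lost AZ's share equally, which real deployments rarely match exactly. The function names and the 70% figure are hypothetical.

```python
def utilization_after_az_loss(az_count: int, utilization: float, lost: int = 1) -> float:
    """Per-AZ utilization once the load of `lost` AZs is spread evenly over the survivors."""
    if not 0 < lost < az_count:
        raise ValueError("lost must be at least 1 and smaller than az_count")
    return utilization * az_count / (az_count - lost)


def max_safe_utilization(az_count: int, lost: int = 1) -> float:
    """Highest steady-state utilization that still fits after losing `lost` AZs."""
    if not 0 < lost < az_count:
        raise ValueError("lost must be at least 1 and smaller than az_count")
    return (az_count - lost) / az_count


# Hypothetical fleet running at 70% utilization in every AZ.
for azs in (2, 3, 4):
    after = utilization_after_az_loss(azs, 0.70)
    ceiling = max_safe_utilization(azs)
    print(f"{azs} AZs at 70%: survivors need {after:.0%}, safe ceiling is {ceiling:.0%}")
```

Under these assumptions, a three-AZ deployment running at 70% needs more than 100% of the surviving capacity once one AZ disappears, and everyone else in the region is trying to fail over into the same AZs at the same time.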

Coinbase learned this the hard way. Brian Armstrong, CEO of Coinbase, openly acknowledged on X that they’d been dependent on a single AZ, and defended it by pointing to the tradeoffs that come with redundancy.

“We design our services to be redundant to downtime in any one AWS AZ — this can introduce latency delays that are not desirable along with breaking customer co-location.”

It’s an honest admission, and it reflects a calculation that a lot of enterprises are (unfortunately) making. Cross-region redundancy costs more. It adds latency. It complicates compliance. And so teams decide the risk is acceptable, right up until their customers can’t access their cryptocurrency trading platform for over a day.

AI workloads and the physical reality of the cloud

For years, DR planning focused primarily on software failures, ransomware, and isolated infrastructure bugs. The assumption was that hyperscalers could absorb physical risk behind the scenes. The drone strike back in March was a very unexpected physical blow to data centers. And as AI workloads rapidly increase power density inside data centers, we’re seeing just how much the cloud depends on its physical components.

Now, we are reminded that data centers need to be cooled. The reality is that AI is quickly changing the physical dependencies of the cloud. Data center cooling systems have serious limits as AI workloads are pushing power densities that current thermal engineering simply wasn’t designed to handle.

AI is concentrating enormous compute (and therefore enormous heat) in every single region. US-East-1 happens to be one of the most popular regions and, in this scenario, cooling couldn’t keep up. And when physical limits are pushed, the fixes aren’t like bug fixes: they are prolonged repairs that follow their own timelines. Most DR plans are built for quick software bugs or patches, not longer hardware fixes.

Meanwhile, AWS’s own guidance during the last Gulf region outage, when a fire broke out, was to back up to another region. That recommendation arrived after the fact, when many customers couldn’t even access their data to act on it.

The threat model must expand and plan for the unfamiliar in every region

Business continuity teams have traditionally planned for ransomware, code bugs, and the occasional natural disaster. Now the plan has to include cooling failures, power density limits and geopolitical risk.

These outages are no longer isolated. It feels redundant to even write about each one as they happen. And as AI workloads multiply, we can fully expect this to happen in more regions.

So what should teams actually do?

The goal isn’t to build a perfect, zero-downtime system, but to be honest about what you’re actually prepared for. We also need to close the gaps that we know about and seek advice on what we don’t.

Here’s where to start:

  • Have you accounted for physical risks in your resiliency plan, not just software and ransomware?
  • Is your latest snapshot backed up in another region, ready to restore?
  • Have you pre-built a target environment with the right IAM roles, security groups, and permissions before you need it?
  • Are your network configs cloned and deployable? A restore without your VPN, load balancers, and routing is meaningless.
  • Are you minimizing transfer costs? Keeping only your most recent backup in a secondary region goes a long way.
  • Have you considered cross-cloud backup? It protects against physical cloud failures and full-cloud software outages alike.
  • Have you dry-run a real restore scenario, including one where your primary region is completely unavailable?
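
Two of these questions (a current snapshot in another region, and keeping only the most recent copy there) can be sketched in a few lines. This is a minimal sketch, not a complete backup tool. It assumes `dest_ec2` is a boto3 EC2 client created for the secondary region, and that you track which source snapshot each copy came from in a `SourceSnapshotId` field. That field, the function names and the sample inventory are hypothetical; in practice you might populate the field from a tag.

```python
from datetime import datetime, timezone


def latest_completed(snapshots):
    """Return the newest snapshot whose State is 'completed', or None."""
    done = [s for s in snapshots if s["State"] == "completed"]
    return max(done, key=lambda s: s["StartTime"]) if done else None


def plan_dr_sync(source_snapshots, dr_copies):
    """Decide what to copy to the secondary region and which old copies to prune.

    Returns (snapshot_id_to_copy_or_None, [dr_snapshot_ids_to_delete]).
    """
    newest = latest_completed(source_snapshots)
    if newest is None:
        return None, []
    matching = [c for c in dr_copies if c["SourceSnapshotId"] == newest["SnapshotId"]]
    to_copy = None if matching else newest["SnapshotId"]
    # Prune older copies only once a copy of the newest snapshot has completed,
    # so the secondary region is never left without a restorable backup.
    if any(c["State"] == "completed" for c in matching):
        stale = [c["SnapshotId"] for c in dr_copies if c not in matching]
    else:
        stale = []
    return to_copy, stale


def apply_plan(dest_ec2, source_region, to_copy, stale):
    """Run the plan against an EC2 client bound to the secondary region (assumed boto3)."""
    if to_copy:
        dest_ec2.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=to_copy,
            Description=f"DR copy of {to_copy}",
        )
    for snapshot_id in stale:
        dest_ec2.delete_snapshot(SnapshotId=snapshot_id)


# Hypothetical inventory: two source snapshots, one older copy in the DR region.
source = [
    {"SnapshotId": "snap-old", "State": "completed",
     "StartTime": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-new", "State": "completed",
     "StartTime": datetime(2025, 1, 2, tzinfo=timezone.utc)},
]
copies = [{"SnapshotId": "snap-dr-1", "SourceSnapshotId": "snap-old", "State": "completed"}]
print(plan_dr_sync(source, copies))
```

The planning step is kept separate from the API calls so it can be dry-run without touching any account. The older copy is deleted only after the newer one has completed, which keeps transfer and storage costs down without ever leaving the secondary region empty.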

One more question worth asking out loud: are the events your team calls “high impact, low probability” still low probability? When they’re happening every few months, that label needs to be revisited in IT business continuity meetings.
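
To put a rough number on that label, here is a back-of-the-envelope sketch. It assumes events arrive independently at a steady average rate (a Poisson model), which is a simplification, and the intervals used are hypothetical. It also says nothing about whether any given event reaches your own workloads.

```python
import math


def prob_at_least_one(events_per_year: float, horizon_years: float = 1.0) -> float:
    """Probability of one or more events within the horizon, under a Poisson model."""
    return 1.0 - math.exp(-events_per_year * horizon_years)


# Hypothetical average gaps between major physical outages, in months.
for interval_months in (120, 24, 4):
    rate = 12 / interval_months
    p = prob_at_least_one(rate)
    print(f"one event every {interval_months} months: {p:.0%} chance of at least one this year")
```

Under these assumptions, a once-a-decade event has roughly a 10% chance of showing up in a given year, while an event that recurs every four months has roughly a 95% chance. The second one is no longer “low probability” by any reasonable definition.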

US-East-1 overheated, and for thousands of teams, workloads went with it. The bigger question isn’t whether this will happen again (because it will). The question is whether you’re treating this as the signal it is.

Our recommendation? Close the gaps now, before a larger and more widespread outage makes the decision for you. If you’re not sure where to start, our step-by-step outage response guide walks you through exactly what to assess, what to build, and how to dry-run a real restore before you ever need it.
