AWS’s Latest “Thermal Event” and What It Means for Your Disaster Recovery Planning

AWS's latest 'thermal event' in US-East-1 took down Coinbase, FanDuel, CME Group, and more for nearly two days after a data center overheated. Here's what that means for your disaster recovery plan, and why multi-AZ alone isn't enough for a forward-looking DR strategy.

The Cloud Overheated. Did Your DR Plan Account for That? Probably Not

AWS’s prolonged “Thermal Event” in US-East-1 last Friday is essentially a preview of what’s to come, and the worry is not just the immediate data loss. It’s that enterprises are looking to their cloud provider to fix this, rather than being honest about systemic risks and how to achieve true resiliency.

What Happened?

For nearly two days, servers in one AWS Availability Zone in US-East-1, use1-az4, were shut down due to overheating. Although just one AZ was affected, the AWS Health Dashboard reported multiple AZs with completely disrupted or degraded EC2 instances, EBS volumes, Redshift clusters, and other services in the Northern Virginia region. This affected crypto, compute, sports betting, and derivatives exchange platforms such as Coinbase, Heroku, Modal, CME Group, and FanDuel. Millions of users felt it, all due to an overheated rack in a data center.

Multi-AZ isn’t the safety net we assumed

The simple answer to availability concerns, and the one hyperscalers most often recommend, has always been: just go multi-AZ. This works for small, contained physical hiccups in a particular AZ: your VMs fail over, and most of the time your users barely notice. But a prolonged physical outage, like the one we saw over the weekend, is different.

We saw this in the March AWS regional outage, when a drone hit an AZ in the Gulf region. A large number of workloads were redirected from the impaired AZ to the remaining ones, dependency issues emerged, and overload became the new problem. In this case, AWS was vague about why multiple AZs were affected, but we can read between the lines based on what we’ve seen before. Multi-AZ alone just isn’t enough anymore.

Coinbase learned this the hard way. Brian Armstrong, CEO of Coinbase, openly acknowledged on X that they’d been running dependent on a single AZ, and defended the decision by pointing to the tradeoffs involved:

“We design our services to be redundant to downtime in any one AWS AZ — this can introduce latency delays that are not desirable along with breaking customer co-location.”

It’s an honest admission, and it reflects a calculation that a lot of enterprises are (unfortunately) making. Cross-region redundancy costs more. It adds latency. It complicates compliance. And so teams decide the risk is acceptable, right up until their customers can’t access their cryptocurrency trading platform for over a day.

AI workloads and the physical reality of the cloud

We talk a lot about virtual architecture when we design our cloud systems, but what we’ve seen over the last year especially is that the physical reality gets lost. We saw that physical reality hit data centers back in March, due to war.

Now we are reminded that data centers need to be cooled, and that those cooling systems have serious limits: AI workloads are pushing power densities that current thermal engineering simply wasn’t designed to handle.

AI is concentrating enormous compute (and therefore enormous heat) into individual regions. US-East-1 happens to be one of the most popular, and in this scenario the cooling couldn’t keep up. And when physical limits are pushed, the fixes aren’t like bug fixes. They are prolonged repairs, with hardware lead times and timelines of their own. Most DR plans are built for quick software bugs and patches, not multi-day hardware repairs.

Meanwhile, AWS’s own guidance during the last Gulf region outage, when a fire broke out, was to back up to another region. That recommendation came after the fact, when many customers couldn’t even access their data to act on it.

The threat model needs to plan for the unfamiliar

Business continuity teams have traditionally planned for ransomware, code bugs, and the occasional natural disaster. Now the threat model has to include cooling failures, power density limits, and geopolitical risk.

These outages are no longer isolated. It feels redundant to even write about each one as they happen. And as AI workloads multiply, we can fully expect this to happen in more regions.

So what should teams actually do?

The goal isn’t to build a perfect, zero-downtime system, but to be honest about what you’re actually prepared for. We also need to close the gaps we know about and seek advice on the ones we don’t.

Here’s where to start:

  • Have you accounted for physical risks in your resiliency plan, not just software and ransomware?
  • Is your latest snapshot backed up in another region, ready to restore? (See the snapshot-copy sketch after this list.)
  • Have you pre-built a target environment with the right IAM roles, security groups, and permissions before you need it? (See the pre-flight check sketch below.)
  • Are your network configs cloned and deployable? A restore without your VPN, load balancers, and routing is meaningless.
  • Are you minimizing transfer costs? Keeping only your most recent backup in a secondary region goes a long way.
  • Have you considered cross-cloud backup? It protects against physical cloud failures and full-cloud software outages alike.
  • Have you dry-run a real restore scenario, including one where your primary region is completely unavailable? (See the dry-run sketch below.)
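
To make the second and fifth questions concrete, here’s a minimal sketch using boto3 that copies your newest EBS snapshot into a secondary region, then prunes older copies so you’re only paying to store the most recent one. The region names and the `purpose: dr-copy` tag are illustrative assumptions, not prescriptions.

```python
# Minimal sketch, assuming boto3 credentials and illustrative region names:
# copy the newest completed EBS snapshot to a secondary region, then prune
# older DR copies so only the most recent one accrues storage costs.
import boto3

SOURCE_REGION = "us-east-1"
DR_REGION = "us-west-2"                          # assumed secondary region
DR_TAG = {"Key": "purpose", "Value": "dr-copy"}  # hypothetical tag scheme

src = boto3.client("ec2", region_name=SOURCE_REGION)
dst = boto3.client("ec2", region_name=DR_REGION)

# Newest completed snapshot we own in the primary region.
snaps = src.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "status", "Values": ["completed"]}],
)["Snapshots"]
latest = max(snaps, key=lambda s: s["StartTime"])

# copy_snapshot is issued against the *destination* region's client.
copy = dst.copy_snapshot(
    SourceRegion=SOURCE_REGION,
    SourceSnapshotId=latest["SnapshotId"],
    Description=f"DR copy of {latest['SnapshotId']}",
)
dst.create_tags(Resources=[copy["SnapshotId"]], Tags=[DR_TAG])
dst.get_waiter("snapshot_completed").wait(SnapshotIds=[copy["SnapshotId"]])

# Keep only the newest DR copy; delete the rest to cap transfer/storage spend.
dr_copies = dst.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": f"tag:{DR_TAG['Key']}", "Values": [DR_TAG["Value"]]}],
)["Snapshots"]
for old in sorted(dr_copies, key=lambda s: s["StartTime"])[:-1]:
    dst.delete_snapshot(SnapshotId=old["SnapshotId"])
```

Run on a schedule, this keeps exactly one warm copy in the DR region, which is usually the right cost/readiness tradeoff for a first pass.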
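
For the pre-built target environment, a simple pre-flight check run on a schedule catches missing scaffolding long before an outage does. The sketch below only verifies that resources exist; the role and security group names are hypothetical placeholders for your own.

```python
# Minimal sketch: verify the DR region already has the scaffolding a restore
# needs (IAM role, security group, VPC) *before* an outage. Resource names
# below are hypothetical placeholders.
import boto3

DR_REGION = "us-west-2"           # assumed secondary region
RESTORE_ROLE = "dr-restore-role"  # hypothetical IAM role name
RESTORE_SG = "dr-restore-sg"      # hypothetical security group name

iam = boto3.client("iam")         # IAM is a global service
ec2 = boto3.client("ec2", region_name=DR_REGION)

failures = []

try:
    iam.get_role(RoleName=RESTORE_ROLE)
except iam.exceptions.NoSuchEntityException:
    failures.append(f"IAM role {RESTORE_ROLE!r} is missing")

groups = ec2.describe_security_groups(
    Filters=[{"Name": "group-name", "Values": [RESTORE_SG]}]
)["SecurityGroups"]
if not groups:
    failures.append(f"security group {RESTORE_SG!r} not found in {DR_REGION}")

if not ec2.describe_vpcs()["Vpcs"]:
    failures.append(f"no VPC exists in {DR_REGION}")

if failures:
    raise SystemExit("DR pre-flight failed:\n- " + "\n- ".join(failures))
print("DR pre-flight passed: target environment is ready.")
```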
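
And for the dry-run itself, the sketch below rehearses a restore entirely inside the secondary region, assuming the primary is unreachable: it turns the copied snapshot into a real EBS volume, waits for it to become available, and cleans up. A fuller rehearsal would attach the volume to an instance and exercise the application on top of it.

```python
# Minimal sketch: a restore dry-run in the secondary region that never
# touches the primary. Region, AZ, and the "dr-copy" tag are assumptions
# carried over from the snapshot-copy sketch above.
import boto3

DR_REGION = "us-west-2"  # assumed secondary region
DR_AZ = "us-west-2a"     # assumed AZ for the test volume

ec2 = boto3.client("ec2", region_name=DR_REGION)

# Use the newest DR snapshot copy (tagged by the copy job above).
snaps = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "tag:purpose", "Values": ["dr-copy"]}],
)["Snapshots"]
latest = max(snaps, key=lambda s: s["StartTime"])

# Restore into a real volume and confirm it actually becomes available.
volume = ec2.create_volume(
    SnapshotId=latest["SnapshotId"],
    AvailabilityZone=DR_AZ,
    VolumeType="gp3",
)
vol_id = volume["VolumeId"]
ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])
print(f"Restore dry-run OK: {latest['SnapshotId']} -> {vol_id} in {DR_AZ}")

# Clean up the test volume so the rehearsal costs almost nothing.
ec2.delete_volume(VolumeId=vol_id)
```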

One more question worth asking out loud: are the events your team calls “high impact, low probability” still low probability? When they’re happening every few months, that label needs to be revisited in IT business continuity meetings.

US-East-1 overheated, and for thousands of teams, workloads went with it. The bigger question isn’t whether this will happen again (because it will). The question is whether you’re treating this as the signal it is.

Our recommendation? Close the gaps now, before a larger and more widespread outage makes the decision for you. If you’re not sure where to start, our step-by-step outage response guide walks you through exactly what to assess, what to build, and how to dry-run a real restore before you ever need it.
