“Last Monday, at 3:47 AM,” a customer (who we’ll call Todd) explained, “my phone lit up. Not with one alert—but with dozens.”
Todd is the Infrastructure Lead at a global FinTech company. He and his team have run its backup processes for years, and they’d seen regions go down before with no real impact. This time, though, their primary region went dark.
This could easily have been a story about the 15 hours of chaos that followed. But it’s not.
Here’s what makes Todd’s story different from those of the hundreds of other sleepless engineers that night: by 4:15 AM, his company was back online. Fully operational. While competitors sent apologetic emails to customers and executives held emergency board calls, Todd’s team was drinking coffee and monitoring dashboards that showed green across the board.
How?
A few clicks roughly six months earlier, and simple execution when disaster struck. What it really came down to was the foresight to prepare and the decisions they’d made months in advance. Four things in particular turned a potential catastrophe into what Todd now calls “the night that justified every DR conversation we’ve ever had with N2W.”
1. Their Lifeline Wasn’t in the Fire
When the AWS outage hit, Todd’s first instinct was to check the AWS Service Health Dashboard, where customers can see in real time whether there are any ongoing issues across services and regions. That’s when he learned his company wasn’t the only one that was down.
As an N2W customer, Todd’s N2W server runs entirely inside his own AWS environment, not on an external SaaS platform. That means his backup and recovery isn’t tied to someone else’s infrastructure, nor is it limited by when or what he’s allowed to recover. When the regional outage hit, Todd could still access everything through native AWS APIs, the gold standard for cloud-native backup and recovery.
By contrast, many backup tools store customer data in their own proprietary clouds, which still depend on the very AWS services and regions that may be failing. Todd didn’t have that vulnerability. The only component he had to spin up was his lightweight N2W backup instance, fully contained in his own account, exactly where it needed to be during an outage.
“We weren’t waiting for someone else’s service to come back online,” he told us later. “We weren’t in a queue with ten thousand other panicked customers. Our tools were ours, running in our environment, completely accessible.”
No matter what prevents access to your data, restore must be immediate. And you can’t restore anything if your recovery tool is just as unavailable as your production environment.
2. They Had the Playbook Already
Most companies didn’t have a regional failure in their disaster recovery playbook. Around 80% of restores are a few deleted files, maybe a critical server or two. Regional blackouts? Those felt highly unlikely, and even if one did happen, it seemed too costly to prepare for. Why replicate data and pay double for storage in another region when Availability Zones advertise 99.9999% availability?
This wasn’t the case with Todd’s team.
Months earlier, they’d built Recovery Scenarios specifically for this type of outage. Their critical data was already replicated to a separate region AND a separate account (in case of a ransomware attack), not as a someday project but as live, tested infrastructure. They had run scheduled drills with the click of a button, pre-prioritized which resources to recover first, and produced reports documenting that a working plan was in place.
Todd’s team had also thought about costs beforehand. They were saving significantly on storage by archiving their snapshots to low-cost Amazon S3 tiers, in this case S3 Glacier Instant Retrieval.
When the alerts started flooding in, Todd didn’t need to figure out what to replicate or where to send it. He didn’t need to calculate how long it would take to copy terabytes of data across regions, because the data was already stored intelligently and affordably, right where it needed to be.
While the clock ticked and other companies’ customers were locked out, Todd knew that their playbook was already written. The infrastructure was already in place. All he had to do was execute.
“We had regional recovery scenarios ready to go,” he explained. “When the outage hit, we didn’t scramble to figure out what to prioritize or where to restore from.”
A few practices make that kind of readiness possible for any team:
- Use Lifecycle Policies to Save Money: Set up lifecycle policies to move old backups to cheaper storage tiers like Amazon S3 Glacier. This can cut your storage costs significantly (a minimal sketch of this and the cross-region tip appears after this list).
- Back Up to Different Regions and Accounts: Make your disaster recovery plan stronger by copying backups to different AWS regions or accounts. This protects your data from region-specific problems and security issues.
- Automate Your Backup to Reduce RTO: Use AWS Backup to set up frequent backup intervals. Automating backups every hour or even every few minutes ensures you can recover data quickly, minimizing downtime.
- Tag Your Resources for Easy Management: Tags help you quickly identify and group related backups, making it easier to manage them and to monitor costs. This can also simplify reporting and compliance checks.
- Test Your Disaster Recovery Plan Regularly: Automate DR drills to exercise your backup and recovery processes. Verify that your backups work and that you can restore data quickly, so you find and fix potential problems before a real outage.
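To make the first two tips concrete, here is a minimal sketch using plain boto3 (generic AWS tooling, not N2W’s product). The bucket name, snapshot ID, and regions are hypothetical placeholders: the script adds a lifecycle rule that moves older backups to S3 Glacier Instant Retrieval, then copies an EBS snapshot into a recovery region with a DR tag.

```python
"""Illustrative sketch only: bucket, snapshot, and region names are hypothetical.
Requires boto3 and AWS credentials with S3 and EC2 permissions."""
import boto3

# 1. Lifecycle rule: move backups older than 30 days to S3 Glacier Instant Retrieval.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-backup-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-backups",
            "Filter": {"Prefix": "backups/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER_IR"}],
        }]
    },
)

# 2. Cross-region copy: duplicate an EBS snapshot into a recovery region and tag it.
#    CopySnapshot is called in the destination region and pulls from SourceRegion.
ec2_dr = boto3.client("ec2", region_name="us-west-2")  # hypothetical DR region
ec2_dr.copy_snapshot(
    SourceRegion="us-east-1",                  # hypothetical primary region
    SourceSnapshotId="snap-0123456789abcdef0", # hypothetical snapshot ID
    Description="DR copy for regional failover drills",
    TagSpecifications=[{
        "ResourceType": "snapshot",
        "Tags": [{"Key": "dr-tier", "Value": "critical"}],
    }],
)
```

In practice you would drive this from your backup tool or automation pipeline rather than ad hoc scripts, but the underlying API calls are the same.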
3. They Recovered More Than Just Data
Here’s something most people miss about disaster recovery: your data is only half the battle.
You can spin up perfect copies of your databases and servers in a new region, but if your network configuration (security groups, subnets, routing tables) isn’t set up correctly, your restore is pointless. And configuring all of that manually, at 4 AM, under pressure, is not exactly doable.
More often than not, it’s this virtual cabling between restored servers where things fall apart.
Todd’s team had pre-cloned their entire network environment and all of the metadata associated with their servers. Other teams, meanwhile, had to scramble through files to figure out every piece of their server configuration.
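For teams without a tool that pre-clones network settings, even a periodic export of that metadata beats reverse-engineering it at 4 AM. Here is a minimal sketch using generic boto3 describe calls (not how N2W does it under the hood; the region and file name are hypothetical, and pagination is omitted for brevity):

```python
"""Illustrative sketch only: exports security groups, subnets, and route tables
so they can be reviewed and re-created in a recovery region before an outage."""
import json
import boto3

def export_network_metadata(region: str, outfile: str) -> None:
    ec2 = boto3.client("ec2", region_name=region)
    snapshot = {
        "security_groups": ec2.describe_security_groups()["SecurityGroups"],
        "subnets": ec2.describe_subnets()["Subnets"],
        "route_tables": ec2.describe_route_tables()["RouteTables"],
    }
    with open(outfile, "w") as f:
        json.dump(snapshot, f, indent=2, default=str)

if __name__ == "__main__":
    # Hypothetical primary region; run this on a schedule and keep the output
    # somewhere outside the region you are trying to protect.
    export_network_metadata("us-east-1", "network-metadata.json")
```

Store the output outside the region you are protecting, and review it as part of every DR drill.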
4. They’d Practiced Until It Became Boring
This is the part that made all the difference.
Todd’s team had tested their recovery scenarios so many times that it had become routine. Almost boring. Every quarter, they’d run through the entire process: trigger the failover, verify the recovery, document the results, then fail back.
By the time the real outage happened, executing the recovery wasn’t a high-stakes moment. It was a regular Monday. Fifteen minutes later, they were operational in their secondary region.
They had built it into muscle memory. No scrambling over where to recover from or which servers to prioritize. Their network settings were already pre-cloned, and the reports proving successful DR tests were already generated.
And here’s what truly set them apart: N2W is the only backup and DR platform that lets customers rehearse and automate full-region recovery from inside their own AWS-native environment, something neither AWS Backup nor any other competitor offers.
No Overhaul Needed. Just Simple Execution
After the outage, I heard from dozens of teams who spent the next week questioning everything. “Do we need to go multicloud?” “Should we rebuild our entire DR strategy?” “How do we make sure this never happens again?”
But the real lesson isn’t that your infrastructure is wrong. It’s much simpler than that. You don’t need to rebuild everything from scratch. You need:
- The realization that even 99.9999% availability across Availability Zones doesn’t help when the entire region becomes unreachable.
- Full control of your recovery tools. If your SaaS DR solution lives in the same cloud region that’s failing, you don’t have DR. You have a shared point of failure.
- To build recovery scenarios before you need them. Cross-region replication and pre-prioritized resources aren’t luxuries. They’re the baseline for regional outages that will inevitably happen again.
- To automate the complicated stuff. Your team shouldn’t be manually configuring security groups at 3 AM. Pre-clone your network settings and load them into your recovery templates.
- To test until it’s muscle memory. The organizations that recovered fastest weren’t lucky. They’d practiced. Regularly. Until the execution became routine.
The best DR strategies aren’t the most complex ones. They’re the ones you can execute under pressure without thinking.
Todd ended our conversation with something that stuck with me: “Every dollar we spent on DR felt like insurance we’d probably never use. Until Monday night. Now it feels like the best money we ever spent.”
The AWS outage wasn’t a sign that your strategy is broken. It was a reminder that having a strategy on paper isn’t enough.
The question is: when your phone lights up at 3:47 AM, will you be ready to execute?
Ready to build a disaster recovery strategy you can execute under pressure? Learn how N2W helps organizations prepare for regional outages: get.n2ws.com/cloud-outage-guide