
Amazon AWS Outage: Why Your Disaster Recovery Plan Probably Won’t Work

Charlie Maclean-Bristol looks at how the AWS outage shows that “it does actually happen” despite all our layers of defense and resilience – a stark reminder of the close coupling of systems, where hidden dependencies can align to trigger failure on a global scale.

While preparing for an exercise recently, I saw a BBC News report on the Amazon outage. Given that my exercise scenario focused on the loss of a critical payment system at a financial institution, the timing proved particularly fitting. It reminded all the participants of the fragility of our internet infrastructure, and that the failure we dismiss with “it wouldn’t happen, we have this in place…” does actually happen despite all the layers of defense and resilience. The outage also illustrates the close coupling of systems I have written about in previous bulletins: new systems are built on top of existing ones, often without anyone having a complete overview of the entire structure. This creates inherent vulnerabilities that only become apparent when specific conditions align to cause a failure.

The AWS outage on 20 October proves a hard truth: your disaster recovery strategy likely can’t handle what just happened to over 1,000 organizations worldwide. Here’s what went wrong and what I think you should consider.

The DNS single point of failure

Amazon’s US-EAST-1 region in Virginia, its largest cluster of data centers, experienced what AWS called a “latent race condition” in its DNS system. Critical processes that store and manage Domain Name System records fell out of sync, triggering automated failures across the network.

“It’s always DNS!” is a running joke among tech professionals, because this unglamorous corner of the stack causes disproportionate havoc when it goes wrong. The problem wasn’t a cyber attack or a dramatic hardware failure – it was mundane infrastructure falling out of sync in an unlikely sequence of events.

Action: Map your DNS dependencies now. Run a tabletop exercise asking: what happens when your DNS provider fails? Most organizations can’t answer this.
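If you want to start that mapping with something concrete, the sketch below is one illustrative way to do it in Python. It assumes the dnspython library and a purely hypothetical list of critical hostnames: it walks up each name to find the authoritative nameservers, then groups your hostnames by nameserver domain so that shared DNS providers stand out.

```python
# Minimal sketch of a DNS dependency map. Assumes the dnspython library
# (pip install dnspython); the hostname list is purely illustrative.
from collections import defaultdict
import dns.resolver

CRITICAL_HOSTS = ["payments.example.com", "api.example.com"]  # hypothetical

def authoritative_nameservers(hostname: str) -> set:
    """Walk up the name until a zone with NS records is found."""
    labels = hostname.split(".")
    while labels:
        zone = ".".join(labels)
        try:
            answer = dns.resolver.resolve(zone, "NS")
            return {str(record.target).rstrip(".") for record in answer}
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            labels = labels[1:]  # no NS records here; try the parent zone
    return set()

# Group critical hostnames by the domain that serves their DNS.
provider_hosts = defaultdict(set)
for host in CRITICAL_HOSTS:
    for ns in authoritative_nameservers(host):
        provider = ".".join(ns.split(".")[-2:])  # e.g. "awsdns-03.org"
        provider_hosts[provider].add(host)

for provider, hosts in sorted(provider_hosts.items()):
    print(f"{provider}: {sorted(hosts)}")
```

The output is only a starting point, but even a crude grouping like this tends to show how many supposedly independent services quietly share one DNS provider.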

The concentration risk regulators warned about

The UK government holds 189 AWS contracts worth £1.7bn, with 35 public sector authorities dependent on AWS across 41 active contracts. The irony? As technology partner Tim Wright noted: “The FCA and PRA have repeatedly highlighted the dangers of concentration risk in cloud service provision for regulated entities for a number of years.”

The Treasury committee has now written asking why Amazon hasn’t been designated a “critical third party” to UK financial services, which would impose regulatory oversight.

Over 2,000 companies were affected globally. Lloyds Bank customers couldn’t access services until mid-afternoon. HMRC was disrupted. Airlines faced check-in delays. Even smart beds overheated and got stuck in inclined positions when Eight Sleep’s internet-connected mattresses lost connectivity.

Professor Brent Ellis called it “nested dependency” – where popular platforms rely on technical underpinnings controlled by just a few providers. “Even small service outages can ripple through the global economy,” he warned.

The migration problem: Companies face “prohibitively high” costs to move data away from AWS once they are embedded. Stephen Kelly of Cirata noted that the explosion of enterprise data held with single providers makes switching vendors financially unrealistic.

Action: Document AWS dependencies in your supply chain risk register. Which critical third parties in your ecosystem run on AWS? Add this to your due diligence checks, and you may find that both you and your supplier have the same AWS region as a single point of failure.
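As a rough illustration of what such a register entry might capture, here is a hedged Python sketch. The field names, the supplier, and the regions are all assumptions, not a prescribed format.

```python
# Illustrative shape for a supply-chain risk register entry that records
# cloud dependencies. Field names, suppliers and regions are assumptions.
from dataclasses import dataclass, field

@dataclass
class SupplierCloudDependency:
    supplier: str                    # third party in your ecosystem
    service: str                     # what they provide to you
    cloud_provider: str              # "AWS", "Azure", "Google Cloud", ...
    regions: list = field(default_factory=list)   # e.g. ["us-east-1"]
    dns_provider: str = "unknown"    # who serves their DNS records
    single_region: bool = True       # no documented failover region

register = [
    SupplierCloudDependency(
        supplier="ExamplePay Ltd",           # hypothetical payment provider
        service="card payment processing",
        cloud_provider="AWS",
        regions=["us-east-1"],
        dns_provider="Route 53",
    ),
]

OUR_REGIONS = {"us-east-1"}  # assumption: where our own workloads run

# Flag suppliers that share a single AWS region with our own estate.
for entry in register:
    shared = OUR_REGIONS & set(entry.regions)
    if entry.cloud_provider == "AWS" and entry.single_region and shared:
        print(f"Shared single point of failure with {entry.supplier}: {sorted(shared)}")
```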

The recovery reality check

The outage began at 8am BST. Some services recovered within hours. Others, including Lloyds, Venmo, and Reddit, experienced problems until mid-afternoon. Full restoration took approximately 15 hours.

Professor Mike Chapple from Notre Dame identified “cascading failures” during recovery: “It’s like a large-scale power outage. The power might flicker a few times.” Amazon initially “only addressed the symptoms” rather than the root cause.

This matters because your Recovery Time Objectives (RTOs) almost certainly assume faster vendor resolution. Delta Air Lines is still pursuing over $500m in losses from CrowdStrike more than a year after that incident, partly because they had to manually reset 40,000 servers even after the vendor fixed the problem.

Action: Review your RTOs. What’s your actual recovery capability when your cloud provider says “investigating”? When they say “resolved” but cascading failures continue? Are your RTOs realistic?
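To make that question concrete, a back-of-the-envelope check like the one below can help. The 15-hour figure comes from the outage described above; the internal recovery tail and the four-hour RTO are assumptions you would replace with your own numbers.

```python
# Back-of-the-envelope RTO reality check. The vendor outage length is taken
# from the 20 October timeline; the recovery tail and RTO are assumptions.
from datetime import timedelta

vendor_outage = timedelta(hours=15)          # approximate full AWS restoration
internal_recovery_tail = timedelta(hours=3)  # assumed: restarts, backlogs, reconciliation
declared_rto = timedelta(hours=4)            # assumed RTO for the critical service

actual_recovery = vendor_outage + internal_recovery_tail
if actual_recovery > declared_rto:
    print(f"RTO breached by {actual_recovery - declared_rto} "
          f"(actual {actual_recovery} vs declared {declared_rto})")
else:
    print("Recovery within declared RTO")
```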

What companies should have done differently

Professor Ken Birman from Cornell was blunt: “Companies using Amazon haven’t been taking adequate care to build protection systems into their applications.”

The appeal of hyperscalers is clear – no hefty server costs, fluctuating traffic handled seamlessly, enhanced cybersecurity. But as Professor Madeline Carr noted: “Assuming they are too big to fail or inherently resilient is a mistake, with the evidence being the current outage and past ones.”

Three things to do next week:

  1. Review your exposure to DNS failure – not just multi-zone within the same region. US-EAST-1 took down “distributed” architectures because they shared DNS infrastructure.
  2. Run a supplier dependency audit – map which critical services (yours and your third parties’) depend on AWS, Azure, or Google Cloud (a simple tally sketch follows this list).
  3. Challenge your RTO assumptions and manual workaround abilities – ask your incident management team: is a data center failure liable to cause us to breach our RTOs or Impact Tolerances? What manual workarounds exist, and are they documented?
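For item 2, the audit output can be summarized with something as simple as the tally below. The supplier-to-provider mapping is invented for illustration and would normally come out of your risk register or due diligence records.

```python
# Minimal concentration tally over a supplier dependency audit.
# The supplier-to-provider mapping below is invented for illustration.
from collections import Counter

supplier_clouds = {
    "ExamplePay Ltd": "AWS",
    "ExampleCRM Inc": "AWS",
    "ExampleMail Co": "Google Cloud",
    "ExampleHR Ltd": "Azure",
}

concentration = Counter(supplier_clouds.values())
total = len(supplier_clouds)
for provider, count in concentration.most_common():
    print(f"{provider}: {count}/{total} critical suppliers ({count / total:.0%})")
```

A lopsided tally is your concentration risk in one line – the same pattern the FCA and PRA have been warning about.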

As Dr Corinne Cath-Speth from Article 19 stated: “The infrastructure underpinning democratic discourse, independent journalism and secure communications cannot be dependent on a handful of companies.”

Your next exercise should test what happens when that handful fails. 

+++++++++++++++++++++++++++++++++++++++++++++++

This article was originally published by BC Training Ltd.

Charlie Maclean-Bristol is the author of the groundbreaking book, Business Continuity Exercises: Quick Exercises to Validate Your Plan.


“Charlie drives home the importance of continuing to identify lessons from real-life incidents and crises, but more importantly, how to learn the lessons and bring them into our plans. Running an exercise, no matter how simple, is always an opportunity to learn.” – Deborah Higgins, Head of Cabinet Office, Emergency Planning College, United Kingdom

Click here for your FREE business continuity exercises!
