By Andy Bold
When AWS us-east-1 goes down, the internet holds its breath. Services fall like dominoes. Status pages stop updating (because they’re hosted in us-east-1). Engineers scramble. And inevitably, once the dust settles, the hot takes begin.
“They just need better multi-region architecture.”
“We just need to be multicloud.”
“Companies just need proper Business Continuity Plans.”
That word, “just,” is doing a lot of heavy lifting. And it’s dangerous.
But more than that, it’s unkind. Behind every outage is a team having the worst day of their month, trying to bring systems back online while the internet watches and judges.
The seductive simplicity of “just”
“Just” makes complex problems sound simple. It transforms months of engineering work, millions in infrastructure costs, and fundamental architectural decisions into something that sounds like an afternoon’s work.
Let’s unpack what’s really hiding behind each of these “justs.”
“Just” build multi-region
Here’s what that “afternoon of work” actually involves:
Data consistency: Do you need strong consistency or eventual consistency? Strong means dramatically increased latency. Eventual means conflict resolution strategies. Who decides which write wins?
Compliance: GDPR says European data stays in Europe. Your “simple” multi-region setup now needs geo-fencing, data residency controls, and legal review for every region.
Cost: You’re not doubling infrastructure costs—you’re multiplying by 3-5x when you include cross-region data transfer, replication, monitoring, and engineering time.
Testing complexity: You now need to test region failover, failback, partial failures, network partitions between regions, and degraded-but-not-down scenarios. Your test matrix just exploded.
Multi-region isn’t a switch you flip. It’s a fundamental architectural decision that touches every part of your system.
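To make the “who decides which write wins?” question concrete, here is a minimal sketch of last-write-wins conflict resolution between two region replicas. It assumes every write carries a wall-clock timestamp; the field names and values are illustrative, not any particular database’s API:

```python
from dataclasses import dataclass

@dataclass
class Write:
    """A replicated write: the value plus the wall-clock time it was made."""
    value: str
    timestamp: float  # seconds since epoch; clock skew between regions is a real hazard

def resolve(a: Write, b: Write) -> Write:
    """Last-write-wins: the later timestamp survives; the other write is silently lost."""
    return a if a.timestamp >= b.timestamp else b

# Two regions accepted conflicting writes during a network partition:
us = Write("shipped", timestamp=1700000010.0)
eu = Write("cancelled", timestamp=1700000012.5)

print(resolve(us, eu).value)  # prints "cancelled" — the later EU write wins
```

Even this “simple” policy silently discards a customer’s write and depends on synchronized clocks across regions. That is exactly the kind of trade-off the word “just” glosses over.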
“Just” go multicloud
If multi-region is complex, multicloud is multi-region on hard mode.
API differences: AWS Lambda isn’t Google Cloud Functions isn’t Azure Functions. Your infrastructure-as-code? Rewrite it. Your team’s expertise? Divide it across multiple platforms.
Data egress costs: Moving data between clouds isn’t just expensive—it’s really expensive. Your CFO would like a word.
Operational overhead: Different consoles, different APIs, different debugging tools, different support processes. Your on-call engineer needs fluency in multiple cloud dialects.
The paradox: The outage you’re protecting against is vanishingly rare. The complexity you’re introducing is constant. You’re trading a 0.01% risk for 100% operational overhead.
“Just” write a better BCP
BCPs are written in calm moments by rested people with coffee. They’re executed by stressed engineers who’ve been awake for 18 hours, hands shaking from adrenaline, trying to think through brain fog.
The plan says “Step 3: Verify replication status.” It doesn’t say what to do when the replication dashboard is also down, you can’t access the database because your VPN is flaking, and you’ve never actually done this before.
They’re obsolete immediately: Your architecture changed last sprint. When was the last time you updated your BCP?
They assume you can think clearly: The person executing this is terrified of making it worse.
They can’t cover everything: The failure you didn’t plan for is the one that will happen.
“Just” ignores the humans
Here’s what “just” really means:
“If they had just done this obvious thing, everything would be fine.”
It’s blame disguised as advice. But worse, it ignores the people.
The engineer who chose differently
The architect who deployed to a single region wasn’t lazy. They were working with a budget approved six months ago, a go-to-market deadline that couldn’t slip, a team of three engineers supporting five services, and a backlog of security issues that needed attention first.
They probably did consider multi-region. They looked at what it would actually take and made a trade-off. “Just” implies they hadn’t thought of it. They had. They chose differently.
The person getting paged at 3 AM
Someone gets woken up. They’re pulled from sleep, from dinner, from their kid’s birthday party. The incident channel is chaos. Users are angry. Leadership wants updates.
This person is trying to think clearly while exhausted and stressed, following runbooks that might be outdated, making decisions that could make things better or worse, hoping they don’t make it worse.
“Just follow the BCP” assumes peak cognitive function. The reality is adrenaline, fear, and brain fog.
The team that shipped last quarter
Behind every “just build it better” is a team told to ship features faster. They weren’t given time to build perfect architecture. They were given a sprint and a goal. They made it work and moved on to the next urgent thing.
“Just” implies they had unlimited time and resources. They had standup at 9 AM and a demo on Friday.
The human cost of “just”
Every “just” carries invisible weight.
The engineer who made the architectural decision reads “just use multi-region” and feels guilt over a choice they made with different constraints.
The person on-call during the incident reads “just follow the plan” and remembers the panic and the feeling of letting everyone down.
The manager who allocated the budget reads “just invest in resilience” and thinks about the other critical needs that went unfunded.
“Just” is a word that costs nothing to say and everything to hear.
A better framework: Expect failure, master response
Since we can’t “just” prevent all failures, let’s talk about what we can control: how we respond.
Failure is not optional
Everything fails. The question isn’t “Will we have an outage?” but “When we have an outage, what happens next?”
This shift is liberating. You stop pursuing the impossible goal of zero failures and start building for the inevitable.
Response time > Recovery time
Fast detection means clear alerts, observable systems, and automated monitoring. Fast diagnosis means updated runbooks, accessible diagnostic tools, and clear ownership.
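As one illustration of what “fast detection” can mean in practice, here is a minimal health-check sketch. The endpoint, timeout, and alert threshold are all placeholder assumptions; a real setup would use your monitoring stack, but the shape of the logic is the same:

```python
import urllib.request

def check_health(url: str, timeout: float = 2.0) -> bool:
    """Return True if the service answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def should_alert(history: list, failures_to_alert: int = 3) -> bool:
    """Page only after N consecutive failures, so one blip doesn't wake anyone up."""
    if len(history) < failures_to_alert:
        return False
    return not any(history[-failures_to_alert:])

# Last four checks: one success, then three straight failures — time to page.
history = [True, False, False, False]
print(should_alert(history))  # prints True
```

The consecutive-failure threshold is the important design choice here: alerting on every failed probe trains people to ignore pages, which is the opposite of fast detection.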
Communication over complexity
When things break, your users need to know three things: that you know, that you’re working on it, and what they should do in the meantime.
Your status page shouldn’t be hosted in the region that’s down. Have a backup channel. Test it.
Practice over planning
Run failure drills with your actual team. Kill services. Corrupt data. Do it during business hours.
But here’s what makes practice valuable: it’s about building confidence in the people, not just testing systems. When an engineer has recovered from a simulated failure three times, they don’t panic when it happens for real.
Make practice safe to fail. Celebrate the mistakes you catch during drills.
Build a culture where practicing for failure is normal.
Build graceful degradation
What if your system could work, just… worse? Maybe you can’t process new orders, but you can show existing ones. Maybe you can’t do real-time, but you can do asynchronous.
Think about your system as capabilities with different criticality levels. When something fails, you lose capabilities incrementally rather than catastrophically.
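One way to sketch “capabilities with different criticality levels” is a fallback chain per capability: try the full-fidelity path, fall back to a degraded one, and only then return a safe default. The capability names and fallbacks below are hypothetical:

```python
def degrade(primary, fallback, default):
    """Try each capability in order; fall back instead of failing outright."""
    for attempt in (primary, fallback):
        try:
            return attempt()
        except Exception:
            continue  # this tier is down; drop to the next one
    return default  # degraded-but-not-down: a stale answer beats an error page

def live_orders():
    # Imagine this hits the database in the region that just went down.
    raise ConnectionError("orders DB unreachable")

def cached_orders():
    # Can't take new orders, but we can still show existing ones.
    return ["order-1042 (cached, may be stale)"]

print(degrade(live_orders, cached_orders, default=[]))
```

The design choice worth noting: the fallback order encodes criticality. Deciding ahead of time which tier each capability lands on during an incident is the work; the code itself is the easy part.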
The uncomfortable truth
Perfect reliability is astronomically expensive, and even then you can’t guarantee it.
AWS spends billions on reliability and us-east-1 still has outages. If perfect reliability is impractical for companies with unlimited resources, what does that mean for the rest of us?
It means we need honesty about trade-offs. We need to stop pretending there’s a “just” that solves everything. We need systems that fail well, not systems that never fail.
So what should you actually do?
Start with honesty:
- What’s your actual risk tolerance and budget?
- What’s your actual team capability?
- What does “good enough” look like?
Build for reality:
- Instrument everything
- Make recovery easy
- Communicate clearly
- Learn from incidents without blame
- Accept trade-offs
Build for humans:
- Protect your on-call engineers with reasonable rotations
- Create psychological safety
- Document for exhausted people at 3 AM
- Celebrate recovery, not just uptime
- Give teams time to build resilience
And most importantly: Stop saying “just.”
Every “just” hides complexity, dismisses trade-offs, and makes someone feel worse about a choice they made with the constraints they had.
When you’re tempted to say “they should just,” try instead: “I wonder what trade-offs led to that decision?” Empathy is cheaper than multi-region, and it makes everything better.
Final thought
Resilience isn’t about preventing every failure. It’s about taking care of the people who keep the systems running.
The next major outage will happen. You can’t prevent it. But you can prepare your team to respond with confidence rather than panic. You can create a culture where people feel supported, not blamed.
And when that outage happens to someone else, you can choose empathy over “just.”
Because behind every incident is a person doing their best. They deserve better than our armchair quarterbacking.
And that’s worth far more than any “just.”