Azure FinOps Essentials
Designing for Failure Without Paying Twice
Hi there, and welcome to this week’s edition of Azure FinOps Essentials.
This week, I am tackling a challenge many teams face but rarely discuss: how to design for failure without designing yourself into a bigger invoice.
High availability is important. But most cloud architectures quietly overdo it. Secondary regions run at full power. Zone redundancy is enabled by default. Premium SKUs are selected for peace of mind, not based on real risk.
The result? You get resilience, but you also get a bill that keeps growing.
In this edition, I will show you how to design fault-tolerant systems without paying for infrastructure you do not need. You will see common over-engineering patterns, better tradeoffs, and Azure tools that help you build smart without building twice.
Let’s dig into the cost side of reliability.
Cheers, Michiel
Find out why 100K+ engineers read The Code twice a week
Falling behind on tech trends can be a career killer.
But let’s face it, no one has hours to spare every week trying to stay updated.
That’s why over 100,000 engineers at companies like Google, Meta, and Apple read The Code twice a week.
Here’s why it works:
No fluff, just signal – Learn the most important tech news delivered in just two short emails.
Supercharge your skills – Get access to top research papers and resources that give you an edge in the industry.
See the future first – Discover what’s next before it hits the mainstream, so you can lead, not follow.
Cost and resilience pull in opposite directions
We’re told to design for failure.
Use multiple zones.
Use premium storage.
Add a second region.
Scale out, not up.
It all makes sense — until the invoice arrives.
In many teams, availability architecture is driven by worst-case thinking. Outages are scary, and nobody wants to be blamed for downtime. So we duplicate services, pay for extra capacity, and layer in failover options “just in case.” The problem is that these decisions are rarely reviewed later.
Over time, what starts as a good-faith effort to build resilience turns into a quiet cost multiplier.
I’ve seen secondary regions that were never used, but fully provisioned.
Zone redundancy on services that don’t need it.
Premium SKUs running in environments with zero traffic.
It’s not that resilience is the problem.
It’s that we’ve stopped treating it like a tradeoff.
In this edition, I’ll break down how to design architectures that can handle failure and respect your cloud budget. We’ll explore over-engineering patterns, cost-aware design alternatives, and how to make reliability decisions that are both technical and financial.
Let’s start with where things often go too far.
Common over-engineering patterns
Most teams don’t set out to overspend on availability. These patterns often start with good intentions: protect against failure, reduce risk, improve uptime. But without clear boundaries, they quietly expand into over-engineering and unnecessary cost.
Here are some of the most common examples.
Always-on capacity in multiple regions
Setting up a secondary region is a solid high availability strategy. But if both regions are active and always-on, you’re paying double. Even worse, the secondary region often goes unused for months. If failover is rare, consider an active-passive setup with lower specs or a cold standby model.
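To make the tradeoff concrete, here is a minimal back-of-the-envelope sketch in Python. The prices and the warm-standby factor are made-up placeholders, not real Azure rates; plug in your own numbers from the pricing calculator.

```python
# Rough monthly cost comparison for three failover models.
# All figures are hypothetical placeholders, not real Azure prices.

PRIMARY_REGION_COST = 4_000   # full production footprint per month
WARM_STANDBY_FACTOR = 0.35    # scaled-down passive region: smaller SKUs, fewer instances
COLD_STANDBY_COST = 300       # storage, backups and IaC pipelines only

scenarios = {
    "active-active":         PRIMARY_REGION_COST * 2,
    "active-passive (warm)": PRIMARY_REGION_COST * (1 + WARM_STANDBY_FACTOR),
    "cold standby":          PRIMARY_REGION_COST + COLD_STANDBY_COST,
}

for name, monthly in scenarios.items():
    print(f"{name:>24}: {monthly:>8,.0f} EUR/month")
```

The exact numbers matter less than the shape of the result: a passive region that only scales up when failover actually triggers costs a fraction of a permanently mirrored one.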
Zone redundancy by default
Many services like Azure Storage, App Service, and SQL offer ZRS or zone-redundant options. These are useful, but not always required. I’ve seen dev and test environments use ZRS without any business requirement. That means you’re paying for durability where no one needs it.
Premium SKUs selected for peace of mind
Premium tiers often include SLAs, scaling options, or extra features. But once selected, they rarely get reviewed. A team might choose Premium V3 for an App Service that only runs a staging app, or pick Premium Functions to avoid cold starts, even when cold starts aren’t a real issue.
Overprovisioning for hypothetical traffic
Failover planning often assumes peak usage during an outage. So teams provision the backup region or failover path for worst-case load, even if traffic would likely drop in a real-world incident. You don’t need to match full production scale unless business rules explicitly require it.
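A quick sanity check, again with hypothetical numbers: size the failover path for the traffic you actually expect during an incident, not for your normal peak.

```python
import math

# Hypothetical sizing check for a failover region.
PEAK_RPS = 1_200                   # normal production peak (requests per second)
EXPECTED_DROP_DURING_OUTAGE = 0.4  # assume 40% of traffic disappears or can be shed
RPS_PER_INSTANCE = 150             # measured capacity of a single instance

failover_instances = math.ceil(PEAK_RPS * (1 - EXPECTED_DROP_DURING_OUTAGE) / RPS_PER_INSTANCE)
peak_instances = math.ceil(PEAK_RPS / RPS_PER_INSTANCE)

print(f"Failover path needs ~{failover_instances} instances, "
      f"not the {peak_instances} you run at peak.")
```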
Identical infrastructure across all environments
Using the same architecture for dev, test, and prod improves consistency. But when that includes regional failover, zone redundancy, or premium services, you’re adding cost where risk is low. Resilience should be based on how critical the environment really is.
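A simple way to surface the last three patterns is to scan a resource inventory, for example an export from Azure Resource Graph, for redundancy and premium SKUs in environments that are not production. The field names, tag values, and SKU markers below are assumptions about your export format; adapt them to your own data.

```python
# Flag zone-redundant or premium SKUs outside production.
# 'inventory' mimics a flattened resource export; field names are hypothetical.
inventory = [
    {"name": "stdevlogs01", "sku": "Standard_ZRS",     "env": "dev"},
    {"name": "app-staging", "sku": "P1v3",             "env": "test"},
    {"name": "app-prod",    "sku": "P1v3",             "env": "prod"},
    {"name": "sqldev",      "sku": "BusinessCritical", "env": "dev"},
]

EXPENSIVE_MARKERS = ("ZRS", "GZRS", "Premium", "P1", "P2", "P3", "BusinessCritical")

findings = [
    r for r in inventory
    if r["env"] != "prod" and any(m in r["sku"] for m in EXPENSIVE_MARKERS)
]

for r in findings:
    print(f"Review {r['name']}: {r['sku']} in {r['env']} - is this tier really needed here?")
```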
These patterns are not always wrong, but when left unchecked, they shift your cloud spend away from value and toward silent waste.
Designing for failure without paying twice
Resilience is important, but that does not mean every system needs the highest possible uptime. In practice, many workloads can tolerate short periods of downtime, degraded performance, or manual failover. Designing for failure does not mean duplicating everything. It means being deliberate about what you protect, how you recover, and what you are willing to pay for.
Here are practical strategies to reduce availability costs while still protecting against failure, along with Azure tools that can support those choices.
Use active-passive failover where possible
Not every workload needs full active-active deployments. For many internal tools, batch systems, or customer-facing portals with moderate usage, an active-passive setup is enough. Keep the passive region warm or cold, depending on your recovery time goals. Only scale when failover is triggered.
Azure tools that help:
Azure Traffic Manager to route traffic based on endpoint health
Azure DNS failover to switch over at the DNS level
Azure Site Recovery to maintain a cold standby environment with lower SKUs
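Under the hood, priority-based routing is simple: send all traffic to the highest-priority endpoint that is healthy, and only fall back when it is not. This sketch reproduces that decision logic in plain Python. It is not Traffic Manager’s API, just the behaviour you are configuring; the endpoint names and health values are made up.

```python
# Priority (active-passive) endpoint selection, the behaviour behind a
# 'Priority' routing method. Endpoints and health states are illustrative.
endpoints = [
    {"name": "weu-primary",   "priority": 1, "healthy": False},  # primary region down
    {"name": "neu-secondary", "priority": 2, "healthy": True},   # warm standby
]

def pick_endpoint(endpoints):
    """Return the healthy endpoint with the lowest priority number, if any."""
    healthy = [e for e in endpoints if e["healthy"]]
    return min(healthy, key=lambda e: e["priority"]) if healthy else None

target = pick_endpoint(endpoints)
print(f"Routing traffic to: {target['name'] if target else 'no healthy endpoint'}")
```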
Choose lower tiers for standby
Your secondary region does not need to match production specs. Use lower SKUs, minimal autoscaling, or standard tiers for the backup path. If failover is rare and temporary, the user experience can degrade slightly without breaking the service.
Azure tools that help:
App Service Standard tiers with autoscale rules scoped to regions
Azure Monitor autoscale for predictable growth when failover is triggered
ARM templates or Bicep to recreate or resize infrastructure quickly
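One way to keep this honest is to describe primary and standby as separate deployment profiles in your IaC parameters, so the standby never silently inherits production sizing. The SKU names and structure below are illustrative, not a prescribed Bicep or ARM parameter schema.

```python
# Hypothetical deployment profiles that an IaC pipeline could consume.
# The standby runs small by default and only scales out when failover is triggered.
profiles = {
    "primary": {"app_service_sku": "P1v3", "instances": 4, "autoscale_max": 10},
    "standby": {"app_service_sku": "S1",   "instances": 1, "autoscale_max": 6},
}

def target_profile(role: str, failover_active: bool) -> dict:
    """Return the sizing to deploy; bump the standby only during an actual failover."""
    profile = dict(profiles[role])
    if role == "standby" and failover_active:
        profile["instances"] = profile["autoscale_max"]
    return profile

print(target_profile("standby", failover_active=False))  # cheap day-to-day footprint
print(target_profile("standby", failover_active=True))   # scaled up only when needed
```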
Build for fast recovery, not just high uptime
If the cost of high availability is out of line with the business value of the system, invest in fast recovery instead. That means automation, snapshots, and clear recovery steps rather than duplicated infrastructure.
Azure tools that help:
Azure Backup for storage, VM, and SQL snapshots
Infrastructure-as-code for redeployment speed
Deployment slots to speed up controlled swaps without impact
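Whether fast recovery beats a hot standby is ultimately arithmetic: compare the expected cost of downtime against the standing cost of duplicated infrastructure. A minimal sketch, with made-up numbers:

```python
# Compare 'pay for downtime when it happens' vs 'pay for a hot standby all year'.
# All figures are hypothetical; replace them with your own estimates.
DOWNTIME_COST_PER_HOUR = 500      # business impact of this workload being offline
EXPECTED_OUTAGES_PER_YEAR = 2
RECOVERY_TIME_HOURS = 1.5         # redeploy via IaC + restore the latest backup
HOT_STANDBY_COST_PER_MONTH = 1_200

expected_downtime_cost = DOWNTIME_COST_PER_HOUR * EXPECTED_OUTAGES_PER_YEAR * RECOVERY_TIME_HOURS
hot_standby_cost = HOT_STANDBY_COST_PER_MONTH * 12

print(f"Fast-recovery risk cost: ~{expected_downtime_cost:,.0f} per year")
print(f"Hot standby cost:        ~{hot_standby_cost:,.0f} per year")
```

If the first number is a fraction of the second, automation and snapshots are the better buy.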
Match strategy to service criticality
Critical production systems may need premium uptime guarantees. Internal tools or non-critical APIs probably do not. Resilience should reflect the impact of failure, not a blanket rule applied across environments.
Azure tools that help:
Azure Policy and Management Groups to enforce different standards per environment
Tags to classify services by business impact
Budgets and alerts scoped to environment and workload type
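Tags only help if they map to concrete resilience decisions. Here is a hedged sketch of such a mapping; the tag values and requirements are examples I made up, not an Azure-defined scheme.

```python
# Map a business-impact tag to the resilience features a workload should get.
RESILIENCE_STANDARDS = {
    "critical": {"zone_redundant": True,  "secondary_region": True,  "backup": "hourly"},
    "standard": {"zone_redundant": True,  "secondary_region": False, "backup": "daily"},
    "low":      {"zone_redundant": False, "secondary_region": False, "backup": "weekly"},
}

def required_resilience(tags: dict) -> dict:
    """Look up the standard for a resource based on its (hypothetical) 'impact' tag."""
    return RESILIENCE_STANDARDS.get(tags.get("impact", "low"), RESILIENCE_STANDARDS["low"])

print(required_resilience({"impact": "critical", "env": "prod"}))
print(required_resilience({"env": "dev"}))  # untagged resources default to the cheapest standard
```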
Watch your failover costs
If you have a standby region, monitor it. These environments often drift into production usage. Someone adds a Function App, enables logging, or forgets to turn it off, and suddenly your failover zone is costing as much as production.
Azure tools that help:
Azure Cost Management to detect unexpected resource growth
Scheduled shutdown scripts for dev, test, and backup infrastructure
Alerts on specific SKUs, tags, or subscriptions tied to failover regions
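A lightweight drift check can be as simple as reading a Cost Management export and flagging the failover region when it exceeds an agreed baseline. The column names below are assumptions about your export layout, not a fixed format.

```python
import csv

# Flag failover-region spend that drifts above an agreed monthly baseline.
# Assumes a cost export with (hypothetical) columns: region, resource, monthly_cost.
BASELINE_EUR = 800          # what the standby footprint is supposed to cost
FAILOVER_REGION = "northeurope"

def failover_spend(path: str) -> float:
    with open(path, newline="") as f:
        rows = csv.DictReader(f)
        return sum(float(r["monthly_cost"]) for r in rows if r["region"] == FAILOVER_REGION)

spend = failover_spend("cost-export.csv")
if spend > BASELINE_EUR:
    print(f"Failover region at {spend:,.0f} EUR, baseline is {BASELINE_EUR} EUR. Check for drift.")
```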
The best failover strategy is not always the most redundant one. It is the one that meets business continuity needs at a cost the business is willing to pay. Azure gives you the building blocks to design resilient systems. FinOps helps you decide which ones to use, and when enough is enough.
Wrapping up
Resilience matters. But so does cost.
Many teams build for availability without asking what it is worth. They duplicate infrastructure, overprovision capacity, and pay for premium uptime guarantees, even when the business impact of failure is low.
FinOps is not about cutting these protections. It is about making the tradeoffs visible. It gives you the language and data to ask the right questions.
Do we really need this to be active-active?
Can we recover in minutes instead of staying always-on?
Does the cost align with the risk?
When cost becomes part of the design conversation, resilience becomes intentional.
And that is where the real savings begin.
Most coverage tells you what happened. Fintech Takes is the free newsletter that tells you why it matters. Each week, I break down the trends, deals, and regulatory shifts shaping the industry — minus the spin. Clear analysis, smart context, and a little humor so you actually enjoy reading it.
Please help me by visiting my sponsor. Interested in sponsoring yourself? Then visit the sponsor page.
Thanks for reading this week’s edition. Share with your colleagues and make sure to subscribe to receive more weekly tips. See you next time!
Want more FinOps news? Then have a look at FinOps Weekly by Victor Garcia.