Shaji John — Cloud Architect & Tech Leader

The Humbling Experience

We were confident. Our system handled production traffic without major incidents. Deployments were automated. Monitoring seemed comprehensive. Then we ran a formal Well-Architected Review.

Forty-seven high-risk findings. Our "solid" architecture had critical gaps we'd been too close to see.

What The Review Revealed

The AWS Well-Architected Framework examines six pillars: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. Our gaps were spread across all of them.

Security: IAM policies using wildcards. Security groups allowing broad access "temporarily" for three years. No encryption on EBS volumes because "it was optional when we launched."

Reliability: Single-AZ database because "RDS handles failover." (It doesn't, not without Multi-AZ.) Auto-scaling configured but never tested. Backup retention at default 7 days when compliance required 90.

Operational Excellence: Runbooks existed but were outdated. Incident response was tribal knowledge. Changes deployed without documented rollback procedures.

Every finding was technically obvious in retrospect. But nobody had systematically reviewed against a framework.

The Prioritization Challenge

Forty-seven findings can't be fixed at once. We needed prioritization. Here's how we approached it:

Category 1: Security and compliance risks. These block audits and create legal liability. Fix immediately regardless of engineering cost.

Category 2: Reliability gaps that affect customers. Single points of failure, inadequate disaster recovery, missing monitoring. Fix within 30 days.

Category 3: Operational improvements. Process gaps, documentation, efficiency improvements. Address within 90 days.

Category 4: Optimizations. Cost efficiency, performance tuning, sustainability. Backlog for continuous improvement.

The Implementation Journey

Security fixes came first. We created IAM policies from scratch using least-privilege principles. Enabled encryption on everything. Reviewed and tightened every security group.

Reliability improvements followed. Migrated RDS to Multi-AZ. Actually tested auto-scaling by simulating load. Extended backup retention and tested restore procedures.

The operational excellence work took longest because it required culture change, not just configuration. We wrote runbooks collaboratively. Practiced incident response with tabletop exercises. Required rollback plans for every deployment.

Making It Sustainable

The review findings were symptoms of a missing practice: regular architecture assessment. We implemented ongoing measures:

Quarterly Well-Architected reviews for critical workloads. New projects require Well-Architected alignment before production. AWS Trusted Advisor and Security Hub findings reviewed weekly.

Architecture debt, like technical debt, accumulates silently. Regular reviews make it visible before it becomes critical.

The Unexpected Benefits

Beyond fixing risks, the review improved team capability. Engineers learned AWS best practices they hadn't encountered. Design reviews became more rigorous. New hires onboarded faster with documented architecture.

I now recommend Well-Architected Reviews for any organization that hasn't done one recently. The findings will surprise you. The improvements will be worth it.

The Well-Architected Review That Changed Our Architecture