March 6, 2026

How We Cut Our AWS Bill by 40% Without Sacrificing Performance

The Wake-Up Call

Our AWS bill hit $2.1 million per month. Finance asked engineering to explain. We couldn't. Not really. We had 847 EC2 instances, but nobody knew what half of them did. Reserved Instance coverage was at 23%. Someone was running GPU instances for a proof-of-concept that ended two years ago.

Twelve months later, we were at $1.2 million per month with better performance. Here's exactly how.

Phase 1: Visibility (Months 1-2)

You cannot optimize what you cannot see. Before making any changes, we built visibility.

Tagging enforcement: Every resource required Owner, Application, Environment, and CostCenter tags. AWS Config rules prevented untagged resource creation. We gave teams 30 days to tag existing resources or face deletion.
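AWS Config ships a managed rule, REQUIRED_TAGS, that flags resources missing specified tag keys. Here's a minimal sketch of how a rule like ours might be defined; the rule name and resource types are illustrative, not our exact configuration.

```python
import json

# Tag keys every resource had to carry.
REQUIRED_TAG_KEYS = ["Owner", "Application", "Environment", "CostCenter"]


def required_tags_rule(rule_name="enforce-cost-tags"):
    """Build the payload for config.put_config_rule() using the
    AWS-managed REQUIRED_TAGS rule. The managed rule accepts up to
    six tag keys as tag1Key..tag6Key input parameters."""
    params = {f"tag{i + 1}Key": key for i, key in enumerate(REQUIRED_TAG_KEYS)}
    return {
        "ConfigRuleName": rule_name,
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "REQUIRED_TAGS",  # AWS managed rule
        },
        "Scope": {
            # Illustrative subset; we scoped it to the resource types we billed on.
            "ComplianceResourceTypes": ["AWS::EC2::Instance", "AWS::S3::Bucket"],
        },
        "InputParameters": json.dumps(params),
    }


# To apply (requires AWS credentials):
# import boto3
# boto3.client("config").put_config_rule(ConfigRule=required_tags_rule())
```

Config rules only report non-compliance; actually blocking untagged resource creation takes an SCP or IaC pipeline check on top.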

Cost allocation: Enabled AWS Cost Explorer with tag-based allocation. Each team saw their monthly spend for the first time. The reactions were... enlightening.

Anomaly detection: AWS Cost Anomaly Detection flagged unexpected spikes. Several "temporary" test environments were discovered running for months.

Visibility alone—before any optimization—reduced our bill by 12% as teams cleaned up abandoned resources.

Phase 2: Right-Sizing (Months 3-4)

AWS Compute Optimizer analyzed our fleet. The results were embarrassing. 67% of instances were over-provisioned. Teams had requested "room to grow" years ago and never needed it.

We created a right-sizing program. Each week, Compute Optimizer recommendations were converted into tickets. Teams had one sprint to evaluate and implement or provide written justification for keeping current sizes.
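The weekly recommendations-to-tickets step can be sketched as a simple filter. The record shape below is simplified from what Compute Optimizer's `get_ec2_instance_recommendations` API returns; field names and the ticket format are illustrative.

```python
def overprovisioned(recommendations):
    """Keep findings worth a ticket: over-provisioned instances with a
    concrete recommended target. Input records are simplified versions
    of Compute Optimizer instance recommendations."""
    tickets = []
    for rec in recommendations:
        if rec["finding"] != "OVER_PROVISIONED":
            continue
        options = rec.get("recommendationOptions", [])
        if not options:
            continue
        best = options[0]  # assume options are ranked best-first
        tickets.append({
            "instance": rec["instanceArn"],
            "current": rec["currentInstanceType"],
            "recommended": best["instanceType"],
        })
    return tickets
```

Each ticket then went to the owning team (via the Owner tag from Phase 1) with a one-sprint deadline.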

The key: involve teams rather than mandating changes. They understood their workloads best. Some recommendations were wrong for specific use cases. Most weren't.

Right-sizing saved 18% of EC2 costs.

Phase 3: Purchasing Strategy (Months 5-6)

With stable, right-sized workloads, we could predict baseline capacity. That's when purchasing optimization becomes powerful.

Compute Savings Plans: Purchased 1-year plans covering 70% of our baseline EC2/Lambda compute. These flex across instance families and regions—much better than Reserved Instances for most use cases.
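The commitment sizing is back-of-envelope math: cover a fraction of baseline hourly on-demand spend at the Savings Plans rate. A sketch, with an assumed average discount (actual Savings Plans rates vary by instance family, region, and term; these numbers are illustrative, not our figures):

```python
def sp_commitment(hourly_on_demand, coverage=0.70, sp_discount=0.28):
    """Return (hourly commitment, estimated hourly savings) for a
    Compute Savings Plan.

    hourly_on_demand: baseline on-demand compute spend per hour ($)
    coverage: fraction of baseline to commit (we used 70%)
    sp_discount: assumed average discount vs on-demand (illustrative)
    """
    covered = hourly_on_demand * coverage
    commitment = covered * (1 - sp_discount)  # you commit to the discounted rate
    savings = covered - commitment
    return round(commitment, 2), round(savings, 2)
```

Committing to only 70% of baseline leaves headroom: if usage drops, the commitment is still fully utilized, and the uncovered remainder runs on-demand or Spot.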

Spot for stateless workloads: Development environments, batch processing, and stateless services moved to Spot. Karpenter managed Spot instance procurement with multi-instance-type fallback.
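In Karpenter, the multi-instance-type fallback is expressed as a NodePool that allows Spot capacity across several instance types, letting Karpenter pick whichever pool has capacity. A sketch (names and instance types are illustrative, and the exact API version depends on your Karpenter release):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateless-spot
spec:
  template:
    spec:
      requirements:
        # Spot only; stateless workloads tolerate interruption.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        # Several interchangeable types so Karpenter can fall back
        # when one Spot pool is exhausted.
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.large", "m6a.large", "m5.large", "c6i.large"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```

The wider the instance-type list, the lower the odds of a simultaneous Spot shortage across all pools.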

Reserved Instances for databases: RDS Reserved Instances for production databases, since Savings Plans don't cover RDS.

Purchasing optimization saved an additional 15%.

Phase 4: Architecture Changes (Months 7-12)

The final phase was harder: architectural changes that reduced resource needs.

Moved batch processing to Lambda instead of always-on EC2. Replaced self-managed Elasticsearch with OpenSearch Serverless for variable workloads. Implemented S3 Intelligent-Tiering for all objects over 128KB (Intelligent-Tiering doesn't monitor or tier objects smaller than that).
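The Intelligent-Tiering rollout is a one-line lifecycle rule per bucket. A sketch of the configuration, expressed as the payload for boto3's `put_bucket_lifecycle_configuration` (bucket name is illustrative):

```python
# Lifecycle rule: transition objects over 128KB into S3
# Intelligent-Tiering immediately (Days=0). The size filter skips
# small objects, which Intelligent-Tiering wouldn't tier anyway.
INTELLIGENT_TIERING_RULE = {
    "Rules": [
        {
            "ID": "intelligent-tiering-large-objects",
            "Status": "Enabled",
            "Filter": {"ObjectSizeGreaterThan": 131072},  # 128KB in bytes
            "Transitions": [
                {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"},
            ],
        }
    ]
}

# To apply (requires AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket",
#     LifecycleConfiguration=INTELLIGENT_TIERING_RULE,
# )
```

We rolled this out via a script over all buckets rather than per-team tickets, since the change is transparent to applications.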

These changes required engineering investment but provided ongoing savings.

The Ongoing Discipline

Cost optimization isn't a project. It's a practice. We now review Compute Optimizer weekly. Cost anomalies trigger immediate investigation. New projects include cost estimates in design reviews. Finance and engineering meet monthly to review trends.

The culture shift—treating cloud spend as an engineering concern, not just a finance concern—was more valuable than any specific optimization.
