Shaji John — Cloud Architect & Tech Leader

The Evolution Nobody Warns You About

Everyone's Terraform journey starts the same way. A single main.tf file. Everything works. Life is good. Then you add more resources. The file grows. You split into modules. State files multiply. Now you have 50 state files, circular dependencies, and that one module nobody dares to touch.

I've managed Terraform across 200+ AWS accounts at a Fortune 500 company. Here's how to avoid the traps I fell into.

The Directory Structure That Scales

Forget the flat structure from tutorials. In production, you need hierarchy that reflects organizational reality.

Organize by environment, then by component, then by region. Each environment (dev, staging, prod) gets complete isolation—separate state files, separate IAM roles, separate backend configurations. When prod breaks, you can confidently test fixes in staging because they share zero state.

Within environments, separate state by blast radius. Networking in one state. Databases in another. Application infrastructure in another. When someone accidentally runs terraform destroy against the application state, the VPC and RDS instances remain untouched.
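Here is a minimal sketch of what one leaf of that hierarchy might look like. The directory names, bucket, lock table, account ID, and role name are all placeholders I made up for illustration, not a prescribed convention.

```hcl
# Hypothetical layout (environment / component / region):
#
#   live/
#     prod/
#       networking/us-east-1/
#       databases/us-east-1/
#       applications/us-east-1/
#     staging/
#       (same components, separate bucket and role)
#
# File: live/prod/networking/us-east-1/backend.tf

terraform {
  backend "s3" {
    bucket         = "example-tfstate-prod" # one bucket per environment (assumed name)
    key            = "networking/us-east-1/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-tf-locks-prod" # state locking (assumed name)
    encrypt        = true
  }
}

provider "aws" {
  region = "us-east-1"

  # Environment-specific role; the account ID and role name are placeholders.
  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform-prod"
  }
}
```

Each component directory gets its own key, so destroying the applications state cannot touch the networking state. Newer Terraform releases also offer S3-native locking as an alternative to the DynamoDB table, so check the S3 backend documentation for the version you run.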

Module Design Philosophy

Bad modules try to do everything. Good modules do one thing well with sensible defaults.

I follow the "3-layer" module approach. Layer 1 is primitive modules—thin wrappers around single resources with company standards (like required tags). Layer 2 is composite modules—combining primitives into coherent patterns like "web application" (ALB + ECS + CloudWatch). Layer 3 is product modules—complete stacks for specific business needs.

Teams consume layer 2 and 3 modules. They rarely need layer 1 directly. This abstraction lets the platform team evolve underlying implementations without breaking consumers.
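To make the layering concrete, here is a rough sketch of a layer 1 primitive and a layer 2 module consuming it. The module paths, the required tag keys (owner, cost_center), and the 30-day retention default are assumptions for the example; a real "web application" composite would also wire in the ALB and ECS pieces.

```hcl
# --- modules/primitives/log_group/main.tf (layer 1, hypothetical) ---

variable "name" {
  type = string
}

variable "tags" {
  type = map(string)

  # Company standard: reject callers that omit the required tags.
  validation {
    condition = alltrue([
      for key in ["owner", "cost_center"] : contains(keys(var.tags), key)
    ])
    error_message = "Tags must include owner and cost_center."
  }
}

resource "aws_cloudwatch_log_group" "this" {
  name              = var.name
  retention_in_days = 30 # company default (assumed value)
  tags              = var.tags
}

output "name" {
  value = aws_cloudwatch_log_group.this.name
}

# --- modules/composites/web_application/logging.tf (layer 2, hypothetical) ---

variable "app_name" { type = string }
variable "tags" { type = map(string) }

# The composite only talks to the primitive's inputs and outputs.
module "app_logs" {
  source = "../../primitives/log_group"
  name   = "/apps/${var.app_name}"
  tags   = var.tags
}
```

Because consumers only see the composite's variables, the platform team can change the primitive's internals (retention, encryption, naming) without touching any caller.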

State Management at Scale

State is where Terraform becomes dangerous. Remote state with locking is mandatory. But there's more to consider.

State segmentation: More granular state means smaller blast radius and faster operations. My rule: if terraform plan takes more than 60 seconds, the state is too big.

State recovery: Enable versioning on the S3 bucket that holds your state. When (not if) someone corrupts state, you can restore a previous version of the state object.
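A minimal sketch of a versioned state bucket, assuming AWS provider v4 or later (where versioning is its own resource). The bucket name is a placeholder.

```hcl
resource "aws_s3_bucket" "tfstate" {
  bucket = "example-tfstate-prod" # assumed name
}

# Keeps every previous state object so a corrupted state can be rolled back.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

This bucket usually lives in its own small bootstrap configuration, since it has to exist before any backend can point at it.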

Cross-state references: Use data sources and remote state carefully. Circular dependencies between states are a nightmare to resolve. Design one-way dependencies: networking → databases → applications, never the reverse.
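As an illustration of a one-way dependency, the sketch below has the databases state read an output from the networking state and never the other way around. The output name private_subnet_ids, the aws_subnet.private resource, and the bucket and key are assumptions for the example.

```hcl
# --- networking state: publish only what downstream states need ---

output "private_subnet_ids" {
  value = aws_subnet.private[*].id # assumes subnets created with count
}

# --- databases state: read-only lookup of the networking outputs ---

data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "example-tfstate-prod" # assumed name
    key    = "networking/us-east-1/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_db_subnet_group" "main" {
  name       = "example-db"
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

If the networking state ever needs something from databases, that is the signal to rethink the boundary instead of adding a reverse lookup.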

The CI/CD Pipeline That Saves You

Never run terraform apply from a laptop in production. Ever. The pipeline is your protection.

PR opens: terraform plan runs automatically and posts its output as a PR comment. Reviewers can see exactly what will change before approving.

PR merges: terraform apply runs, with approval gates for production. The apply output is captured in CI logs as an audit trail.

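Drift detection: a scheduled job runs terraform plan -detailed-exitcode against every state daily. Exit code 2 means the plan found changes; if nothing was merged, someone changed infrastructure by hand, so alert immediately.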

The One Thing I'd Do Differently

Start with Terragrunt from day one. DRY configuration, automatic backend configuration, and dependency management solve problems that become extremely painful to fix later. The learning curve is worth it.
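For a sense of what that buys you, here is a minimal Terragrunt sketch covering the three points above. The file layout, bucket and table names, module path, and the private_subnet_ids output are assumptions for the example; check the Terragrunt documentation for the version you use.

```hcl
# --- live/prod/root.hcl (hypothetical root config, written once) ---

remote_state {
  backend = "s3"

  # Writes backend.tf into each unit so nobody hand-maintains it.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket         = "example-tfstate-prod" # assumed name
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-tf-locks-prod" # assumed name
  }
}

# --- live/prod/databases/terragrunt.hcl (hypothetical unit) ---

include "root" {
  path = find_in_parent_folders("root.hcl")
}

terraform {
  source = "../../../modules/composites/database" # assumed module path
}

# Explicit, one-way dependency on the networking unit.
dependency "networking" {
  config_path = "../networking"
}

inputs = {
  subnet_ids = dependency.networking.outputs.private_subnet_ids
}
```

The state key is derived from the directory path, so adding a new component is a new folder with a short terragrunt.hcl instead of another copy of the backend block.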

Infrastructure as Code: Lessons from Managing 200+ AWS Accounts