Terraform and AWS: I Broke Production Twice Before I Figured This Out
Infrastructure as Code with Terraform on AWS — practical lessons and real mistakes
Mostafa
Fractional CTO & Software Architect
I’ve spent the last decade building and scaling systems on AWS. And for the last 6 years, Terraform has been my primary tool for doing it. It’s powerful. It’s flexible. And it’s deceptively easy to screw things up with. Seriously.
I’ve seen teams spend weeks wrestling with Terraform, only to end up with inconsistent infrastructure and exploding costs. I’ve been there. I broke production twice before I understood the core principles. This isn’t a “Terraform best practices” guide. This is what actually matters when you’re managing complex environments.
State Management: The Silent Killer
Let’s start with state. Terraform’s state file is the single source of truth. Sounds simple, right? Wrong. I’ve seen teams treat it like a shared text file. Disaster.
When I was at Zkawa, we were a small team, moving fast. We had a single Terraform state file in a Git repo. Developer A would run terraform apply while Developer B ran terraform apply at the same time. Chaos. Resource conflicts. The state file became corrupted. We spent a full day manually untangling dependencies. It was… not fun.
The problem is Terraform needs exclusive access to the state file. Concurrent writes will destroy it.
The solution? Remote state backends. Specifically, AWS S3 and DynamoDB.
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "state/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # enables state locking
    encrypt        = true
  }
}
S3 stores the state file. DynamoDB provides locking. It’s straightforward. It adds complexity, sure. But the alternative is far worse. I’ve seen teams try to “solve” this with file locking in Git. Don’t. It’s fragile. It’s slow. S3 and DynamoDB are the right answer.
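One wrinkle: the bucket and lock table have to exist before the backend can use them, so you bootstrap them once from a separate configuration. A minimal sketch — the bucket and table names here are illustrative, not from any real setup:

```hcl
# Bootstrap resources for remote state. Apply these once, from a
# configuration that does NOT use the S3 backend, since Terraform
# can't store state in a bucket that doesn't exist yet.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state"
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled" # versioning lets you recover from a bad state write
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend requires exactly this key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Turn on bucket versioning even if you think you don't need it. The first time a bad apply mangles your state, you'll be glad you can roll back.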
Cost Optimization: It’s Not Just About Discounts
People think IaC is about automation. It’s not. It’s about control. And that control extends to your AWS bill.
I migrated a monolith to microservices at a fintech company. Each microservice had its own Terraform configuration. The initial setup was… generous. We provisioned large EC2 instances, over-sized databases. “Just to be safe,” we told ourselves.
Within a month, the bill was 30% higher than projected. I dug in. The problem wasn’t the architecture. It was the infrastructure. We were paying for capacity we didn’t need.
Terraform modules became our salvation. We created reusable modules for common components: databases, load balancers, queues. These modules were pre-configured with optimized instance types and scaling policies.
For example, a database module might look like this (simplified):
resource "aws_db_instance" "example" {
  instance_class = var.instance_class
  engine         = "postgres"
  # ... other config
}

variable "instance_class" {
  type    = string
  default = "db.t3.medium"
}
</resource>
The key is the var.instance_class. We could easily swap out instance types without modifying the core configuration. We started experimenting. Downsized instances. Enabled auto-scaling. Within two weeks, we saved 40% on cloud costs.
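Calling the module per service is where this pays off. A sketch — the module path and service names are hypothetical, not from our actual codebase:

```hcl
# Each service gets the same battle-tested module; only the knobs differ.
module "orders_db" {
  source         = "./modules/database"
  instance_class = "db.t3.medium" # the default is enough here
}

module "analytics_db" {
  source         = "./modules/database"
  instance_class = "db.r5.large" # only this workload needs the bigger box
}
```

Downsizing an instance becomes a one-line diff that shows up in terraform plan, instead of a console click nobody remembers making.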
Don’t just automate your existing bad habits. Use Terraform to enforce cost optimization.
CI/CD: Stop Treating Infrastructure Like a Mystery
I’ve seen teams manually run terraform apply in production. I shudder just thinking about it.
When I was managing a team across multiple time zones, manual deployments were a nightmare. Someone would inevitably forget a step. Or make a typo. Or deploy the wrong version.
The solution? CI/CD. Specifically, GitHub Actions paired with Terraform.
name: Terraform Deploy
on:
  push:
    branches:
      - main
jobs:
  deploy:
    runs-on: ubuntu-latest
    # AWS credentials come from repo secrets or OIDC (not shown here)
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v1
      - run: terraform init
      - run: terraform plan
      - run: terraform apply -auto-approve
This is basic. But it’s transformative. Every code change triggers a pipeline. Terraform validates the configuration. Terraform generates a plan. Terraform applies the changes.
It eliminates manual errors. It ensures consistency. It provides an audit trail.
I will admit, it adds overhead. You need to write and maintain pipelines. But the peace of mind is worth it.
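One refinement worth making early: run terraform plan on pull requests and reserve apply for merges to main, so reviewers see exactly what will change before it does. A sketch, assuming the same repo layout as the deploy workflow:

```yaml
# Plan-only job for pull requests. Pair it with the deploy workflow,
# which stays gated on pushes to main.
name: Terraform Plan
on:
  pull_request:
    branches:
      - main
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v1
      - run: terraform init
      - run: terraform plan -no-color
```

The plan output in the PR is the review artifact. If the diff surprises you there, it would have surprised you in production too.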
The Complexity Argument
Some people say Terraform is too complex. That simpler tools are good enough. I disagree.
I’ve seen teams try to get away with CloudFormation. Or even shell scripts. It works… for a while. But as your infrastructure grows, the complexity will inevitably overwhelm you. You’ll end up rebuilding half of Terraform yourself anyway.
Terraform’s power comes from its flexibility. Its provider ecosystem. Its state management capabilities. Its module system. It’s not a magic bullet. But it’s the most powerful tool I’ve found for managing large-scale AWS infrastructure.
Final Thoughts
Terraform isn’t about writing code. It’s about managing risk. It’s about controlling costs. It’s about building reliable systems.
Implement proper state management. Integrate with CI/CD pipelines. Continuously optimize AWS resources. And for the love of everything holy, don’t treat your state file like a shared text file. You’ve been warned.
And if you’re still on the fence, remember this: I broke production twice. Learning those lessons cost me a lot of sleep. Don’t make the same mistakes.