Infrastructure as Code: Terraform in Production, No Regrets
Many organizations start Terraform projects with excitement. "Everything in code! Full reproducibility! No more manual drift!" Three months later, they're in chaos: state files out of sync, developers accidentally destroying production databases, and the entire project grinding to a halt while someone figures out how to recover.
The technology isn't at fault. Terraform is powerful and reliable. The failures come from organizational and process issues, not tooling.
Here's how to implement Terraform in production without the chaos.
Why Terraform, Not Other Tools?
A quick comparison:
| Tool | Best For | Learning Curve | State Management |
|---|---|---|---|
| Terraform | Multi-cloud, complex infrastructure | Moderate | Explicit (state file) |
| CloudFormation | AWS-only deployments | Steep | Implicit |
| Pulumi | Polyglot teams (Python, Go, etc.) | Moderate | Explicit |
| Ansible | Configuration management, simpler setup | Gentle | Implicit |
For Swiss enterprises running multi-cloud or hybrid infrastructure, Terraform is the standard choice. It works across AWS, GCP, Azure, Kubernetes, and on-premises. That flexibility matters when you're not locked into a single provider.
The Core Concepts
State File: The Ground Truth
Terraform maintains a state file that maps your code to actual resources. When you run terraform apply, Terraform:
- Reads your code (desired state)
- Reads the state file (current state)
- Compares them
- Calculates what needs to change
- Applies those changes
Critical insight: If your state file is wrong, Terraform will make wrong decisions.
This is the #1 source of Terraform disasters:
Scenario: A developer manually resizes an instance in the AWS console.
- Actual state: the instance now has 4GB RAM
- State file and code: the instance has 2GB RAM
- On the next apply, Terraform refreshes, sees the drift, and "corrects" it back to 2GB
- Result: the running instance loses 2GB RAM (and might crash)
Prevention: Never manually change infrastructure. Always use Terraform.
Remote State: Essential for Teams
Don't store state files locally. They'll get out of sync across team members.
Setup:
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
This stores state in AWS S3 with encryption and locking (prevents simultaneous changes).
Equivalent for GCP:
terraform {
  backend "gcs" {
    bucket = "my-terraform-state"
    prefix = "prod"
  }
}
Reality check: If you skip remote state, your team will have divergent infrastructure. Don't skip this.
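The DynamoDB lock table referenced in the S3 backend can itself be bootstrapped with Terraform. A minimal sketch (the table name must match the backend config, and the S3 backend requires the hash key to be named exactly LockID):

```hcl
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"   # must match dynamodb_table in the backend block
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"            # the S3 backend expects exactly this key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Bootstrap note: this table (and the state bucket) must exist before the backend can use them, so they are typically created once with local state or by hand.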
The Implementation Roadmap
Phase 1: Foundation (Weeks 1-4)
Goal: Set up basic infrastructure as code without breaking anything.
Step 1: Choose your layout
terraform/
├── main.tf (primary infrastructure)
├── variables.tf (input variables)
├── outputs.tf (what to expose)
├── terraform.tfvars (variable values)
├── prod/
│ └── terraform.tfvars (prod-specific values)
└── dev/
└── terraform.tfvars (dev-specific values)
Step 2: Start with non-critical infrastructure
Don't Terraform production on day one. Start with dev/staging environments.
# main.tf
terraform {
  backend "s3" {
    bucket = "state-bucket"
    key    = "dev/terraform.tfstate"
    region = "eu-central-1"
  }
}

provider "aws" {
  region = var.region
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  tags = {
    Name = "web-server"
  }
}
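The ami reference above assumes a data source that isn't shown. A minimal lookup for a recent Ubuntu AMI might look like this (the Canonical owner ID and name filter are assumptions to verify for your region and Ubuntu release):

```hcl
# Resolve the latest Ubuntu 22.04 AMI published by Canonical
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical's AWS account

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}
```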
Step 3: Test the workflow
- Run terraform init (initialize the working directory)
- Run terraform plan (show what would change)
- Run terraform apply (apply the changes)
- Verify in the console that the resources were created
- Modify the code slightly
- Run terraform plan again (it shows what would change)
- Run terraform destroy (clean up)
This becomes muscle memory.
Phase 2: Scaling (Weeks 5-12)
Goal: Build reusable patterns and handle multiple environments.
Modules: Reusable infrastructure components
Instead of repeating code, create modules:
# modules/kubernetes_cluster/main.tf
resource "aws_eks_cluster" "this" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = var.kubernetes_version

  vpc_config {
    subnet_ids = var.subnet_ids
  }
}

# modules/kubernetes_cluster/variables.tf
variable "cluster_name" {
  type        = string
  description = "EKS cluster name"
}

variable "kubernetes_version" {
  type    = string
  default = "1.28"
}

variable "subnet_ids" {
  type        = list(string)
  description = "Subnets for the cluster's VPC config"
}
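Input variables can also guard against bad values at plan time. A sketch using Terraform's validation block, shown as a variant of the version variable above (the allowed versions are illustrative):

```hcl
variable "kubernetes_version" {
  type    = string
  default = "1.28"

  validation {
    condition     = contains(["1.27", "1.28", "1.29"], var.kubernetes_version)
    error_message = "kubernetes_version must be one of 1.27, 1.28, 1.29."
  }
}
```

With this in place, a typo like "12.8" fails at plan time instead of reaching the cloud provider.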
Using the module:
# prod/main.tf
module "prod_cluster" {
  source             = "../modules/kubernetes_cluster"
  cluster_name       = "prod-cluster"
  kubernetes_version = "1.28"
  subnet_ids         = [aws_subnet.a.id, aws_subnet.b.id]
}
This eliminates code duplication across environments.
Environment separation
terraform/
├── modules/
│ ├── kubernetes_cluster/
│ ├── database/
│ └── monitoring/
├── prod/
│ └── main.tf (uses modules with prod values)
├── staging/
│ └── main.tf (uses same modules with staging values)
└── dev/
└── main.tf (uses same modules with dev values)
Each environment is self-contained, but shared modules ensure consistency.
Phase 3: Production Hardening (Weeks 13+)
Goal: Safe production deployments with review and validation.
Code review workflow
Developer commits Terraform code
↓
CI/CD runs: terraform plan
↓
Plan output posted to PR for review
↓
Team reviews: "Does this change look correct?"
↓
If approved: terraform apply (automated)
↓
Apply results posted to PR
Example CI/CD (GitHub Actions):
name: Terraform
on:
  pull_request:
    paths:
      - 'terraform/**'
  push:
    branches: [main]
    paths:
      - 'terraform/**'
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - run: terraform init
      - run: terraform plan -out=tfplan
      - uses: actions/upload-artifact@v3
        with:
          name: tfplan
          path: tfplan
      # Apply the reviewed, saved plan only on merge to main
      - run: terraform apply tfplan
        if: github.event_name == 'push'
Common Pitfalls and Recovery
Pitfall 1: State File Corruption
Symptom: "Terraform thinks resource X exists, but it doesn't."
Prevention:
- Remote state with encryption
- Regular backups
- Never edit state files manually
Recovery:
# If the resource is gone but Terraform still tracks it in state:
terraform state rm aws_instance.web
# The next apply will recreate it

# If the state file is corrupted:
# restore from a backup, never rebuild it by hand from the console
Pitfall 2: Accidental Destruction
Symptom: A typo in code or variable destroys your database.
Prevention:
- Run terraform plan before every apply (review what changes)
- Use the -target flag for surgical changes
- Code review for production changes
- Pre-production validation
Recovery:
# Don't panic: terraform destroy asks for confirmation,
# and terraform plan shows pending deletions first.
# Read that output carefully before confirming.

# If something really was deleted:
# restore from backups (this is where disaster recovery saves you)
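Another guardrail worth adding: the lifecycle meta-argument can make Terraform refuse to destroy a critical resource outright. A sketch (the resource and its arguments are illustrative):

```hcl
resource "aws_rds_cluster" "primary" {
  cluster_identifier = "prod-db"
  engine             = "aurora-postgresql"

  lifecycle {
    prevent_destroy = true # any plan that would destroy this resource fails
  }
}
```

A typo that would delete the database now produces a plan error instead of a deleted database.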
Pitfall 3: State Drift
Symptom: Someone manually changed infrastructure in the console. Terraform doesn't know about it.
Prevention:
- Policy: "Never manually change infrastructure"
- Audit logging on cloud changes
- Regular terraform plan runs to detect drift
Detection:
terraform plan
# If it shows changes not in your code, you have drift.
# Figure out what happened and update code.
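If a manually created resource should be kept and managed, Terraform 1.5+ can adopt it into state declaratively with an import block (the instance ID here is a placeholder):

```hcl
# Adopt an existing, unmanaged instance into Terraform state
import {
  to = aws_instance.web
  id = "i-0123456789abcdef0" # placeholder: the real instance ID
}
```

On the next plan, Terraform imports the resource instead of trying to create a duplicate; it can even generate matching configuration with plan's -generate-config-out flag.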
Pitfall 4: Credential Leaks
Symptom: Database passwords and API keys end up in your Terraform code.
Prevention:
# WRONG: never do this
variable "db_password" {
  default = "superSecure123" # BAD: secret committed to version control
}

# RIGHT: declare the variable without a default and inject it from outside
variable "db_password" {
  type      = string
  sensitive = true
}

resource "aws_ssm_parameter" "db_password" {
  name  = "/prod/db/password"
  type  = "SecureString"
  value = var.db_password # injected from environment
}

# Inject via environment: TF_VAR_db_password=xxx terraform apply
Better: Use AWS Secrets Manager, HashiCorp Vault, or similar.
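Reading the secret at apply time keeps it out of code entirely. A sketch using Secrets Manager (the secret name is an assumption; note the value still lands in the state file, one more reason to encrypt state):

```hcl
# Fetch the current secret value at plan/apply time
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "prod/db/password"
}

# Reference it where needed, for example:
# password = data.aws_secretsmanager_secret_version.db.secret_string
```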
Terraform Best Practices
1. Write Readable Code
# Good: Clear variable names, comments for non-obvious choices
resource "aws_eks_cluster" "primary" {
name = var.cluster_name
role_arn = aws_iam_role.cluster_role.arn
# Enable logging for auditing requirements (GDPR)
enabled_cluster_log_types = ["api", "audit", "authenticator"]
}
# Bad: Unclear what this does
resource "aws_eks_cluster" "c" {
name = "c1"
role_arn = "arn:aws:iam::123456789:role/foo"
}
2. Validate Before Deploy
# Check syntax
terraform validate
# Format code consistently
terraform fmt -recursive
# Lint with TFLint for common errors
tflint
3. Use Data Sources for Read-Only Data
# Look up existing security group, don't create new one
data "aws_security_group" "default" {
name = "default"
}
# Use it
resource "aws_instance" "web" {
vpc_security_group_ids = [data.aws_security_group.default.id]
}
4. Output Important Values
output "cluster_endpoint" {
  value       = aws_eks_cluster.primary.endpoint
  description = "Kubernetes API endpoint"
}

output "database_url" {
  value       = aws_rds_cluster.primary.endpoint
  description = "Database endpoint (host)"
}
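Outputs that expose secrets should be marked sensitive so Terraform redacts them from CLI output. A sketch (the referenced parameter is illustrative, echoing the SSM example above):

```hcl
output "db_password" {
  value     = aws_ssm_parameter.db_password.value
  sensitive = true # redacted in terraform output and run logs
}
```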
The 90-Day Implementation Plan
Month 1:
- Set up remote state (S3 + locking)
- Terraform dev/staging environments
- Team training on Terraform workflow
- Implement code review process
Month 2:
- Migrate prod infrastructure to Terraform (non-critical first)
- Implement CI/CD pipeline for Terraform
- Create reusable modules
- Document your Terraform patterns
Month 3:
- Migrate remaining infrastructure
- Test disaster recovery (can you rebuild everything?)
- Establish governance policies
- Team becomes proficient
The Reality
Terraform takes 8-12 weeks to implement properly. The early chaos you might experience is normal and temporary. By month three, you'll wonder how you ever managed infrastructure without it.
The key is discipline: use code review, automate plan/apply, backup state, document decisions. This prevents the disasters that give Terraform a bad reputation.
Done right, Terraform becomes your source of truth. Infrastructure becomes reproducible, auditable, and safe to change.