Infrastructure as Code: Terraform in Production, No Regrets
Many organizations start Terraform projects with excitement. "Everything in code! Full reproducibility! No more manual drift!" Three months later, they're in chaos: state files out of sync, developers accidentally destroying production databases, and the entire project grinding to a halt while someone figures out how to recover.
The technology isn't at fault. Terraform is powerful and reliable. The failures come from organizational and process issues, not tooling.
Here's how to implement Terraform in production without the chaos.
Why Terraform, Not Other Tools?
A quick comparison:
| Tool | Best For | Learning Curve | State Management |
|---|---|---|---|
| Terraform | Multi-cloud, complex infrastructure | Moderate | Explicit (state file) |
| CloudFormation | AWS-only deployments | Steep | Implicit |
| Pulumi | Polyglot teams (Python, Go, etc.) | Moderate | Explicit |
| Ansible | Configuration management, simpler setup | Gentle | Implicit |
For Swiss enterprises running multi-cloud or hybrid infrastructure, Terraform is the standard choice. It works across AWS, GCP, Azure, Kubernetes, and on-premises. That flexibility matters when you're not locked into a single provider.
The Core Concepts
State File: The Ground Truth
Terraform maintains a state file that maps your code to actual resources. When you run terraform apply, Terraform:
- Reads your code (desired state)
- Reads the state file (current state)
- Compares them
- Calculates what needs to change
- Applies those changes
Critical insight: If your state file is wrong, Terraform will make wrong decisions.
This is the #1 source of Terraform disasters:
Scenario: A developer manually resizes an instance in the AWS console.
- Actual state: the instance now has 4GB RAM
- State file and code: the instance has 2GB RAM
- On the next apply, Terraform refreshes, sees the drift, and "corrects" it back to 2GB
- Result: the running instance loses 2GB RAM (and might crash)
Prevention: Never manually change infrastructure. Always use Terraform.
Remote State: Essential for Teams
Don't store state files locally. They'll get out of sync across team members.
Setup:
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
This stores state in AWS S3 with encryption and locking (prevents simultaneous changes).
Equivalent for GCP:
terraform {
  backend "gcs" {
    bucket = "my-terraform-state"
    prefix = "prod"
  }
}
Reality check: If you skip remote state, your team will have divergent infrastructure. Don't skip this.
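The DynamoDB lock table referenced in the S3 backend can itself be bootstrapped with Terraform. A minimal sketch (the table name must match the backend config, and the S3 backend requires the hash key to be named exactly LockID):

```hcl
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"   # must match dynamodb_table in the backend block
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"            # the S3 backend expects exactly this key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Bootstrap note: this table (and the state bucket) must exist before the backend can use them, so they are typically created once with local state or by hand.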
The Implementation Roadmap
Phase 1: Foundation (Weeks 1-4)
Goal: Set up basic infrastructure as code without breaking anything.
Step 1: Choose your layout
terraform/
├── main.tf (primary infrastructure)
├── variables.tf (input variables)
├── outputs.tf (what to expose)
├── terraform.tfvars (variable values)
├── prod/
│ └── terraform.tfvars (prod-specific values)
└── dev/
└── terraform.tfvars (dev-specific values)
Step 2: Start with non-critical infrastructure
Don't Terraform production on day one. Start with dev/staging environments.
# main.tf
terraform {
  backend "s3" {
    bucket = "state-bucket"
    key    = "dev/terraform.tfstate"
    region = "eu-central-1"
  }
}

provider "aws" {
  region = var.region
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  tags = {
    Name = "web-server"
  }
}
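The ami reference above assumes a data source that isn't shown. A minimal lookup for a recent Ubuntu AMI might look like this (the Canonical owner ID and name filter are assumptions to verify for your region and Ubuntu release):

```hcl
# Resolve the latest Ubuntu 22.04 AMI published by Canonical
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical's AWS account

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}
```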
Step 3: Test the workflow
- Run terraform init (initialize the working directory)
- Run terraform plan (show what would change)
- Run terraform apply (apply the changes)
- Verify in the console that the resources were created
- Modify the code slightly
- Run terraform plan again (it shows what would change)
- Run terraform destroy (clean up)
This becomes muscle memory.
Phase 2: Scaling (Weeks 5-12)
Goal: Build reusable patterns and handle multiple environments.
Modules: Reusable infrastructure components
Instead of repeating code, create modules:
# modules/kubernetes_cluster/main.tf
resource "aws_eks_cluster" "this" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = var.kubernetes_version

  vpc_config {
    subnet_ids = var.subnet_ids
  }
}

# modules/kubernetes_cluster/variables.tf
variable "cluster_name" {
  type        = string
  description = "EKS cluster name"
}

variable "kubernetes_version" {
  type    = string
  default = "1.28"
}

variable "subnet_ids" {
  type        = list(string)
  description = "Subnets for the cluster's VPC config"
}
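Input variables can also guard against bad values at plan time. A sketch using Terraform's validation block, shown as a variant of the version variable above (the allowed versions are illustrative):

```hcl
variable "kubernetes_version" {
  type    = string
  default = "1.28"

  validation {
    condition     = contains(["1.27", "1.28", "1.29"], var.kubernetes_version)
    error_message = "kubernetes_version must be one of 1.27, 1.28, 1.29."
  }
}
```

With this in place, a typo like "12.8" fails at plan time instead of reaching the cloud provider.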
Using the module:
# prod/main.tf
module "prod_cluster" {
  source             = "../modules/kubernetes_cluster"
  cluster_name       = "prod-cluster"
  kubernetes_version = "1.28"
  subnet_ids         = [aws_subnet.a.id, aws_subnet.b.id]
}
This eliminates code duplication across environments.
Environment separation
terraform/
├── modules/
│ ├── kubernetes_cluster/
│ ├── database/
│ └── monitoring/
├── prod/
│ └── main.tf (uses modules with prod values)
├── staging/
│ └── main.tf (uses same modules with staging values)
└── dev/
└── main.tf (uses same modules with dev values)
Each environment is self-contained, but shared modules ensure consistency.
Phase 3: Production Hardening (Weeks 13+)
Goal: Safe production deployments with review and validation.
Code review workflow
Developer commits Terraform code
↓
CI/CD runs: terraform plan
↓
Plan output posted to PR for review
↓
Team reviews: "Does this change look correct?"
↓
If approved: terraform apply (automated)
↓
Apply results posted to PR
Example CI/CD (GitHub Actions):
name: Terraform
on:
  pull_request:
    paths:
      - 'terraform/**'
  push:
    branches: [main]
    paths:
      - 'terraform/**'
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - run: terraform init
      - run: terraform plan -out=tfplan
      - uses: actions/upload-artifact@v3
        with:
          name: tfplan
          path: tfplan
      # Apply the reviewed, saved plan only on merge to main
      - run: terraform apply tfplan
        if: github.event_name == 'push'
Common Pitfalls and Recovery
Pitfall 1: State File Corruption
Symptom: "Terraform thinks resource X exists, but it doesn't."
Prevention:
- Remote state with encryption
- Regular backups
- Never edit state files manually
Recovery:
# If the resource is gone but Terraform still tracks it in state:
terraform state rm aws_instance.web
# The next apply will recreate it

# If the state file is corrupted:
# restore from a backup, never rebuild it by hand from the console
Pitfall 2: Accidental Destruction
Symptom: A typo in code or variable destroys your database.
Prevention:
- Run terraform plan before every apply (review what changes)
- Use the -target flag for surgical changes
- Code review for production changes
- Pre-production validation
Recovery:
# Don't panic: terraform destroy asks for confirmation,
# and terraform plan shows pending deletions first.
# Read that output carefully before confirming.

# If something really was deleted:
# restore from backups (this is where disaster recovery saves you)
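Another guardrail worth adding: the lifecycle meta-argument can make Terraform refuse to destroy a critical resource outright. A sketch (the resource and its arguments are illustrative):

```hcl
resource "aws_rds_cluster" "primary" {
  cluster_identifier = "prod-db"
  engine             = "aurora-postgresql"

  lifecycle {
    prevent_destroy = true # any plan that would destroy this resource fails
  }
}
```

A typo that would delete the database now produces a plan error instead of a deleted database.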
Pitfall 3: State Drift
Symptom: Someone manually changed infrastructure in the console. Terraform doesn't know about it.
Prevention:
- Policy: "Never manually change infrastructure"
- Audit logging on cloud changes
- Regular terraform plan runs to detect drift
Detection:
terraform plan
# If it shows changes not in your code, you have drift.
# Figure out what happened and update code.
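If a manually created resource should be kept and managed, Terraform 1.5+ can adopt it into state declaratively with an import block (the instance ID here is a placeholder):

```hcl
# Adopt an existing, unmanaged instance into Terraform state
import {
  to = aws_instance.web
  id = "i-0123456789abcdef0" # placeholder: the real instance ID
}
```

On the next plan, Terraform imports the resource instead of trying to create a duplicate; it can even generate matching configuration with plan's -generate-config-out flag.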
Pitfall 4: Credential Leaks
Symptom: Database passwords and API keys end up in your Terraform code.
Prevention:
# WRONG: never do this
variable "db_password" {
  default = "superSecure123" # BAD: secret committed to version control
}

# RIGHT: declare the variable without a default and inject it from outside
variable "db_password" {
  type      = string
  sensitive = true
}

resource "aws_ssm_parameter" "db_password" {
  name  = "/prod/db/password"
  type  = "SecureString"
  value = var.db_password # injected from environment
}

# Inject via environment: TF_VAR_db_password=xxx terraform apply
Better: Use AWS Secrets Manager, HashiCorp Vault, or similar.
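Reading the secret at apply time keeps it out of code entirely. A sketch using Secrets Manager (the secret name is an assumption; note the value still lands in the state file, one more reason to encrypt state):

```hcl
# Fetch the current secret value at plan/apply time
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "prod/db/password"
}

# Reference it where needed, for example:
# password = data.aws_secretsmanager_secret_version.db.secret_string
```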
Terraform Best Practices
1. Write Readable Code
# Good: Clear variable names, comments for non-obvious choices
resource "aws_eks_cluster" "primary" {
name = var.cluster_name
role_arn = aws_iam_role.cluster_role.arn
# Enable logging for auditing requirements (GDPR)
enabled_cluster_log_types = ["api", "audit", "authenticator"]
}
# Bad: Unclear what this does
resource "aws_eks_cluster" "c" {
name = "c1"
role_arn = "arn:aws:iam::123456789:role/foo"
}
2. Validate Before Deploy
# Check syntax
terraform validate
# Format code consistently
terraform fmt -recursive
# Lint with TFLint for common errors
tflint
3. Use Data Sources for Read-Only Data
# Look up existing security group, don't create new one
data "aws_security_group" "default" {
name = "default"
}
# Use it
resource "aws_instance" "web" {
vpc_security_group_ids = [data.aws_security_group.default.id]
}
4. Output Important Values
output "cluster_endpoint" {
  value       = aws_eks_cluster.primary.endpoint
  description = "Kubernetes API endpoint"
}

output "database_url" {
  value       = aws_rds_cluster.primary.endpoint
  description = "Database endpoint (host)"
}
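Outputs that expose secrets should be marked sensitive so Terraform redacts them from CLI output. A sketch (the referenced parameter is illustrative, echoing the SSM example above):

```hcl
output "db_password" {
  value     = aws_ssm_parameter.db_password.value
  sensitive = true # redacted in terraform output and run logs
}
```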
The 90-Day Implementation Plan
Month 1:
- Set up remote state (S3 + locking)
- Terraform dev/staging environments
- Team training on Terraform workflow
- Implement code review process
Month 2:
- Migrate prod infrastructure to Terraform (non-critical first)
- Implement CI/CD pipeline for Terraform
- Create reusable modules
- Document your Terraform patterns
Month 3:
- Migrate remaining infrastructure
- Test disaster recovery (can you rebuild everything?)
- Establish governance policies
- Team becomes proficient
The Reality
Terraform takes 8-12 weeks to implement properly. The early chaos you might experience is normal and temporary. By month three, you'll wonder how you ever managed infrastructure without it.
The key is discipline: use code review, automate plan/apply, backup state, document decisions. This prevents the disasters that give Terraform a bad reputation.
Done right, Terraform becomes your source of truth. Infrastructure becomes reproducible, auditable, and safe to change.