
Infrastructure as Code: Terraform in Production, No Regrets

Jean-Luc Dubouchet · 15 June 2023


Many organizations start Terraform projects with excitement. "Everything in code! Full reproducibility! No more manual drift!" Three months later, they're in chaos: state files out of sync, developers accidentally destroying production databases, and the entire project grinding to a halt while someone figures out how to recover.

The technology isn't at fault. Terraform is powerful and reliable. The failures come from organizational and process issues, not tooling.

Here's how to implement Terraform in production without the chaos.

Why Terraform, Not Other Tools?

A quick comparison:

Tool            Best For                                  Learning Curve  State Management
Terraform       Multi-cloud, complex infrastructure       Moderate        Explicit (state file)
CloudFormation  AWS-only deployments                      Steep           Implicit
Pulumi          Polyglot teams (Python, Go, etc.)         Moderate        Explicit
Ansible         Configuration management, simpler setups  Gentle          Implicit

For Swiss enterprises running multi-cloud or hybrid infrastructure, Terraform is the standard choice. It works across AWS, GCP, Azure, Kubernetes, and on-premises. That flexibility matters when you're not locked into a single provider.

The Core Concepts

State File: The Ground Truth

Terraform maintains a state file that maps your code to actual resources. When you run terraform apply, Terraform:

  1. Reads your code (desired state)
  2. Reads the state file (current state)
  3. Compares them
  4. Calculates what needs to change
  5. Applies those changes

Critical insight: If your state file is wrong, Terraform will make wrong decisions.

This is the #1 source of Terraform disasters:

Scenario: A developer manually upgrades an instance in the AWS console.

Actual state: instance has 4 GB RAM (manual change)
State file:   instance has 2 GB RAM
Code:         still specifies 2 GB RAM
Next apply:   Terraform reverts the instance to match the code
Result:       the running instance loses 2 GB RAM (and might crash)

Prevention: Never manually change infrastructure. Always use Terraform.
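When a manual change has already produced a resource you want to keep, bring it under Terraform's control rather than recreating it. Terraform 1.5+ supports declarative import blocks (the resource address and instance ID below are hypothetical):

```hcl
# Maps an existing EC2 instance to an address in state;
# the next terraform apply records it without modifying the instance.
import {
  to = aws_instance.web
  id = "i-0abc123def4567890"
}
```

On older Terraform versions, the CLI equivalent is `terraform import aws_instance.web <instance-id>`.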

Remote State: Essential for Teams

Don't store state files locally. They'll get out of sync across team members.

Setup:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

This stores state in AWS S3 with encryption and locking (prevents simultaneous changes).
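The backend bucket and lock table themselves must exist before terraform init can use them. A common pattern is a tiny bootstrap configuration, applied once with local state. This is a sketch, not a hardened setup; the resource names match the backend block above:

```hcl
# bootstrap/main.tf — applied once with local state
resource "aws_s3_bucket" "state" {
  bucket = "my-terraform-state"
}

# Versioning lets you recover earlier state revisions
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# The S3 backend's locking requires a table with a "LockID" string hash key
resource "aws_dynamodb_table" "locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```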

Equivalent for GCP:

terraform {
  backend "gcs" {
    bucket = "my-terraform-state"
    prefix = "prod"
  }
}

Reality check: If you skip remote state, your team will have divergent infrastructure. Don't skip this.

The Implementation Roadmap

Phase 1: Foundation (Weeks 1-4)

Goal: Set up basic infrastructure as code without breaking anything.

Step 1: Choose your layout

terraform/
├── main.tf           (primary infrastructure)
├── variables.tf      (input variables)
├── outputs.tf        (what to expose)
├── terraform.tfvars  (variable values)
├── prod/
│   └── terraform.tfvars  (prod-specific values)
└── dev/
    └── terraform.tfvars  (dev-specific values)

Step 2: Start with non-critical infrastructure

Don't Terraform production on day one. Start with dev/staging environments.

# main.tf
terraform {
  backend "s3" {
    bucket = "state-bucket"
    key    = "dev/terraform.tfstate"
    region = "eu-central-1"
  }
}

provider "aws" {
  region = var.region
}

# Look up the latest Ubuntu 22.04 AMI published by Canonical
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical's AWS account ID

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type
  tags = {
    Name = "web-server"
  }
}

Step 3: Test the workflow

  1. Run terraform init (initialize working directory)
  2. Run terraform plan (show what would change)
  3. Run terraform apply (apply changes)
  4. Verify in console that resources were created
  5. Modify code slightly
  6. Run terraform plan again (shows what would change)
  7. Run terraform destroy (cleanup)

This becomes muscle memory.

Phase 2: Scaling (Weeks 5-12)

Goal: Build reusable patterns and handle multiple environments.

Modules: Reusable infrastructure components

Instead of repeating code, create modules:

# modules/kubernetes_cluster/main.tf
# (aws_iam_role.cluster, defined elsewhere in this module, is the EKS service role)
resource "aws_eks_cluster" "this" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = var.kubernetes_version

  vpc_config {
    subnet_ids = var.subnet_ids
  }
}

# modules/kubernetes_cluster/variables.tf
variable "cluster_name" {
  type        = string
  description = "EKS cluster name"
}

variable "kubernetes_version" {
  type    = string
  default = "1.28"
}

variable "subnet_ids" {
  type        = list(string)
  description = "Subnets for the EKS control plane"
}

Using the module:

# prod/main.tf
module "prod_cluster" {
  source = "../modules/kubernetes_cluster"

  cluster_name       = "prod-cluster"
  kubernetes_version = "1.28"
  subnet_ids         = [aws_subnet.a.id, aws_subnet.b.id]
}

This eliminates code duplication across environments.

Environment separation

terraform/
├── modules/
│   ├── kubernetes_cluster/
│   ├── database/
│   └── monitoring/
├── prod/
│   └── main.tf (uses modules with prod values)
├── staging/
│   └── main.tf (uses same modules with staging values)
└── dev/
    └── main.tf (uses same modules with dev values)

Each environment is self-contained, but shared modules ensure consistency.
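In practice, each environment's main.tf is little more than the module call with its own values. A staging sketch (names are illustrative):

```hcl
# staging/main.tf
module "staging_cluster" {
  source = "../modules/kubernetes_cluster"

  cluster_name       = "staging-cluster"
  kubernetes_version = "1.28"
  subnet_ids         = [aws_subnet.staging_a.id, aws_subnet.staging_b.id]
}
```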

Phase 3: Production Hardening (Weeks 13+)

Goal: Safe production deployments with review and validation.

Code review workflow

Developer commits Terraform code
  ↓
CI/CD runs: terraform plan
  ↓
Plan output posted to PR for review
  ↓
Team reviews: "Does this change look correct?"
  ↓
If approved: terraform apply (automated)
  ↓
Apply results posted to PR

Example CI/CD (GitHub Actions):

name: Terraform

on:
  pull_request:
    paths:
      - 'terraform/**'
  push:
    branches: [main]
    paths:
      - 'terraform/**'

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - run: terraform init
      - run: terraform plan -out=tfplan
      - uses: actions/upload-artifact@v3
        with:
          name: tfplan
          path: tfplan
      # Apply only after merge to main, using the plan produced above
      - run: terraform apply -input=false tfplan
        if: github.ref == 'refs/heads/main'

Common Pitfalls and Recovery

Pitfall 1: State File Corruption

Symptom: "Terraform thinks resource X exists, but it doesn't."

Prevention:

  • Remote state with encryption
  • Regular backups
  • Never edit state files manually

Recovery:

# If the resource is gone but Terraform still tracks it:
terraform state rm aws_instance.web
# Terraform will recreate it on the next apply

# If the resource exists but Terraform doesn't track it,
# import it instead of recreating (address and ID are examples):
terraform import aws_instance.web i-0abc123def4567890

# If the state file is corrupted:
# restore it from a backup, never rebuild it from the console

Pitfall 2: Accidental Destruction

Symptom: A typo in code or variable destroys your database.

Prevention:

  • Use terraform plan before every apply (review what changes)
  • Use -target flag for surgical changes
  • Code review for production changes
  • Pre-production validation
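Terraform also has a built-in guard for this: a lifecycle block with prevent_destroy makes any plan that would destroy the resource fail outright (the resource shown is illustrative):

```hcl
resource "aws_rds_cluster" "primary" {
  # ... cluster configuration ...

  lifecycle {
    # Any plan that would destroy this resource is rejected with an error
    prevent_destroy = true
  }
}
```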

Recovery:

# Don't panic. Terraform destroy is not instant.
# Usually terraform plan shows you first.
# Read the output carefully.

# If you accidentally deleted something:
# Restore from backups (disaster recovery saves you here)

Pitfall 3: State Drift

Symptom: Someone manually changed infrastructure in the console. Terraform doesn't know about it.

Prevention:

  • Policy: "Never manually change infrastructure"
  • Audit logging on cloud changes
  • Regular terraform plan to detect drift

Detection:

terraform plan
# Changes you didn't make in code indicate drift.

# Terraform 0.15.4+ can report drift without proposing changes:
terraform plan -refresh-only
# Figure out what happened, then update the code or revert the console change.

Pitfall 4: Credential Leaks

Symptom: Database passwords and API keys end up in your Terraform code.

Prevention:

# WRONG: Never hardcode secrets
variable "db_password" {
  default = "superSecure123"  # BAD: ends up in version control
}

# RIGHT: Declare the variable without a value and mark it sensitive
variable "db_password" {
  type      = string
  sensitive = true  # hides the value in plan/apply output
}

resource "aws_ssm_parameter" "db_password" {
  name  = "/prod/db/password"
  type  = "SecureString"
  value = var.db_password  # injected from the environment
}

# Inject via environment: TF_VAR_db_password=xxx terraform apply
# Note: the value still lands in the state file, so encrypt and restrict state access.

Better: Use AWS Secrets Manager, HashiCorp Vault, or similar.
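With AWS Secrets Manager, for example, the secret never appears in code at all; Terraform reads it at plan time through a data source (the secret name below is hypothetical):

```hcl
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/password"
}

# Reference it where needed, e.g.:
# password = data.aws_secretsmanager_secret_version.db_password.secret_string
# Note: values read this way are still persisted in the state file.
```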

Terraform Best Practices

1. Write Readable Code

# Good: Clear variable names, comments for non-obvious choices
resource "aws_eks_cluster" "primary" {
  name            = var.cluster_name
  role_arn        = aws_iam_role.cluster_role.arn
  # Enable logging for auditing requirements (GDPR)
  enabled_cluster_log_types = ["api", "audit", "authenticator"]
}

# Bad: Unclear what this does
resource "aws_eks_cluster" "c" {
  name = "c1"
  role_arn = "arn:aws:iam::123456789:role/foo"
}

2. Validate Before Deploy

# Check syntax
terraform validate

# Format code consistently
terraform fmt -recursive

# Lint with TFLint for common errors
tflint

3. Use Data Sources for Read-Only Data

# Look up existing security group, don't create new one
data "aws_security_group" "default" {
  name = "default"
}

# Use it
resource "aws_instance" "web" {
  vpc_security_group_ids = [data.aws_security_group.default.id]
}

4. Output Important Values

output "cluster_endpoint" {
  value       = aws_eks_cluster.primary.endpoint
  description = "Kubernetes API endpoint"
}

output "database_url" {
  value       = aws_rds_cluster.primary.endpoint
  description = "Database connection string"
}

The 90-Day Implementation Plan

Month 1:

  • Set up remote state (S3 + locking)
  • Terraform dev/staging environments
  • Team training on Terraform workflow
  • Implement code review process

Month 2:

  • Migrate prod infrastructure to Terraform (non-critical first)
  • Implement CI/CD pipeline for Terraform
  • Create reusable modules
  • Document your Terraform patterns

Month 3:

  • Migrate remaining infrastructure
  • Test disaster recovery (can you rebuild everything?)
  • Establish governance policies
  • Team becomes proficient

The Reality

Terraform takes 8-12 weeks to implement properly. The early chaos you might experience is normal and temporary. By month three, you'll wonder how you ever managed infrastructure without it.

The key is discipline: use code review, automate plan/apply, backup state, document decisions. This prevents the disasters that give Terraform a bad reputation.

Done right, Terraform becomes your source of truth. Infrastructure becomes reproducible, auditable, and safe to change.


Found this helpful? See how Hidora can help: Professional Services · Managed Services · SLA Expert
