
Infrastructure as Code

Infrastructure as Code (IaC) manages and provisions cloud resources through machine-readable definition files rather than manual console clicks or ad-hoc scripts. IaC enables version-controlled, peer-reviewed, repeatable infrastructure with the same rigor applied to application code.

Overview

Infrastructure as Code is the practice of managing infrastructure through declarative or imperative code rather than manual processes. Before IaC, provisioning a production environment involved clicking through cloud consoles, running ad-hoc shell scripts, and maintaining runbooks that drifted from reality within weeks. IaC eliminates this by codifying infrastructure in files that can be version-controlled, peer-reviewed, tested, and applied consistently across environments.

The declarative paradigm dominates modern IaC. Tools like Terraform (HashiCorp), CloudFormation (AWS), and Pulumi describe the desired end state of infrastructure: 'I want a VPC with 3 subnets, an RDS PostgreSQL instance, and an EKS cluster with 5 nodes.' The tool compares the desired state with the current state (recorded in a state file or read from the cloud provider's API), computes a diff (the 'plan'), and applies only the necessary changes (creating, updating, or deleting resources). This is analogous to the Kubernetes reconciliation loop, except that most IaC tools reconcile only when a plan or apply is run rather than continuously.
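The compare-and-diff step can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not how Terraform or any other tool is implemented: `compute_plan`, the `Change` type, and the resource addresses are hypothetical names, and real tools also handle dependency ordering, forced replacement, and provider-specific behavior.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    action: str   # "create", "update", or "delete"
    address: str  # hypothetical resource address, e.g. "vpc.main"


def compute_plan(desired: dict[str, dict], current: dict[str, dict]) -> list[Change]:
    """Diff the desired configuration against the recorded state."""
    plan = []
    for address, attrs in desired.items():
        if address not in current:
            plan.append(Change("create", address))   # declared but not yet provisioned
        elif current[address] != attrs:
            plan.append(Change("update", address))   # exists, but attributes differ
    for address in current:
        if address not in desired:
            plan.append(Change("delete", address))   # provisioned but no longer declared
    return sorted(plan, key=lambda change: change.address)


# Illustrative data only: attribute names are made up for this sketch.
desired = {
    "vpc.main": {"cidr": "10.0.0.0/16"},
    "db.primary": {"engine": "postgres", "encrypted": True},
}
current = {
    "db.primary": {"engine": "postgres", "encrypted": False},
    "bucket.legacy": {"acl": "private"},
}

for change in compute_plan(desired, current):
    print(f"{change.action:>6}  {change.address}")
```

With this sample data the plan contains one create, one update, and one delete. Resources whose desired and recorded attributes already match produce no change, which is what makes repeated applies idempotent.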

Terraform is the most widely adopted IaC tool, using HashiCorp Configuration Language (HCL) to define resources across 3,000+ providers (AWS, GCP, Azure, Kubernetes, GitHub, Datadog, PagerDuty). Its provider model means a single tool and workflow can manage multi-cloud and SaaS infrastructure. Terraform's state file (stored in S3, GCS, or Terraform Cloud) tracks the mapping between code and real resources, enabling accurate diffs and safe deletions. Pulumi and AWS CDK take a different approach, allowing infrastructure to be defined in general-purpose languages (TypeScript, Python, Go, C#). This gives authors ordinary loops, conditionals, and static type checking, which HCL supports only through more limited constructs such as count, for_each, and conditional expressions.

The main operational challenge of IaC is state management. Terraform's state file is the single source of truth for what Terraform manages: if it is lost, corrupted, or diverges from reality (due to out-of-band changes), subsequent operations become dangerous. Best practices include remote state backends with locking (S3 + DynamoDB), state encryption, and import commands ('terraform import') to bring existing or drifted resources back under management. Organizations also implement policy-as-code (Open Policy Agent, Sentinel, Checkov) to enforce guardrails: 'no public S3 buckets', 'all databases must be encrypted', 'instances must use approved AMIs.'
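The three guardrails quoted above can be expressed as small functions that inspect planned resources before anything is applied. The sketch below is plain Python for illustration only; it is not the rule syntax of OPA, Sentinel, or Checkov, and the resource shape, attribute names, and AMI IDs are assumptions made up for the example.

```python
from typing import Callable, Optional

# Hypothetical, simplified resource shape: a dict with "type", "name", and attributes.
Resource = dict
Policy = Callable[[Resource], Optional[str]]

APPROVED_AMIS = {"ami-0example1", "ami-0example2"}  # placeholder IDs, not real AMIs


def no_public_buckets(resource: Resource) -> Optional[str]:
    if resource["type"] == "s3_bucket" and resource.get("public", False):
        return f"{resource['name']}: S3 bucket must not be public"
    return None


def databases_encrypted(resource: Resource) -> Optional[str]:
    if resource["type"] == "database" and not resource.get("encrypted", False):
        return f"{resource['name']}: database must be encrypted"
    return None


def approved_amis_only(resource: Resource) -> Optional[str]:
    if resource["type"] == "instance" and resource.get("ami") not in APPROVED_AMIS:
        return f"{resource['name']}: instance must use an approved AMI"
    return None


POLICIES = [no_public_buckets, databases_encrypted, approved_amis_only]


def evaluate(resources: list, policies: list = POLICIES) -> list:
    """Return one message per policy violation; an empty list means the plan passes."""
    violations = []
    for resource in resources:
        for policy in policies:
            message = policy(resource)
            if message is not None:
                violations.append(message)
    return violations


planned = [
    {"type": "s3_bucket", "name": "public-assets", "public": True},
    {"type": "database", "name": "orders-db", "encrypted": True},
    {"type": "instance", "name": "web-1", "ami": "ami-unknown"},
]

for violation in evaluate(planned):
    print("DENY", violation)
```

In a pipeline, a non-empty result would typically fail the check on the pull request so the change never reaches apply.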

Key Points
  • Declarative IaC (Terraform, CloudFormation) specifies desired state; the tool computes and applies the minimal diff. Imperative or procedural approaches (shell scripts and, to a degree, Ansible playbooks) specify ordered steps, which are harder to reason about and are idempotent only if each step is written to be.
  • State management is the most critical operational concern. Terraform stores resource mappings in a state file. Remote backends (S3 + DynamoDB locking) prevent concurrent modifications. State loss means Terraform does not know what exists and may duplicate or orphan resources.
  • Modules enable reusable infrastructure patterns. A 'VPC module' encapsulating subnets, route tables, NAT gateways, and security groups can be parameterized and used across teams, ensuring consistent networking patterns without copy-paste.
  • Plan-then-apply workflow enables safe changes: 'terraform plan' shows what will be created, modified, or destroyed before any mutation. CI/CD pipelines post the plan output on pull requests for human review.
  • Drift detection identifies manual changes made outside IaC (console clicks, ad-hoc CLI commands). Regular drift detection runs compare actual cloud state to the state file and flag discrepancies for reconciliation.
  • Policy-as-code tools (OPA/Rego, Sentinel, Checkov) validate IaC before apply: preventing public S3 buckets, enforcing encryption, restricting instance types. This shifts security left into the development workflow.
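The drift detection point in the list above amounts to comparing what the state file recorded with what the cloud API reports now. A minimal Python sketch, assuming both sides have already been loaded into dictionaries keyed by a hypothetical resource address (`detect_drift` and the attribute names are illustrative, not part of any tool):

```python
from __future__ import annotations


def detect_drift(recorded: dict[str, dict], actual: dict[str, dict]) -> dict[str, dict]:
    """Map each drifted resource to {attribute: (recorded_value, actual_value)}."""
    drift = {}
    for address, recorded_attrs in recorded.items():
        actual_attrs = actual.get(address)
        if actual_attrs is None:
            # The resource was deleted outside of IaC.
            drift[address] = {"_resource": ("present", "missing")}
            continue
        changed = {
            key: (recorded_attrs.get(key), actual_attrs.get(key))
            for key in sorted(recorded_attrs.keys() | actual_attrs.keys())
            if recorded_attrs.get(key) != actual_attrs.get(key)
        }
        if changed:
            drift[address] = changed
    return drift


# Illustrative data: someone opened port 22 on a security group via the console.
recorded_state = {
    "security_group.web": {"ingress_ports": [443]},
    "bucket.assets": {"versioning": True},
}
actual_cloud = {
    "security_group.web": {"ingress_ports": [443, 22]},
    "bucket.assets": {"versioning": True},
}

for address, changes in detect_drift(recorded_state, actual_cloud).items():
    for attribute, (was, now) in changes.items():
        print(f"{address}.{attribute}: state file has {was!r}, cloud has {now!r}")
```

This sketch only covers resources the state file already tracks. Resources created entirely outside IaC do not appear in the state file at all, so finding them requires a separate inventory of the account.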
Simple Example

Blueprint Analogy

Infrastructure as Code is like building a house from blueprints rather than verbal instructions. Without blueprints, every house is built differently depending on who is on site that day: walls end up in the wrong place, and rebuilding after a fire requires guessing how things were originally constructed. With blueprints (IaC), every house built from the same plans is identical. You can peer-review the blueprints before construction begins ('terraform plan' output in a pull request), track every revision (git history), and rebuild exactly the same house anywhere (disaster recovery). If you need 10 identical houses (10 environments), you use the same blueprints with different addresses (variables). Terraform's plan command is like a contractor walking through the blueprints saying 'I will add this wall, remove that door, and leave this window unchanged' before touching anything.

Real-World Examples

Shopify

Shopify manages their entire cloud infrastructure (GCP and some AWS) via Terraform, with over 100,000 resources defined in code. Their platform engineering team maintains a library of shared Terraform modules for common patterns (GKE clusters, Cloud SQL instances, VPCs). All infrastructure changes go through pull requests with automated 'terraform plan' output, security policy checks (Checkov), and cost estimation before human approval.

Deliveroo

Deliveroo uses Terraform to manage 500+ AWS accounts across their organization. They built a custom Terraform wrapper (terraform-scaffold) that enforces directory structure, remote state configuration, and module versioning. A single PR can provision an entire new service environment (VPC, ECS cluster, RDS database, ALB, Route53 records) in under 10 minutes, compared to the hours of manual provisioning required previously.

Twilio

Twilio uses Pulumi (TypeScript) for infrastructure provisioning, choosing it over Terraform for the ability to use loops, conditionals, and type checking in a familiar programming language. Their infrastructure team builds reusable Pulumi component resources that encapsulate company standards (encryption requirements, tagging policies, networking patterns). TypeScript's type system catches misconfigured resources at compile time rather than at plan or apply time.

Trade-Offs
| Aspect | Description |
| --- | --- |
| Declarative DSL (Terraform/HCL) vs. General-Purpose Languages (Pulumi/CDK) | Declarative HCL is simpler to learn, inherently idempotent, and has the largest ecosystem (3,000+ providers). But HCL offers only limited iteration and conditional constructs (count, for_each, dynamic blocks, conditional expressions), which become clumsy for complex logic, and it has weaker type checking. Pulumi/CDK use real programming languages (TypeScript, Python), enabling full expressiveness and IDE support, but require developers to understand both infrastructure and software engineering patterns. |
| Self-Managed State File vs. Service-Managed State | Terraform's state file enables accurate diffs and resource tracking but is a single point of failure. CloudFormation and Pulumi Cloud manage state server-side, reducing operational burden but adding vendor dependency. Crossplane (Kubernetes-native IaC) uses the Kubernetes API server as its state store, relying on etcd's HA and backup capabilities. |
| Monorepo vs. Polyrepo for IaC | A single infrastructure monorepo enables cross-cutting changes and consistent tooling but creates merge conflicts and a larger blast radius (a bad commit can affect all environments). Splitting IaC into per-service or per-team repos reduces blast radius and enables independent release cycles but creates drift in module versions and duplicated configuration. |
Case Study

Segment's Terraform Migration for Multi-Account AWS

Scenario

Segment (now part of Twilio) operated in a single AWS account with infrastructure managed through a mix of CloudFormation, Ansible, and manual console configuration. As the company grew to 100+ engineers, the single-account model created security concerns (blast radius, IAM complexity) and the mixed tooling made it impossible to audit or reproduce infrastructure reliably. Deploying a new service required 2-3 days of manual infrastructure setup.

Solution

Segment migrated to a multi-account AWS architecture managed entirely by Terraform. They created an account-vending-machine: a Terraform module that provisions a new AWS account with standardized VPC layout, IAM roles, CloudTrail logging, GuardDuty monitoring, and SSO integration in a single 'terraform apply.' Shared modules for common patterns (ECS services, RDS databases, S3 buckets) were published to an internal Terraform registry with semantic versioning. All changes required PR review with automated plan output and Sentinel policy checks.

Outcome

New service provisioning dropped from 2-3 days to 30 minutes (single 'terraform apply'). The team manages 40+ AWS accounts from a single Terraform codebase. Infrastructure drift was reduced by 95% because manual console changes are detected and flagged. Security audit time decreased from weeks to hours because every resource is traceable to a git commit. The migration also enabled Segment to achieve SOC 2 Type II certification, as auditors could review infrastructure changes through git history.

Common Mistakes
  • Storing Terraform state locally. Local state files cannot be shared across team members, have no locking (concurrent applies corrupt state), and are easily lost. Always use a remote backend (S3 + DynamoDB, GCS, Terraform Cloud) with encryption and state locking from day one.
  • Making manual changes alongside IaC. Clicking through the AWS console to 'quickly fix' a security group creates drift between code and reality. The next 'terraform apply' may revert the manual change or fail unexpectedly. All changes must go through IaC, enforced by read-only console access in production accounts.
  • Monolithic Terraform configurations. Putting all infrastructure in a single state file means every 'terraform plan' refreshes every resource (slow, rate-limited) and every apply risks unrelated resources. Split state by lifecycle boundary: networking, databases, compute, and application-level resources should be separate state files.
  • Hardcoding values instead of using variables and modules. Copy-pasting resource blocks across environments leads to configuration drift. Use modules for reusable patterns and workspaces or directory structures for environment-specific configuration (dev, staging, prod).
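The last point, one reusable module plus per-environment variables instead of copy-pasted blocks, can be illustrated with a small Python sketch. `vpc_module`, the resource addresses, and the CIDR ranges are hypothetical. In Terraform the equivalent would be a module called with different input variables, and in Pulumi or CDK a function or component much like this one.

```python
from __future__ import annotations

import ipaddress
from itertools import islice


def vpc_module(env: str, cidr_block: str, az_count: int) -> dict[str, dict]:
    """Hypothetical 'module': build the resource definitions for one environment."""
    network = ipaddress.ip_network(cidr_block)
    resources = {
        f"vpc.{env}": {"cidr_block": cidr_block, "tags": {"Environment": env}},
    }
    # Carve one /24 subnet per availability zone out of the VPC range.
    for index, subnet in enumerate(islice(network.subnets(new_prefix=24), az_count)):
        resources[f"subnet.{env}-{index}"] = {
            "vpc": f"vpc.{env}",
            "cidr_block": str(subnet),
        }
    return resources


# Only the variables differ between environments; the structure is shared.
ENVIRONMENTS = {
    "dev": {"cidr_block": "10.0.0.0/16", "az_count": 1},
    "staging": {"cidr_block": "10.1.0.0/16", "az_count": 2},
    "prod": {"cidr_block": "10.2.0.0/16", "az_count": 3},
}

stacks = {env: vpc_module(env, **variables) for env, variables in ENVIRONMENTS.items()}

for env, resources in stacks.items():
    print(env, sorted(resources))
```

Because every environment goes through the same function, a fix to the module reaches dev, staging, and prod together instead of depending on someone remembering to update three copies.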
Test Your Understanding

1. What is the primary purpose of Terraform's state file?

2. Why is making infrastructure changes via the cloud console considered an antipattern when using IaC?