
Infrastructure as Code

All Auraison cloud infrastructure must be provisioned and configured through code -- never through manual dashboard interactions. This is a non-negotiable engineering standard.

Central infra/, not per-plane

Infrastructure lives in a single top-level infra/ directory, not scattered across planes. Cloud resources don't respect plane boundaries -- an R2 bucket is a management-plane concern that serves the data plane and gets embedded in control-plane docs. DNS and Access policies span everything. Forcing Terraform under data-plane/infra/ or control-plane/infra/ means picking an arbitrary home for resources that don't belong to one plane.

Terraform state files should group by blast radius, not by org chart. One terraform apply on the Cloudflare stack touches R2 + DNS + Access together -- that's the right atomic unit. Splitting state across planes creates cross-stack references and dependency hell.

Kubernetes manifests are cluster-scoped. The KubeRay operator, RayCluster CRs, and GPU environments apply to the whole cluster, not one plane.

infra/
  terraform/
    modules/              Reusable, single-purpose (one R2 bucket, one DNS zone)
    stacks/cloudflare/    Root module -- this is where terraform plan/apply runs
    environments/         Per-environment tfvars + backend config (prod, staging)
  k8s/                    Cluster-wide Kubernetes manifests
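The environments/ directory pairs a per-environment tfvars file with a partial backend config, assuming an S3-compatible state backend. A minimal sketch -- the bucket name and key are placeholders, not actual repo contents:

# infra/terraform/stacks/cloudflare/backend.tf
terraform {
  backend "s3" {
    # bucket and key are supplied per environment, e.g.
    # terraform init -backend-config=../../environments/prod.s3.tfbackend
  }
}

# infra/terraform/environments/prod.s3.tfbackend (placeholder values)
bucket = "auraison-terraform-state"
key    = "cloudflare/prod.tfstate"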

The rule: if it provisions shared cloud resources, it goes in infra/. If it's local dev tooling (Docker Compose, dev containers), it stays in-plane.

The principle

Infrastructure as Code (IaC) means every cloud resource -- buckets, DNS records, access policies, CORS rules, API tokens -- is defined in version-controlled configuration files and applied through deterministic tooling. The infrastructure state is reproducible, reviewable, and auditable.

| Property | Manual (dashboard) | IaC |
| --- | --- | --- |
| Reproducible | No -- depends on screenshots and memory | Yes -- terraform apply from any machine |
| Auditable | Cloudflare audit log only | Git history + PR reviews |
| Reviewable | After the fact | Before deployment |
| Recoverable | Recreate from memory | Re-apply from code |
| Documented | Tribal knowledge or blog posts | The code is the documentation |

Mandated tooling

| Layer | Tool | Scope |
| --- | --- | --- |
| Cloud resources | Terraform + Cloudflare provider | R2 buckets, DNS, Access policies, CORS, Workers |
| Kubernetes | Helm + ArgoCD | KubeRay operator, RayCluster CRs, GPU environments |
| Secrets | 1Password CLI + op inject | API keys, S3 credentials, database DSNs |
| Container images | Dockerfiles in-repo | lakehouse.dev.gpu, lakehouse.dev.cpu |
| Database schema | DuckLake migrations in catalog.py | Catalog tables, schema evolution |
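For the cloud-resources layer, the Cloudflare provider would be pinned in the stack root module. A minimal sketch -- the file name and version constraint are illustrative, not a statement of what the repo currently pins:

# infra/terraform/stacks/cloudflare/versions.tf (illustrative pin)
terraform {
  required_version = ">= 1.5"

  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 4.0"
    }
  }
}

provider "cloudflare" {
  # No credentials in the repo: the provider reads CLOUDFLARE_API_TOKEN from
  # the environment, which the 1Password CLI (op inject / op run) can supply.
}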

Case study: Cloudflare R2 bucket (the wrong way)

When setting up the auraison-artifacts R2 bucket for public Rerun recording access, we provisioned everything manually -- through MCP tool calls and the Cloudflare dashboard:

  1. Created the R2 bucket via MCP tool call
  2. Connected custom domain artifacts.aegeanai.com via dashboard
  3. Configured CORS policy via dashboard
  4. Created an Access application via dashboard
  5. Added a Bypass policy via dashboard
  6. Created API tokens via dashboard

This worked, and is documented as a tutorial. But it has serious problems:

  • Not reproducible. If we need a second artifacts bucket (staging, another region), we repeat 6 manual steps and hope we remember every setting.
  • Not reviewable. No one reviewed the CORS policy or Access bypass before it went live.
  • Not recoverable. If someone deletes the bucket or changes the CORS policy, we have no source of truth to restore from.
  • Configuration drift. Dashboard changes leave no trace in version control. Six months from now, no one will know why the bypass policy exists or whether the CORS origins are still correct.

How it should have been done

The entire R2 setup should be a Terraform module:

# infra/terraform/modules/r2-public-artifacts/main.tf

resource "cloudflare_r2_bucket" "artifacts" {
account_id = var.cloudflare_account_id
name = "auraison-artifacts"
location = "ENAM"
}

resource "cloudflare_record" "artifacts" {
zone_id = var.zone_id
name = "artifacts"
content = cloudflare_r2_bucket.artifacts.id
type = "CNAME"
proxied = true
}

resource "cloudflare_r2_bucket_cors" "artifacts" {
account_id = var.cloudflare_account_id
bucket_id = cloudflare_r2_bucket.artifacts.id

cors_rules = [{
allowed_origins = ["https://app.rerun.io"]
allowed_methods = ["GET", "HEAD"]
allowed_headers = ["*"]
expose_headers = ["Content-Length", "Content-Type"]
max_age_seconds = 86400
}]
}

resource "cloudflare_zero_trust_access_application" "artifacts" {
account_id = var.cloudflare_account_id
name = "artifacts"
domain = "artifacts.aegeanai.com"
type = "self_hosted"
}

resource "cloudflare_zero_trust_access_policy" "artifacts_bypass" {
account_id = var.cloudflare_account_id
application_id = cloudflare_zero_trust_access_application.artifacts.id
name = "public-artifacts-bypass"
decision = "bypass"
precedence = 1

include {
everyone = true
}
}
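
The stacks/cloudflare/ root module then instantiates it. The module label and variable wiring below are illustrative:

# infra/terraform/stacks/cloudflare/main.tf (illustrative wiring)
module "public_artifacts" {
  source = "../../modules/r2-public-artifacts"

  cloudflare_account_id = var.cloudflare_account_id
  zone_id               = var.zone_id
}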

With this in place:

  • terraform plan shows exactly what will change before it happens
  • terraform apply provisions everything in seconds
  • The configuration lives in Git -- reviewable, auditable, recoverable
  • A second environment (staging, EU region) is a new tfvars file
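
For example, assuming the module's bucket name and location were lifted into variables (they are hard-coded in the sketch above), a staging environment might need nothing more than:

# infra/terraform/environments/staging.tfvars (illustrative; assumes the
# module exposes bucket_name and bucket_location variables)
cloudflare_account_id = "<account_id>"
zone_id               = "<zone_id>"
bucket_name           = "auraison-artifacts-staging"
bucket_location       = "WEUR"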

Remediation plan

The manually provisioned R2 infrastructure must be imported into Terraform:

  1. Create the r2-public-artifacts module above under infra/terraform/modules/ and a stacks/cloudflare/ root module that instantiates it
  2. terraform import the existing R2 bucket, DNS record, Access app, and policies (see the import sketch after this list)
  3. Verify terraform plan shows no drift
  4. All future Cloudflare changes go through Terraform PRs
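
One hedged way to handle step 2 is Terraform 1.5+ import blocks rather than hand-run terraform import commands. The IDs below are placeholders, and each resource's import ID format should be checked against the Cloudflare provider documentation before applying:

# infra/terraform/stacks/cloudflare/import.tf (placeholder IDs)
import {
  to = module.public_artifacts.cloudflare_r2_bucket.artifacts
  id = "<account_id>/auraison-artifacts"
}

import {
  to = module.public_artifacts.cloudflare_record.artifacts
  id = "<zone_id>/<dns_record_id>"
}

After the imports are applied, step 3's terraform plan should report no changes; any diff means the code and the dashboard state have already drifted.
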
Caution

Until this remediation is complete, the R2 tutorial documents the current (manual) state. The Terraform module is the target state. Do not provision new cloud resources manually -- write the Terraform first.

Decision record

Decision: All cloud infrastructure is provisioned via Terraform. Manual provisioning is only acceptable as a time-boxed spike, immediately followed by IaC remediation.

Rationale: The R2 bucket setup took 6 manual steps across 3 Cloudflare dashboard sections. Reproducing this for a second environment would require re-reading a blog post. Terraform makes it a single command.

Consequence: New cloud resources require a Terraform PR before provisioning. The R2 bucket is flagged for import.