I Built a Tool That Writes Its Own Infrastructure

A complete beginner-friendly walkthrough of building SwiftDeploy: a CLI tool that generates Nginx configs, manages Docker containers, enforces deployment policies, and gives you a live metrics dashboard — all from a single YAML file.

Who This Post Is For

If you have heard words like “Docker”, “Nginx”, “deployment”, or “policy-as-code” and felt a little lost — this post is for you. I will explain every concept from scratch before using it. By the end, you will understand not just what I built, but why every piece exists.

The Problem This Solves

Imagine you are deploying a web app. Normally you would have to:

  1. Write a Docker Compose file to describe your containers
  2. Write an Nginx config to set up your web server
  3. Remember to update both files every time something changes
  4. Hope you did not make a typo somewhere

That is a lot of manual work, and manual work leads to mistakes.

What if instead you just wrote one simple file describing what you want, and a smart tool generated everything else?

That is exactly what SwiftDeploy does. You write manifest.yaml. SwiftDeploy does the rest.

Quick Glossary (Read This First)

Before we dive in, here are the key terms used throughout this post:

Docker — A tool that packages your app into a “container” — think of it like a shipping container for software. It runs the same way everywhere.

Container — A lightweight isolated box running your app. Like a mini computer inside your computer.

Docker Compose — A tool that lets you run multiple containers together and define how they connect.

Nginx — A web server that sits in front of your app and handles incoming traffic. Think of it as a receptionist who forwards calls to the right person.

Reverse proxy — A setup where Nginx receives requests from users and forwards them to your app. The user never talks to your app directly.

YAML — A simple file format using indentation to represent data. Like a structured shopping list.

CLI — Command Line Interface. A tool you run by typing commands in the terminal.

Canary deployment — A technique where you run two versions of your app at the same time — a stable version for most users and a “canary” version for testing. Like sending one canary into a mine before sending all the miners.

OPA (Open Policy Agent) — A policy engine. You write rules in a language called Rego, and OPA decides whether an action is allowed based on those rules.

Prometheus metrics — A standard format for exposing app statistics like request counts and response times. Looks like plain text with specific formatting.

Part 1 — The Foundation: manifest.yaml as the Single Source of Truth

What Is a “Single Source of Truth”?

In software, a “single source of truth” means one place where the authoritative information lives. If you have the same information in multiple files and they disagree, which one is right? Nobody knows. That is a bug waiting to happen.

SwiftDeploy solves this by making manifest.yaml the only file you ever edit. Every other config file is generated from it automatically.

Here is what the manifest looks like:

services:
  image: swift-deploy-1-node:latest   # Which Docker image to run
  port: 3000                          # What port your app uses inside the container
  mode: stable                        # stable or canary
  version: "1.0.0"                    # App version
  restart: unless-stopped             # Restart if it crashes

nginx:
  image: nginx:latest                 # The Nginx web server image
  port: 8080                          # The port the outside world connects to
  proxy_timeout: 30                   # Give up after 30 seconds

network:
  name: swiftdeploy-net               # Name of the internal network
  driver_type: bridge                 # Type of network

contact: "admin@swiftdeploy.local"    # Shown in error messages

Plain English: “Run my app on port 3000, put Nginx in front of it on port 8080, connect them on an internal network.”

How Does swiftdeploy init Work?

When you run ./swiftdeploy init, it:

  1. Reads manifest.yaml
  2. Opens templates/nginx.conf.tmpl — a template with placeholder values like {{NGINX_PORT}}
  3. Replaces every placeholder with the real value from the manifest
  4. Writes the result to nginx.conf
  5. Does the same thing for docker-compose.yml

The key insight is string replacement. The template has {{NGINX_PORT}} and the script replaces it with 8080. Simple but powerful.

# This is the core of how it works
nginx_conf = (nginx_conf
    .replace('{{NGINX_PORT}}', nginx_port)
    .replace('{{PROXY_TIMEOUT}}', proxy_timeout)
    .replace('{{APP_PORT}}', app_port))

The grader can delete the generated files and run ./swiftdeploy init again — they will be recreated perfectly from the manifest. That is the whole point.
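
For the curious, here is a minimal sketch of what init roughly does internally. The helper name and structure are illustrative, not the exact code from the repo:

import yaml  # pip install pyyaml

def render(template_path, output_path, replacements):
    # Read the template, swap every {{PLACEHOLDER}}, write the result
    with open(template_path) as f:
        content = f.read()
    for placeholder, value in replacements.items():
        content = content.replace(placeholder, str(value))
    with open(output_path, "w") as f:
        f.write(content)

with open("manifest.yaml") as f:
    manifest = yaml.safe_load(f)

render("templates/nginx.conf.tmpl", "nginx.conf", {
    "{{NGINX_PORT}}": manifest["nginx"]["port"],
    "{{PROXY_TIMEOUT}}": manifest["nginx"]["proxy_timeout"],
    "{{APP_PORT}}": manifest["services"]["port"],
})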

Part 2 — The Architecture: Three Containers, One Network

When you run ./swiftdeploy deploy, three containers start:

Outside World
      |
      | port 8080
      ↓
  ┌─────────┐
  │  Nginx  │  ← Receives all traffic, forwards to app
  └────┬────┘
       │ internal network (swiftdeploy-net)
       ↓
  ┌─────────┐
  │   App   │  ← Python service, port 3000 (never exposed to outside)
  └─────────┘

  ┌─────────┐
  │   OPA   │  ← Policy engine, port 8181 (CLI only, not through Nginx)
  └─────────┘

Why does the app port never get exposed?

This is a security decision. If port 3000 were exposed directly, anyone could bypass Nginx and hit your app without going through timeouts, error handling, logging, or security headers. By keeping it internal, all traffic is forced through Nginx.

Why is OPA isolated from Nginx?

OPA is only for the CLI to query. It is not a user-facing service. If it were accessible through Nginx on port 8080, anyone could query your policies. So OPA lives on the same internal Docker network but is never proxied by Nginx.

Part 3 — The App: A Python HTTP Service

The app (app/main.py) is a pure Python web server — no external frameworks, just Python’s built-in http.server. This keeps the Docker image tiny (74MB, well under the 300MB limit).

The Three Endpoints

GET / — Welcome message with mode, version, and timestamp:

{
  "message": "Welcome to SwiftDeploy! Running in canary mode.",
  "mode": "canary",
  "version": "1.0.0",
  "time": "2026-05-05T21:00:00+00:00"
}

GET /healthz — Is the app alive? How long has it been running?

{
  "status": "ok",
  "uptime": 342.15
}

Docker polls this endpoint every 10 seconds. If it fails three times in a row, Docker marks the container “unhealthy” and restarts it. That is why we never apply chaos to /healthz — if we did, Docker would kill our container during a chaos test.

POST /chaos — Only available in canary mode. Simulates failures:

# Make 50% of requests fail with 500 errors
curl -X POST http://localhost:8080/chaos \
  -d '{"mode": "error", "rate": 0.5}'

# Make every request sleep 5 seconds
curl -X POST http://localhost:8080/chaos \
  -d '{"mode": "slow", "duration": 5}'

# Cancel all chaos
curl -X POST http://localhost:8080/chaos \
  -d '{"mode": "recover"}'

Why Only in Canary Mode?

Because canary mode is the “I am testing something risky” mode. Stable mode means production traffic — you should never be able to break it intentionally. Canary mode is the sandbox where chaos makes sense.

Thread Safety

The app handles multiple requests at the same time. If two requests tried to update the chaos state simultaneously, they could corrupt each other. Python’s threading.Lock() prevents this:

# Lock means: only one request can change this at a time
with chaos_lock:
    chaos_state["mode"] = "error"
    chaos_state["rate"] = 0.5

This is called “thread safety” — making sure shared data is not corrupted by concurrent access.
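
To see how the lock and the chaos state fit into a request, here is a stripped-down stand-in for the real app/main.py, using only the standard library (a sketch, not the actual handler):

import json, random, time, threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

chaos_lock = threading.Lock()
chaos_state = {"mode": "none", "rate": 0.0, "duration": 0}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            return self._json(200, {"status": "ok"})   # never affected by chaos
        with chaos_lock:                                # read a consistent snapshot
            mode = chaos_state["mode"]
            rate = chaos_state["rate"]
            duration = chaos_state["duration"]
        if mode == "slow":
            time.sleep(duration)                        # simulate a slow upstream
        if mode == "error" and random.random() < rate:
            return self._json(500, {"error": "chaos injected"})
        self._json(200, {"message": "Welcome to SwiftDeploy!"})

    def _json(self, code, body):
        data = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)

ThreadingHTTPServer(("", 3000), Handler).serve_forever()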

Part 4 — The Nginx Configuration

Nginx sits between the user and the app. Here is what our generated nginx.conf does:

Custom Log Format

The task required logs in a specific format. In Nginx, you define log formats with log_format:

log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request';

This produces logs like:

2026-05-05T21:00:00+00:00 | 200 | 0.001s | 172.18.0.2:3000 | GET / HTTP/1.1

Timestamp | Status code | How long it took | Where it went | What was requested

JSON Error Responses

When the app is down or slow, Nginx returns an error. By default, Nginx returns HTML error pages — ugly and not useful for APIs. We override this with JSON:

location @err502 {
    default_type application/json;
    return 502 '{"error": "Bad Gateway", "code": 502, "service": "swiftdeploy-api", "contact": "admin@swiftdeploy.local"}';
}

502 means “Bad Gateway” — Nginx could not reach the app. 503 means “Service Unavailable”. 504 means “Gateway Timeout” — the app took too long.

The DNS Resolution Trick

One problem we hit: when running nginx -t to validate the config syntax, Nginx tries to resolve the hostname app (the Docker container name). But app only exists inside the Docker network — not during standalone validation. This caused a “host not found” error even though the config was syntactically correct.

The fix is to use a variable for the upstream address:

resolver 127.0.0.11 valid=10s;  # Docker's internal DNS server

location / {
    set $upstream http://app:3000;  # Variable = skip DNS at startup
    proxy_pass $upstream;           # Nginx resolves it per-request instead
}

When proxy_pass uses a variable, Nginx skips DNS resolution at startup and resolves it per-request. This lets nginx -t pass even when the app container does not exist yet.

Part 5 — Stable vs Canary Mode

Canary deployments come from a real practice in mining. Miners used to bring canaries into mines — if the canary died, the air was bad and the miners knew to leave. In software, a “canary” is a small deployment that gets traffic first. If it fails, only a small percentage of users are affected before you roll back.

SwiftDeploy implements a simple version:

  • Stable — normal mode, chaos disabled
  • Canary — test mode, chaos enabled, every response gets X-Mode: canary header

When you promote:

./swiftdeploy promote canary

The script:

  1. Runs a 30-second OPA policy check (explained in Part 7)
  2. Updates mode: canary in manifest.yaml in-place (preserving all comments)
  3. Regenerates docker-compose.yml with the new MODE=canary environment variable
  4. Restarts only the app container — Nginx stays up, no downtime
  5. Polls /healthz to confirm the new mode is active

The X-Mode: canary header indicates to clients that they are talking to the canary version. Nginx forwards it from the app to the user:

proxy_pass_header X-Mode;
add_header X-Mode $upstream_http_x_mode always;
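
Step 5 of the promote flow, confirming the new mode, can be a small polling loop. This sketch polls the root endpoint because it reports the mode field shown earlier; the function name and timeout are illustrative, not the exact CLI code:

import json, time, urllib.request

def wait_for_mode(expected, url="http://localhost:8080/", timeout=30):
    # Poll the service until it reports the expected mode, or give up
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if json.load(resp).get("mode") == expected:
                    return True
        except OSError:
            pass                       # container may still be restarting
        time.sleep(1)
    return False

if wait_for_mode("canary"):
    print("Promotion confirmed: app is serving in canary mode")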

Part 6 — Prometheus Metrics: The “Eyes” of the System

What Are Metrics?

Metrics are numbers that describe how your app is behaving. Things like:

  • How many requests per second?
  • What percentage are failing?
  • How long do requests take?

Prometheus is a popular monitoring system. It expects metrics in a specific text format.

What We Track

http_requests_total — A counter. Counts every request, labelled by method, path, and status code:

http_requests_total{method="GET",path="/",status_code="200"} 142
http_requests_total{method="GET",path="/",status_code="500"} 8

http_request_duration_seconds — A histogram. Groups request durations into buckets:

http_request_duration_seconds_bucket{le="0.1"} 140   # 140 requests took ≤ 0.1s
http_request_duration_seconds_bucket{le="0.5"} 142   # 142 requests took ≤ 0.5s
http_request_duration_seconds_bucket{le="+Inf"} 150  # 150 total requests

A histogram is what allows us to calculate percentile latency. The P99 (99th percentile) means “99% of requests completed within this time.” It is more meaningful than the average because averages hide slow outliers.

app_mode — 0 for stable, 1 for canary.

chaos_active — 0 for none, 1 for slow, 2 for error.

How record_request() Works

After every request completes, we call:

record_request("GET", "/", 200, duration)

This updates both the counter and the histogram in one call. We do NOT record /healthz or /metrics — those are infrastructure endpoints, not user traffic.
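
A pure-Python version of this bookkeeping stays small. The sketch below (illustrative, not the exact repo code) keeps a counter per label combination plus cumulative histogram buckets:

import threading
from collections import defaultdict

metrics_lock = threading.Lock()
request_counts = defaultdict(int)                       # (method, path, status) -> count
BUCKETS = [0.1, 0.5, 1.0, 5.0, float("inf")]
duration_buckets = {le: 0 for le in BUCKETS}
duration_sum = 0.0

def record_request(method, path, status_code, duration):
    global duration_sum
    if path in ("/healthz", "/metrics"):                # infrastructure endpoints are not user traffic
        return
    with metrics_lock:
        request_counts[(method, path, status_code)] += 1
        duration_sum += duration
        for le in BUCKETS:                              # buckets are cumulative: every bucket >= duration gets +1
            if duration <= le:
                duration_buckets[le] += 1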

Part 7 — OPA: The “Brain” That Makes Decisions

What Is OPA?

Open Policy Agent is a general-purpose policy engine. You write rules in a language called Rego (pronounced “ray-go”), and OPA evaluates them against data you send it.

The core principle: The CLI never decides whether to allow or deny. It only collects data, sends it to OPA, and surfaces the result. All decision logic lives in Rego.

Why does this matter? Because it means you can change your policies without touching your deployment code. Policy and mechanics are separated.

The Infrastructure Policy

Before every deployment, we check if the host has enough resources. Here is policies/infrastructure.rego:

package infrastructure

default allow := false

allow if {
    count(deny) == 0    # Allow if there are zero deny reasons
}

deny contains reason if {
    input.disk_free_gb < data.infrastructure.min_disk_free_gb
    reason := sprintf("Disk free %.1fGB is below minimum %dGB",
        [input.disk_free_gb, data.infrastructure.min_disk_free_gb])
}

deny contains reason if {
    input.cpu_load > data.infrastructure.max_cpu_load
    reason := sprintf("CPU load %.2f exceeds maximum %.2f",
        [input.cpu_load, data.infrastructure.max_cpu_load])
}

Notice: the thresholds (min_disk_free_gb, max_cpu_load) are not written in the Rego file. They come from data.json:

{
  "infrastructure": {
    "min_disk_free_gb": 10,
    "max_cpu_load": 2.0,
    "max_mem_percent": 90
  }
}

This separation means: change the threshold by editing data.json, not the policy logic. Environments have different needs — a production server and a test server should not have the same disk requirements.
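
On the CLI side, gathering the input values for this policy needs nothing beyond the standard library. A rough sketch (the memory reading is Linux-specific, and the function name is illustrative):

import os, shutil

def gather_host_facts():
    # Disk space on the root filesystem, in gigabytes
    disk_free_gb = shutil.disk_usage("/").free / (1024 ** 3)
    # 1-minute load average
    cpu_load = os.getloadavg()[0]
    # Memory usage from /proc/meminfo (Linux only), values are in kB
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = int(value.split()[0])
    mem_percent = 100 * (1 - meminfo["MemAvailable"] / meminfo["MemTotal"])
    return {"disk_free_gb": round(disk_free_gb, 1),
            "cpu_load": round(cpu_load, 2),
            "mem_percent": round(mem_percent, 1)}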

The Canary Safety Policy

Before every promotion, we check if the running service is healthy enough to switch modes. This is policies/canary.rego:

package canary

default allow := false

allow if { count(deny) == 0 }

deny contains reason if {
    input.error_rate_percent > data.canary.max_error_rate_percent
    reason := sprintf("Error rate %.2f%% exceeds maximum %.2f%% over last 30s",
        [input.error_rate_percent, data.canary.max_error_rate_percent])
}

deny contains reason if {
    input.p99_latency_ms > data.canary.max_p99_latency_ms
    reason := sprintf("P99 latency %.0fms exceeds maximum %dms over last 30s",
        [input.p99_latency_ms, data.canary.max_p99_latency_ms])
}

How the 30-Second Window Works

The requirement says “over the last 30 seconds.” But Prometheus metrics are cumulative — they count from when the app started, not just the last 30 seconds.

The trick: take two snapshots 30 seconds apart and subtract them.

Snapshot 1 (time T)    →    wait 30s    →    Snapshot 2 (time T+30)

delta = Snapshot2 - Snapshot1   ←  this is what happened in the last 30 seconds
delta_total  = sum(totals2.values()) - sum(totals1.values())
delta_errors = sum(errors in snapshot2) - sum(errors in snapshot1)
error_rate_percent = (delta_errors / delta_total * 100)

This gives us a true “what happened in the last 30 seconds” measurement, regardless of historical traffic.
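
In code, assuming a scrape_counters() helper that parses /metrics into a dictionary keyed by labels (that helper is an assumption here, not shown), the window calculation looks roughly like this:

import time

def error_rate_over_window(scrape_counters, window=30):
    # scrape_counters() -> {(method, path, status_code): cumulative_count}
    before = scrape_counters()
    time.sleep(window)
    after = scrape_counters()

    delta_total = sum(after.values()) - sum(before.values())
    delta_errors = (
        sum(v for (m, p, s), v in after.items() if s.startswith("5"))
        - sum(v for (m, p, s), v in before.items() if s.startswith("5"))
    )
    if delta_total == 0:
        return 0.0                      # no traffic in the window
    return delta_errors / delta_total * 100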

How the CLI Talks to OPA

OPA exposes a REST API. The CLI sends a POST request with the data:

curl -X POST http://127.0.0.1:8181/v1/data/infrastructure \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "disk_free_gb": 11.0,
      "cpu_load": 0.18,
      "mem_percent": 53.5
    }
  }'

OPA responds with:

{
  "result": {
    "allow": true,
    "deny": []
  }
}

Or if blocked:

{
  "result": {
    "allow": false,
    "deny": ["Disk free 4.0GB is below minimum 10GB"]
  }
}

The CLI reads allow to know what to do, and prints the deny reasons so the operator knows exactly why something was blocked.
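
The same request can be made from Python with just the standard library. A sketch of what the CLI's OPA call might look like (names are illustrative):

import json, urllib.request

def opa_check(policy, input_doc, opa_url="http://127.0.0.1:8181"):
    # POST the input document to OPA and return (allow, deny_reasons)
    body = json.dumps({"input": input_doc}).encode()
    req = urllib.request.Request(
        f"{opa_url}/v1/data/{policy}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        result = json.load(resp).get("result", {})
    return result.get("allow", False), result.get("deny", [])

allow, reasons = opa_check("infrastructure",
                           {"disk_free_gb": 11.0, "cpu_load": 0.18, "mem_percent": 53.5})
if not allow:
    for reason in reasons:
        print(f"  BLOCK: {reason}")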

Part 8 — What Happened When I Injected Chaos

This is where it gets interesting. After deploying in canary mode and activating error chaos:

./swiftdeploy promote canary

curl -X POST http://localhost:8080/chaos \
  -d '{"mode": "error", "rate": 0.5}'

The status dashboard started showing failures:

══════════════════════════════════════════════════════
  SwiftDeploy Status Dashboard
  2026-05-05T21:55:12Z
══════════════════════════════════════════════════════
  Mode:        canary
  Chaos:       error

  Total reqs:  89
  Error rate:  48.31%
  P99 latency: 10000ms

  Policy Compliance:
    ✅ infrastructure: PASS (disk=11GB, cpu=0.22, mem=54.1%)
    ❌ canary: FAIL (error=48.31%, p99=10000ms)

Then when I tried to promote to stable while the chaos was running:

── OPA Pre-Promote Policy Check ──────────────────────
  Sampling metrics over a 30-second window...
  Last 30s: 80 requests, 36 errors
  Error rate: 45.0000%, P99 latency: 10000.00ms
  ❌ BLOCK: Canary policy denied promotion
    ↳ Error rate 45.00% exceeds maximum 1.00% over last 30s
    ↳ P99 latency 10000ms exceeds maximum 500ms over last 30s
❌ Promotion blocked by policy.

The policy worked. The system refused to let me promote while the service was actively failing. This is the whole point — you cannot accidentally promote a broken service.

After recovering:

curl -X POST http://localhost:8080/chaos -d '{"mode": "recover"}'
./swiftdeploy promote stable  # Now succeeds

Part 9 — The Audit Trail

Every significant event is written to history.jsonl:

{"timestamp":"2026-05-05T21:00:00Z","event":"deploy","details":{"mode":"stable"}}
{"timestamp":"2026-05-05T21:10:00Z","event":"promote","details":{"mode":"canary"}}
{"timestamp":"2026-05-05T21:15:00Z","event":"policy_check","details":{"policy":"canary","result":"fail","error_rate_percent":45.0}}
{"timestamp":"2026-05-05T21:20:00Z","event":"promote","details":{"mode":"stable"}}

Running ./swiftdeploy audit transforms this into a readable Markdown report:

# SwiftDeploy Audit Report

Generated: 2026-05-05T21:30:00Z
Total events: 24

## Timeline

| Timestamp | Event | Details |
|-----------|-------|---------|
| 2026-05-05T21:00:00Z | deploy | mode=stable |
| 2026-05-05T21:10:00Z | promote | → canary |
| 2026-05-05T21:15:00Z | policy_check | canary: fail |

## Policy Violations

| Timestamp | Policy | Reasons |
|-----------|--------|---------|
| 2026-05-05T21:15:00Z | canary | Error rate 45.00% exceeds maximum 1.00% over last 30s |
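
Producing that report from history.jsonl is essentially a loop over JSON lines. A simplified sketch (not the exact audit implementation):

import json

def render_audit(path="history.jsonl"):
    with open(path) as f:
        events = [json.loads(line) for line in f if line.strip()]
    lines = [
        "# SwiftDeploy Audit Report", "",
        f"Total events: {len(events)}", "",
        "## Timeline", "",
        "| Timestamp | Event | Details |",
        "|-----------|-------|---------|",
    ]
    for e in events:
        details = ", ".join(f"{k}={v}" for k, v in e.get("details", {}).items())
        lines.append(f"| {e['timestamp']} | {e['event']} | {details} |")
    return "\n".join(lines)

print(render_audit())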

Part 10 — Lessons Learned

1. The DNS Resolution Problem Was the Biggest Surprise

I spent a long time debugging why nginx -t kept failing with “host not found” even though my config was correct. The issue was that Nginx tries to resolve hostnames at startup — but app is a Docker hostname that only exists inside a running Docker network.

The fix (using set $upstream as a variable) was not obvious. It is an Nginx-specific trick that delays DNS resolution from startup to per-request. Learning this saved me from a broken validation check.

2. Capabilities Matter More Than You Think

Running containers as non-root is well-known advice. But dropping capabilities (like cap_drop: ALL) is less discussed. When I dropped all capabilities from Nginx, it crashed on startup with:

chown("https://dev.to/var/cache/nginx/client_temp", 101) failed (1: Operation not permitted)

Nginx needs CHOWN to set up its temp directories. Dropping ALL capabilities silently breaks things. The lesson: drop specific capabilities you know the container does not need, not everything at once.

3. OPA’s sprintf Handles Integers and Floats Differently

When my Rego policy tried to format an integer from data.json using %.0f, OPA produced %!f(int=500) — a formatting error. The fix was to use %d for integers from data.json and %.0f only for float inputs. OPA does not silently convert types the way Python does.

4. Cumulative Metrics Need Careful Handling

Prometheus metrics are cumulative counters — they only go up. To calculate “what happened in the last 30 seconds,” you cannot just look at the current value. You need two snapshots and a delta. This is obvious in retrospect, but took a few failed attempts to get right.

5. The Manifest as the Single Source of Truth Actually Works

The grader requirement was: delete generated files, run ./swiftdeploy init, and verify they regenerate correctly. This forced a discipline that turned out to be genuinely useful. When you know everything comes from one file, debugging is much easier. There is only one place to look.

Complete Setup Instructions

# Clone the repo
git clone https://github.com/YOUR_USERNAME/swiftdeploy-automation.git
cd swiftdeploy-automation

# Install Python dependency
pip install pyyaml

# Build the Docker image
docker build -t swift-deploy-1-node:latest .

# Make CLI executable
chmod +x swiftdeploy

# Deploy
./swiftdeploy deploy

# Try everything
curl http://localhost:8080/
curl http://localhost:8080/healthz
curl http://localhost:8080/metrics
./swiftdeploy status
./swiftdeploy promote canary
./swiftdeploy promote stable
./swiftdeploy audit
./swiftdeploy teardown --clean

Conclusion

SwiftDeploy taught me that the most valuable thing in infrastructure is not the tools themselves — it is the discipline they enforce.

Making the manifest the single source of truth means you cannot have config drift. Putting all policy logic in OPA means your deployment tool cannot make silent decisions. Requiring metrics before promotion means you cannot promote a broken service by accident.

Each of these constraints feels like a restriction at first. But each one also prevents a whole category of mistakes.

If you want to replicate this project, the full source code is at the GitHub link below. Every file is commented and explained. Start with manifest.yaml, read swiftdeploy from top to bottom, then look at the Rego policies.

The best way to learn is to break it intentionally — fill up the disk, inject chaos, watch the policies fire. That is what the chaos endpoint is for.

Source code: github.com/devops-timi/swiftdeploy-automation
