A complete beginner-friendly walkthrough of building SwiftDeploy: a CLI tool that generates Nginx configs, manages Docker containers, enforces deployment policies, and gives you a live metrics dashboard — all from a single YAML file.
Who This Post Is For
If you have heard words like “Docker”, “Nginx”, “deployment”, or “policy-as-code” and felt a little lost — this post is for you. I will explain every concept from scratch before using it. By the end, you will understand not just what I built, but why every piece exists.
The Problem This Solves
Imagine you are deploying a web app. Normally you would have to:
- Write a Docker Compose file to describe your containers
- Write an Nginx config to set up your web server
- Remember to update both files every time something changes
- Hope you did not make a typo somewhere
That is a lot of manual work, and manual work leads to mistakes.
What if instead you just wrote one simple file describing what you want, and a smart tool generated everything else?
That is exactly what SwiftDeploy does. You write manifest.yaml. SwiftDeploy does the rest.
Quick Glossary (Read This First)
Before we dive in, here are the key terms used throughout this post:
Docker — A tool that packages your app into a “container” — think of it like a shipping container for software. It runs the same way everywhere.
Container — A lightweight isolated box running your app. Like a mini computer inside your computer.
Docker Compose — A tool that lets you run multiple containers together and define how they connect.
Nginx — A web server that sits in front of your app and handles incoming traffic. Think of it as a receptionist who forwards calls to the right person.
Reverse proxy — When Nginx receives a request from a user and forwards it to your app. The user never talks to your app directly.
YAML — A simple file format using indentation to represent data. Like a structured shopping list.
CLI — Command Line Interface. A tool you run by typing commands in the terminal.
Canary deployment — A technique where you run two versions of your app at the same time — a stable version for most users and a “canary” version for testing. Like sending one canary into a mine before sending all the miners.
OPA (Open Policy Agent) — A policy engine. You write rules in a language called Rego, and OPA decides whether an action is allowed based on those rules.
Prometheus metrics — A standard format for exposing app statistics like request counts and response times. Looks like plain text with specific formatting.
Part 1 — The Foundation: manifest.yaml as the Single Source of Truth
What Is a “Single Source of Truth”?
In software, a “single source of truth” means one place where the authoritative information lives. If you have the same information in multiple files and they disagree, which one is right? Nobody knows. That is a bug waiting to happen.
SwiftDeploy solves this by making manifest.yaml the only file you ever edit. Every other config file is generated from it automatically.
Here is what the manifest looks like:
services:
  image: swift-deploy-1-node:latest   # Which Docker image to run
  port: 3000                          # What port your app uses inside the container
  mode: stable                        # stable or canary
  version: "1.0.0"                    # App version
  restart: unless-stopped             # Restart if it crashes

nginx:
  image: nginx:latest                 # The Nginx web server image
  port: 8080                          # The port the outside world connects to
  proxy_timeout: 30                   # Give up after 30 seconds

network:
  name: swiftdeploy-net               # Name of the internal network
  driver_type: bridge                 # Type of network

contact: "admin@swiftdeploy.local"    # Shown in error messages
Plain English: “Run my app on port 3000, put Nginx in front of it on port 8080, connect them on an internal network.”
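Since the CLI is written in Python (the setup step later installs pyyaml), loading this manifest is a few lines. Here is a minimal, self-contained sketch; the manifest is inlined as a string so you can run it as-is, whereas SwiftDeploy reads manifest.yaml from disk:

```python
import yaml  # third-party; installed in the setup step with `pip install pyyaml`

# Inlined for a self-contained example; SwiftDeploy reads manifest.yaml instead.
MANIFEST = """
services:
  image: swift-deploy-1-node:latest
  port: 3000
  mode: stable
nginx:
  port: 8080
  proxy_timeout: 30
"""

manifest = yaml.safe_load(MANIFEST)
app_port = manifest["services"]["port"]   # 3000
nginx_port = manifest["nginx"]["port"]    # 8080
```

Everything downstream (templating, compose generation, policy inputs) starts from this one parsed dictionary.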
How Does swiftdeploy init Work?
When you run ./swiftdeploy init, it:
- Reads manifest.yaml
- Opens templates/nginx.conf.tmpl — a template with placeholder values like {{NGINX_PORT}}
- Replaces every placeholder with the real value from the manifest
- Writes the result to nginx.conf
- Does the same thing for docker-compose.yml
The key insight is string replacement. The template has {{NGINX_PORT}} and the script replaces it with 8080. Simple but powerful.
# This is the core of how it works
nginx_conf = (nginx_conf
    .replace('{{NGINX_PORT}}', nginx_port)
    .replace('{{PROXY_TIMEOUT}}', proxy_timeout)
    .replace('{{APP_PORT}}', app_port))
The grader can delete the generated files and run ./swiftdeploy init again — they will be recreated perfectly from the manifest. That is the whole point.
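The whole regeneration flow fits in a few lines. Here is a self-contained sketch (the function name and sample template are illustrative, not SwiftDeploy's actual internals):

```python
def render_template(template_text: str, values: dict) -> str:
    """Replace every {{KEY}} placeholder with its value from the manifest."""
    for key, value in values.items():
        template_text = template_text.replace("{{" + key + "}}", str(value))
    return template_text

template = "listen {{NGINX_PORT}};\nproxy_read_timeout {{PROXY_TIMEOUT}}s;"
rendered = render_template(template, {"NGINX_PORT": 8080, "PROXY_TIMEOUT": 30})
print(rendered)
# listen 8080;
# proxy_read_timeout 30s;
```

Because rendering is deterministic, running it twice from the same manifest always produces identical output files.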
Part 2 — The Architecture: Three Containers, One Network
When you run ./swiftdeploy deploy, three containers start:
Outside World
|
| port 8080
↓
┌─────────┐
│ Nginx │ ← Receives all traffic, forwards to app
└────┬────┘
│ internal network (swiftdeploy-net)
↓
┌─────────┐
│ App │ ← Python service, port 3000 (never exposed to outside)
└─────────┘
┌─────────┐
│ OPA │ ← Policy engine, port 8181 (CLI only, not through Nginx)
└─────────┘
Why does the app port never get exposed?
This is a security decision. If port 3000 were exposed directly, anyone could bypass Nginx and hit your app without going through timeouts, error handling, logging, or security headers. By keeping it internal, all traffic is forced through Nginx.
Why is OPA isolated from Nginx?
OPA is only for the CLI to query. It is not a user-facing service. If it were accessible through Nginx on port 8080, anyone could query your policies. So OPA lives on the same internal Docker network but is never proxied by Nginx.
Part 3 — The App: A Python HTTP Service
The app (app/main.py) is a pure Python web server — no external frameworks, just Python’s built-in http.server. This keeps the Docker image tiny (74MB, well under the 300MB limit).
The Three Endpoints
GET / — Welcome message with mode, version, and timestamp:
{
"message": "Welcome to SwiftDeploy! Running in canary mode.",
"mode": "canary",
"version": "1.0.0",
"time": "2026-05-05T21:00:00+00:00"
}
GET /healthz — Is the app alive? How long has it been running?
{
"status": "ok",
"uptime": 342.15
}
Docker polls this endpoint every 10 seconds. If it fails three times in a row, Docker marks the container “unhealthy” and restarts it. That is why we never apply chaos to /healthz — if we did, Docker would kill our container during a chaos test.
POST /chaos — Only available in canary mode. Simulates failures:
# Make 50% of requests fail with 500 errors
curl -X POST http://localhost:8080/chaos \
  -d '{"mode": "error", "rate": 0.5}'

# Make every request sleep 5 seconds
curl -X POST http://localhost:8080/chaos \
  -d '{"mode": "slow", "duration": 5}'

# Cancel all chaos
curl -X POST http://localhost:8080/chaos \
  -d '{"mode": "recover"}'
Why Only in Canary Mode?
Because canary mode is the “I am testing something risky” mode. Stable mode means production traffic — you should never be able to break it intentionally. Canary mode is the sandbox where chaos makes sense.
Thread Safety
The app handles multiple requests at the same time. If two requests tried to update the chaos state simultaneously, they could corrupt each other. Python’s threading.Lock() prevents this:
# Lock means: only one request can change this at a time
with chaos_lock:
chaos_state["mode"] = "error"
chaos_state["rate"] = 0.5
This is called “thread safety” — making sure shared data is not corrupted by concurrent access.
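Here is a runnable demonstration of the same pattern: many threads update the shared state concurrently, and the lock guarantees each two-field update happens atomically (the names mirror the snippet above):

```python
import threading

chaos_state = {"mode": "none", "rate": 0.0}
chaos_lock = threading.Lock()

def set_chaos(mode: str, rate: float) -> None:
    # Without the lock, two threads could interleave these two writes
    # and leave mode and rate from different requests mixed together.
    with chaos_lock:
        chaos_state["mode"] = mode
        chaos_state["rate"] = rate

# Simulate many concurrent /chaos requests
threads = [threading.Thread(target=set_chaos, args=("error", 0.5)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(chaos_state)  # {'mode': 'error', 'rate': 0.5}
```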
Part 4 — The Nginx Configuration
Nginx sits between the user and the app. Here is what our generated nginx.conf does:
Custom Log Format
The task required logs in a specific format. In Nginx, you define log formats with log_format:
log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request';
This produces logs like:
2026-05-05T21:00:00+00:00 | 200 | 0.001s | 172.18.0.2:3000 | GET / HTTP/1.1
Timestamp | Status code | How long it took | Where it went | What was requested
JSON Error Responses
When the app is down or slow, Nginx returns an error. By default, Nginx returns HTML error pages — ugly and not useful for APIs. We override this with JSON:
location @err502 {
default_type application/json;
return 502 '{"error": "Bad Gateway", "code": 502, "service": "swiftdeploy-api", "contact": "admin@swiftdeploy.local"}';
}
502 means “Bad Gateway” — Nginx could not reach the app. 503 means “Service Unavailable”. 504 means “Gateway Timeout” — the app took too long.
The DNS Resolution Trick
One problem we hit: when running nginx -t to validate the config syntax, Nginx tries to resolve the hostname app (the Docker container name). But app only exists inside the Docker network — not during standalone validation. This caused a “host not found” error even though the config was syntactically correct.
The fix is to use a variable for the upstream address:
resolver 127.0.0.11 valid=10s; # Docker's internal DNS server
location / {
set $upstream http://app:3000; # Variable = skip DNS at startup
proxy_pass $upstream; # Nginx resolves it per-request instead
}
When proxy_pass uses a variable, Nginx skips DNS resolution at startup and resolves it per-request. This lets nginx -t pass even when the app container does not exist yet.
Part 5 — Stable vs Canary Mode
Canary deployments come from a real practice in mining. Miners used to bring canaries into mines — if the canary died, the air was bad and the miners knew to leave. In software, a “canary” is a small deployment that gets traffic first. If it fails, only a small percentage of users are affected before you roll back.
SwiftDeploy implements a simple version:
- Stable — normal mode, chaos disabled
- Canary — test mode, chaos enabled, every response gets an X-Mode: canary header
When you promote:
./swiftdeploy promote canary
The script:
- Runs a 30-second OPA policy check (explained in Part 7)
- Updates mode: canary in manifest.yaml in-place (preserving all comments)
- Regenerates docker-compose.yml with the new MODE=canary environment variable
- Restarts only the app container — Nginx stays up, no downtime
- Polls /healthz to confirm the new mode is active
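That final confirmation step, polling until the new mode is live, might look like this (a sketch, assuming GET / reports the current mode as shown in Part 3; the function name and signature are illustrative):

```python
import json
import time
import urllib.request

def wait_for_mode(url: str, expected_mode: str, timeout: float = 30.0) -> bool:
    """Poll the app until it reports the expected mode, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                body = json.loads(resp.read())
            if body.get("mode") == expected_mode:
                return True
        except OSError:
            pass  # the app container may still be restarting
        time.sleep(1)
    return False
```

The timeout matters: if the new container never comes up healthy, the promote command should fail loudly rather than hang forever.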
The X-Mode: canary header indicates to clients that they are talking to the canary version. Nginx forwards it from the app to the user:
proxy_pass_header X-Mode;
add_header X-Mode $upstream_http_x_mode always;
Part 6 — Prometheus Metrics: The “Eyes” of the System
What Are Metrics?
Metrics are numbers that describe how your app is behaving. Things like:
- How many requests per second?
- What percentage are failing?
- How long do requests take?
Prometheus is a popular monitoring system. It expects metrics in a specific text format.
What We Track
http_requests_total — A counter. Counts every request, labelled by method, path, and status code:
http_requests_total{method="GET",path="/",status_code="200"} 142
http_requests_total{method="GET",path="/",status_code="500"} 8
http_request_duration_seconds — A histogram. Groups request durations into buckets:
http_request_duration_seconds_bucket{le="0.1"} 140 # 140 requests took ≤ 0.1s
http_request_duration_seconds_bucket{le="0.5"} 142 # 142 requests took ≤ 0.5s
http_request_duration_seconds_bucket{le="+Inf"} 150 # 150 total requests
A histogram is what allows us to calculate percentile latency. The P99 (99th percentile) means “99% of requests completed within this time.” It is more meaningful than the average because averages hide slow outliers.
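A rough way to read a percentile off those cumulative buckets is to walk them until you pass the target count. This upper-bound walk is cruder than Prometheus's histogram_quantile, which interpolates within a bucket, but it shows the idea:

```python
def percentile_from_buckets(buckets, quantile):
    """buckets: list of (upper_bound, cumulative_count), ending with +Inf."""
    total = buckets[-1][1]        # the +Inf bucket counts every request
    target = total * quantile     # e.g. the 148.5th request for P99 of 150
    for upper_bound, cumulative in buckets:
        if cumulative >= target:
            return upper_bound    # first bucket that contains the target request
    return float("inf")

# The bucket counts from the example above
buckets = [(0.1, 140), (0.5, 142), (float("inf"), 150)]
print(percentile_from_buckets(buckets, 0.50))  # 0.1  -- the median request is fast
print(percentile_from_buckets(buckets, 0.99))  # inf  -- the slowest 1% fell past every finite bucket
```

Note how the P99 here is unbounded even though the median is 0.1s: eight slow requests out of 150 are invisible to the average but dominate the tail.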
app_mode — 0 for stable, 1 for canary.
chaos_active — 0 for none, 1 for slow, 2 for error.
How record_request() Works
After every request completes, we call:
record_request("GET", "/", 200, duration)
This updates both the counter and the histogram in one call. We do NOT record /healthz or /metrics — those are infrastructure endpoints, not user traffic.
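A minimal sketch of what such a record_request could look like (the bucket bounds and data structures here are assumptions for illustration, not SwiftDeploy's actual code):

```python
import threading
from collections import defaultdict

BUCKET_BOUNDS = (0.1, 0.5, 1.0, 5.0, float("inf"))  # assumed bucket bounds
metrics_lock = threading.Lock()
request_totals = defaultdict(int)    # (method, path, status_code) -> count
duration_buckets = defaultdict(int)  # upper bound -> cumulative count

def record_request(method, path, status_code, duration):
    if path in ("/healthz", "/metrics"):
        return  # infrastructure endpoints are not user traffic
    with metrics_lock:
        request_totals[(method, path, status_code)] += 1
        for bound in BUCKET_BOUNDS:
            if duration <= bound:
                duration_buckets[bound] += 1  # cumulative: every bucket >= duration counts it

record_request("GET", "/", 200, 0.03)
record_request("GET", "/healthz", 200, 0.001)  # silently ignored
```

The same lock used for chaos-state updates applies here: metrics are shared mutable state, so concurrent requests must not race on them.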
Part 7 — OPA: The “Brain” That Makes Decisions
What Is OPA?
Open Policy Agent is a general-purpose policy engine. You write rules in a language called Rego (pronounced “ray-go”), and OPA evaluates them against data you send it.
The core principle: The CLI never decides whether to allow or deny. It only collects data, sends it to OPA, and surfaces the result. All decision logic lives in Rego.
Why does this matter? Because it means you can change your policies without touching your deployment code. Policy and mechanics are separated.
The Infrastructure Policy
Before every deployment, we check if the host has enough resources. Here is policies/infrastructure.rego:
package infrastructure
default allow := false
allow if {
count(deny) == 0 # Allow if there are zero deny reasons
}
deny contains reason if {
input.disk_free_gb < data.infrastructure.min_disk_free_gb
reason := sprintf("Disk free %.1fGB is below minimum %dGB",
[input.disk_free_gb, data.infrastructure.min_disk_free_gb])
}
deny contains reason if {
input.cpu_load > data.infrastructure.max_cpu_load
reason := sprintf("CPU load %.2f exceeds maximum %.2f",
[input.cpu_load, data.infrastructure.max_cpu_load])
}
Notice: the thresholds (min_disk_free_gb, max_cpu_load) are not written in the Rego file. They come from data.json:
{
"infrastructure": {
"min_disk_free_gb": 10,
"max_cpu_load": 2.0,
"max_mem_percent": 90
}
}
This separation means: change the threshold by editing data.json, not the policy logic. Environments have different needs — a production server and a test server should not have the same disk requirements.
The Canary Safety Policy
Before every promotion, we check if the running service is healthy enough to switch modes. This is policies/canary.rego:
package canary
default allow := false
allow if { count(deny) == 0 }
deny contains reason if {
input.error_rate_percent > data.canary.max_error_rate_percent
reason := sprintf("Error rate %.2f%% exceeds maximum %.2f%% over last 30s",
[input.error_rate_percent, data.canary.max_error_rate_percent])
}
deny contains reason if {
input.p99_latency_ms > data.canary.max_p99_latency_ms
reason := sprintf("P99 latency %.0fms exceeds maximum %dms over last 30s",
[input.p99_latency_ms, data.canary.max_p99_latency_ms])
}
How the 30-Second Window Works
The requirement says “over the last 30 seconds.” But Prometheus metrics are cumulative — they count from when the app started, not just the last 30 seconds.
The trick: take two snapshots 30 seconds apart and subtract them.
Snapshot 1 (time T) → wait 30s → Snapshot 2 (time T+30)
delta = Snapshot2 - Snapshot1 ← this is what happened in the last 30 seconds
delta_total = sum(totals2.values()) - sum(totals1.values())
delta_errors = sum(errors in snapshot2) - sum(errors in snapshot1)
error_rate_percent = (delta_errors / delta_total * 100)
This gives us a true “what happened in the last 30 seconds” measurement, regardless of historical traffic.
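In Python, that delta computation reduces to a few lines (the snapshot shape is an assumption; any mapping of metric labels to cumulative counts works the same way):

```python
def error_rate_over_window(snap1, snap2):
    """Each snapshot maps (method, path, status_code) -> cumulative count."""
    delta_total = sum(snap2.values()) - sum(snap1.values())
    delta_errors = (
        sum(v for (m, p, status), v in snap2.items() if status >= 500)
        - sum(v for (m, p, status), v in snap1.items() if status >= 500)
    )
    if delta_total == 0:
        return 0.0  # no traffic during the window
    return delta_errors / delta_total * 100

# Mirrors the promote output later in the post: 80 requests, 36 errors -> 45%
snap1 = {("GET", "/", 200): 100, ("GET", "/", 500): 10}
snap2 = {("GET", "/", 200): 144, ("GET", "/", 500): 46}
print(error_rate_over_window(snap1, snap2))  # 45.0
```

The zero-traffic guard matters: with no requests in the window, dividing by zero would crash the policy check instead of letting it decide.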
How the CLI Talks to OPA
OPA exposes a REST API. The CLI sends a POST request with the data:
curl -X POST http://127.0.0.1:8181/v1/data/infrastructure \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "disk_free_gb": 11.0,
      "cpu_load": 0.18,
      "mem_percent": 53.5
    }
  }'
OPA responds with:
{
"result": {
"allow": true,
"deny": []
}
}
Or if blocked:
{
"result": {
"allow": false,
"deny": ["Disk free 4.0GB is below minimum 10GB"]
}
}
The CLI reads allow to know what to do, and prints the deny reasons so the operator knows exactly why something was blocked.
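With Python's standard library alone, that round trip is about a dozen lines (a sketch; error handling and retries are omitted):

```python
import json
import urllib.request

def query_opa(policy_path, input_data, base="http://127.0.0.1:8181"):
    """POST input to OPA's data API; return (allow, deny_reasons)."""
    req = urllib.request.Request(
        f"{base}/v1/data/{policy_path}",
        data=json.dumps({"input": input_data}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        result = json.load(resp).get("result", {})
    return result.get("allow", False), result.get("deny", [])

# Example call against a running OPA container:
# allowed, reasons = query_opa("infrastructure",
#     {"disk_free_gb": 11.0, "cpu_load": 0.18, "mem_percent": 53.5})
```

Defaulting allow to False is deliberate: if OPA returns something unexpected, the safe answer is "deny", matching the `default allow := false` in the policies themselves.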
Part 8 — What Happened When I Injected Chaos
This is where it gets interesting. After deploying in canary mode and activating error chaos:
./swiftdeploy promote canary
curl -X POST http://localhost:8080/chaos \
  -d '{"mode": "error", "rate": 0.5}'
The status dashboard started showing failures:
══════════════════════════════════════════════════════
SwiftDeploy Status Dashboard
2026-05-05T21:55:12Z
══════════════════════════════════════════════════════
Mode: canary
Chaos: error
Total reqs: 89
Error rate: 48.31%
P99 latency: 10000ms
Policy Compliance:
✅ infrastructure: PASS (disk=11GB, cpu=0.22, mem=54.1%)
❌ canary: FAIL (error=48.31%, p99=10000ms)
Then when I tried to promote to stable while the chaos was running:
── OPA Pre-Promote Policy Check ──────────────────────
Sampling metrics over a 30-second window...
Last 30s: 80 requests, 36 errors
Error rate: 45.0000%, P99 latency: 10000.00ms
❌ BLOCK: Canary policy denied promotion
↳ Error rate 45.00% exceeds maximum 1.00% over last 30s
↳ P99 latency 10000ms exceeds maximum 500ms over last 30s
❌ Promotion blocked by policy.
The policy worked. The system refused to let me promote while the service was actively failing. This is the whole point — you cannot accidentally promote a broken service.
After recovering:
curl -X POST http://localhost:8080/chaos -d '{"mode": "recover"}'
./swiftdeploy promote stable # Now succeeds
Part 9 — The Audit Trail
Every significant event is written to history.jsonl:
{"timestamp":"2026-05-05T21:00:00Z","event":"deploy","details":{"mode":"stable"}}
{"timestamp":"2026-05-05T21:10:00Z","event":"promote","details":{"mode":"canary"}}
{"timestamp":"2026-05-05T21:15:00Z","event":"policy_check","details":{"policy":"canary","result":"fail","error_rate_percent":45.0}}
{"timestamp":"2026-05-05T21:20:00Z","event":"promote","details":{"mode":"stable"}}
Running ./swiftdeploy audit transforms this into a readable Markdown report:
# SwiftDeploy Audit Report
Generated: 2026-05-05T21:30:00Z
Total events: 24
## Timeline
| Timestamp | Event | Details |
|-----------|-------|---------|
| 2026-05-05T21:00:00Z | deploy | mode=stable |
| 2026-05-05T21:10:00Z | promote | → canary |
| 2026-05-05T21:15:00Z | policy_check | canary: fail |
## Policy Violations
| Timestamp | Policy | Reasons |
|-----------|--------|---------|
| 2026-05-05T21:15:00Z | canary | Error rate 45.00% exceeds maximum 1.00% over last 30s |
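The transformation from JSONL to that timeline table is mechanical. A minimal sketch (the real swiftdeploy audit also builds the summary header and the violations section):

```python
import json

def audit_timeline(jsonl_text):
    """Render history.jsonl lines as a Markdown timeline table."""
    rows = ["| Timestamp | Event | Details |", "|-----------|-------|---------|"]
    for line in jsonl_text.strip().splitlines():
        event = json.loads(line)
        details = ", ".join(f"{k}={v}" for k, v in event.get("details", {}).items())
        rows.append(f"| {event['timestamp']} | {event['event']} | {details} |")
    return "\n".join(rows)

history = '{"timestamp":"2026-05-05T21:00:00Z","event":"deploy","details":{"mode":"stable"}}'
print(audit_timeline(history))
```

Because JSONL is append-only (one JSON object per line), the audit log survives crashes mid-write: a truncated last line can be detected and skipped without corrupting earlier history.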
Part 10 — Lessons Learned
1. The DNS Resolution Problem Was the Biggest Surprise
I spent a long time debugging why nginx -t kept failing with “host not found” even though my config was correct. The issue was that Nginx tries to resolve hostnames at startup — but app is a Docker hostname that only exists inside a running Docker network.
The fix (using set $upstream as a variable) was not obvious. It is an Nginx-specific trick that delays DNS resolution from startup to per-request. Learning this saved me from a broken validation check.
2. Capabilities Matter More Than You Think
Running containers as non-root is well-known advice. But dropping capabilities (like cap_drop: ALL) is less discussed. When I dropped all capabilities from Nginx, it crashed on startup with:
chown("/var/cache/nginx/client_temp", 101) failed (1: Operation not permitted)
Nginx needs CHOWN to set up its temp directories. Dropping ALL capabilities silently breaks things. The lesson: drop specific capabilities you know the container does not need, not everything at once.
3. OPA’s sprintf Handles Integers and Floats Differently
When my Rego policy tried to format an integer from data.json using %.0f, OPA produced %!f(int=500) — a formatting error. The fix was to use %d for integers from data.json and %.0f only for float inputs. OPA does not silently convert types the way Python does.
4. Cumulative Metrics Need Careful Handling
Prometheus metrics are cumulative counters — they only go up. To calculate “what happened in the last 30 seconds,” you cannot just look at the current value. You need two snapshots and a delta. This is obvious in retrospect, but took a few failed attempts to get right.
5. The Manifest as the Single Source of Truth Actually Works
The grader requirement was: delete generated files, run ./swiftdeploy init, and verify they regenerate correctly. This forced a discipline that turned out to be genuinely useful. When you know everything comes from one file, debugging is much easier. There is only one place to look.
Complete Setup Instructions
# Clone the repo
git clone https://github.com/YOUR_USERNAME/swiftdeploy-automation.git
cd swiftdeploy-automation
# Install Python dependency
pip install pyyaml
# Build the Docker image
docker build -t swift-deploy-1-node:latest .
# Make CLI executable
chmod +x swiftdeploy
# Deploy
./swiftdeploy deploy
# Try everything
curl http://localhost:8080/
curl http://localhost:8080/healthz
curl http://localhost:8080/metrics
./swiftdeploy status
./swiftdeploy promote canary
./swiftdeploy promote stable
./swiftdeploy audit
./swiftdeploy teardown --clean
Conclusion
SwiftDeploy taught me that the most valuable thing in infrastructure is not the tools themselves — it is the discipline they enforce.
Making the manifest the single source of truth means you cannot have config drift. Putting all policy logic in OPA means your deployment tool cannot make silent decisions. Requiring metrics before promotion means you cannot promote a broken service by accident.
Each of these constraints feels like a restriction at first. But each one also prevents a whole category of mistakes.
If you want to replicate this project, the full source code is at the GitHub link below. Every file is commented and explained. Start with manifest.yaml, read swiftdeploy from top to bottom, then look at the Rego policies.
The best way to learn is to break it intentionally — fill up the disk, inject chaos, watch the policies fire. That is what the chaos endpoint is for.
Source code: github.com/devops-timi/swiftdeploy-automation