We Cut AWS Onboarding from 7 Days to 1 Hour with Terragrunt (Here’s How)

We re-architected our AWS infrastructure from manual provisioning → Terraform → Terragrunt.

The result: New environment onboarding dropped from ~7 days to ~1 hour.

But the path there wasn’t obvious. We hit real problems that the Terragrunt docs don’t warn you about.

The Challenge: Multi-Account AWS with Real DR

Our constraints:

  • Each client runs in their own AWS account
  • Dev/UAT/Prod/DR all in the same account (different regions)
  • 2-person platform team (optimize for simplicity, not clever automation)
  • Actual RTO/RPO targets we had to meet (RTO: 1 hour, RPO: under 10 seconds)

Manual infrastructure didn’t scale. Terraform helped, but every new environment meant copying hundreds of lines of config.

The Dependency Cycle That Almost Broke Us

Here’s a problem the docs don’t prepare you for:

S3 Cross-Region Replication creates a circular dependency:

  • Source + destination buckets must exist
  • IAM policies reference both buckets
  • Replication rules reference IAM roles
  • Everything references everything 😅

Terraform’s dependency graph just… chokes.

We had to split resources into “physical” and “logical” layers. I’ll show you the exact pattern in the full article.
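
As a rough sketch of the idea (module paths and output names below are hypothetical, not our exact code): layer one creates the buckets, layer two wires up IAM and replication once both exist.

# s3-buckets/terragrunt.hcl — the "physical" layer: source + destination buckets (with versioning), nothing else
terraform {
  source = "../../modules//s3-buckets"        # hypothetical module path
}

# s3-replication/terragrunt.hcl — the "logical" layer: IAM role + replication rules, applied after both buckets exist
terraform {
  source = "../../modules//s3-replication"    # hypothetical module path
}

dependency "buckets" {
  config_path = "../s3-buckets"
}

inputs = {
  # hypothetical output names from the buckets module
  source_bucket_arn      = dependency.buckets.outputs.source_bucket_arn
  destination_bucket_arn = dependency.buckets.outputs.destination_bucket_arn
}

The cycle disappears because the bucket layer knows nothing about replication, and the replication layer only sees bucket ARNs as plain inputs.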

The “80 GB Cache” Problem

Terragrunt’s .terragrunt-cache grew to 75-80 GB in our CI/CD pipeline and crashed build agents.

The fix wasn’t obvious, and it’s not in the troubleshooting docs.
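
One common mitigation, sketched below (not necessarily the exact fix we landed on), is to point Terragrunt's cache at an ephemeral path outside the repo checkout via download_dir in the root terragrunt.hcl, so the build agent wipes it along with the workspace instead of letting it pile up:

# Sketch only: AGENT_TEMP_DIR is a hypothetical CI variable for the agent's ephemeral temp directory.
download_dir = "${get_env("AGENT_TEMP_DIR", "/tmp")}/terragrunt-cache"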

What Actually Worked

After two weeks of building this (with AI-powered tools to accelerate the work), here’s what we landed on:

1. Repository Structure That Scales

Infra/
  terragrunt.hcl                # Root config (DRY logic)
  <account>/
    <env>/
      <region>/
        <layer>/
          <module>/
            terragrunt.hcl      # Leaf config

The folder structure IS the documentation. If you can navigate the filesystem, you understand the infrastructure.
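
Here’s a minimal sketch of the root inheritance that makes this work (folder names, array indices, and module paths are illustrative placeholders):

# Infra/terragrunt.hcl (root) — shared logic every leaf inherits
locals {
  # Derive environment and region from where the leaf sits in the tree,
  # so the folder structure itself drives configuration.
  path_parts  = split("/", path_relative_to_include())
  environment = local.path_parts[1]   # e.g. "uat"       (index assumes the <account>/<env>/<region>/... layout above)
  region      = local.path_parts[2]   # e.g. "eu-west-1"
}

inputs = {
  environment = local.environment
  region      = local.region
}

# <module>/terragrunt.hcl (leaf) — only what is unique to this stack
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../../..//modules/eks"   # hypothetical module path
}

A new environment then becomes mostly new folders plus a handful of tiny leaf files.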

2. Pilot-Light DR Strategy

Keep data warm (always replicating), keep compute cold (spin up on failover).

  • Always on: Aurora Global DB, S3 CRR, networking
  • Provisioned only on failover: EKS, OpenSearch, Redis

This balances cost with our 1-hour RTO target.
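
In Terraform terms, the pilot light can be expressed as a per-region flag that gates only the compute modules; the sketch below uses hypothetical variable and module names:

variable "provision_compute" {
  description = "Kept false in the DR region until failover; data layers replicate regardless."
  type        = bool
  default     = true
}

# Cold in DR: compute is only created when the flag is flipped during failover.
module "eks" {
  source = "./modules/eks"               # hypothetical module path
  count  = var.provision_compute ? 1 : 0
}

# Warm everywhere: data and networking are never gated by the flag.
module "aurora_global" {
  source = "./modules/aurora-global"     # hypothetical module path
}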

3. Skip DynamoDB Locking (For Now)

Controversial take: For a 2-person team, human coordination beats complex locking.

We enforced “one set of hands on infra” via agreement. If you’re scaling, revisit this.
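
In the backend config, that simply means leaving the lock table out (bucket name and region below are placeholders):

remote_state {
  backend = "s3"
  config = {
    bucket  = "example-terraform-state"   # hypothetical bucket name
    key     = "${path_relative_to_include()}/terraform.tfstate"
    region  = "eu-west-1"                 # hypothetical region
    encrypt = true
    # dynamodb_table = "terraform-locks"  # deliberately omitted while "one set of hands" holds
  }
}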

The Lessons I Wish I Knew Earlier

Don’t over-modularize too early. Use Terraform first to stop variance, then Terragrunt to stop duplication.

Don’t ignore circular dependencies in DR. S3 CRR and Aurora Global DB will break your dependency graph. You need layers.

Do test your DR regularly. We ran drills every 6 months. Untested DR is just a hypothesis.

Do design for your team size. 2 people need clarity > automation complexity.

The Full Architecture Breakdown

I wrote a detailed deep dive covering:

  • The exact Terragrunt folder structure we use
  • How root inheritance keeps configs DRY (with code examples)
  • The dependency cycle solutions for S3 CRR and Aurora Global DB
  • Multi-region DR architecture (what stays on, what spins up)
  • State management decisions (and why we skipped DynamoDB locking)
  • The CI/CD cache cleanup fix that saved us

👉 Read the full article here

Hit a Terragrunt challenge yourself? Drop it in the comments. I’ll share what worked (or didn’t) for us.

If this was useful, follow me for the next article: AWS/Azure cost optimization strategies – how discount percentages translate to actual runtime hours (spoiler: Azure licensing between D-series and B-series is a trap).
