We Cut AWS Onboarding from 7 Days to 1 Hour with Terragrunt (Here’s How)

We re-architected our AWS infrastructure from manual provisioning → Terraform → Terragrunt.

The result: New environment onboarding dropped from ~7 days to ~1 hour.

But the path there wasn’t obvious. We hit real problems that the Terragrunt docs don’t warn you about.

The Challenge: Multi-Account AWS with Real DR

Our constraints:

  • Each client runs in their own AWS account
  • Dev/UAT/Prod/DR all in the same account (different regions)
  • 2-person platform team (optimize for simplicity, not clever automation)
  • Actual RTO/RPO targets we had to meet (RTO: 1 hour, RPO: under 10 seconds)

Manual infrastructure didn’t scale. Terraform helped, but every new environment meant copying hundreds of lines of config.

The Dependency Cycle That Almost Broke Us

Here’s a problem the docs don’t prepare you for:

S3 Cross-Region Replication creates a circular dependency:

  • Source + destination buckets must exist
  • IAM policies reference both buckets
  • Replication rules reference IAM roles
  • Everything references everything 😅

Terraform’s dependency graph just… chokes.

We had to split resources into “physical” and “logical” layers. I’ll show you the exact pattern in the full article.
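
As a rough sketch of the idea (module paths and output names below are hypothetical, not our exact code): layer one creates the buckets, layer two wires up IAM and replication once both exist.

# s3-buckets/terragrunt.hcl — the "physical" layer: source + destination buckets (with versioning), nothing else
terraform {
  source = "../../modules//s3-buckets"        # hypothetical module path
}

# s3-replication/terragrunt.hcl — the "logical" layer: IAM role + replication rules, applied after both buckets exist
terraform {
  source = "../../modules//s3-replication"    # hypothetical module path
}

dependency "buckets" {
  config_path = "../s3-buckets"
}

inputs = {
  # hypothetical output names from the buckets module
  source_bucket_arn      = dependency.buckets.outputs.source_bucket_arn
  destination_bucket_arn = dependency.buckets.outputs.destination_bucket_arn
}

The cycle disappears because the bucket layer knows nothing about replication, and the replication layer only sees bucket ARNs as plain inputs.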

The “80 GB Cache” Problem

Terragrunt’s .terragrunt-cache grew to 75-80 GB in our CI/CD pipeline and crashed build agents.

The fix wasn’t obvious, and it’s not in the troubleshooting docs.
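
One common mitigation, sketched below (not necessarily the exact fix we landed on), is to point Terragrunt's cache at an ephemeral path outside the repo checkout via download_dir in the root terragrunt.hcl, so the build agent wipes it along with the workspace instead of letting it pile up:

# Sketch only: AGENT_TEMP_DIR is a hypothetical CI variable for the agent's ephemeral temp directory.
download_dir = "${get_env("AGENT_TEMP_DIR", "/tmp")}/terragrunt-cache"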

What Actually Worked

After two weeks of building this (with AI-powered tools to accelerate the work), here’s what we landed on:

1. Repository Structure That Scales

Infra/
  terragrunt.hcl                # Root config (DRY logic)
  <account>/
    <env>/
      <region>/
        <layer>/
          <module>/
            terragrunt.hcl      # Leaf config

The folder structure IS the documentation. If you can navigate the filesystem, you understand the infrastructure.
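
Here’s a minimal sketch of the root inheritance that makes this work (folder names, array indices, and module paths are illustrative placeholders):

# Infra/terragrunt.hcl (root) — shared logic every leaf inherits
locals {
  # Derive environment and region from where the leaf sits in the tree,
  # so the folder structure itself drives configuration.
  path_parts  = split("/", path_relative_to_include())
  environment = local.path_parts[1]   # e.g. "uat"       (index assumes the <account>/<env>/<region>/... layout above)
  region      = local.path_parts[2]   # e.g. "eu-west-1"
}

inputs = {
  environment = local.environment
  region      = local.region
}

# <module>/terragrunt.hcl (leaf) — only what is unique to this stack
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../../..//modules/eks"   # hypothetical module path
}

A new environment then becomes mostly new folders plus a handful of tiny leaf files.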

2. Pilot-Light DR Strategy

Keep data warm (always replicating), keep compute cold (spin up on failover).

  • Always on: Aurora Global DB, S3 CRR, networking
  • Provisioned only on failover: EKS, OpenSearch, Redis

This balances cost with our 1-hour RTO target.
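
In Terraform terms, the pilot light can be expressed as a per-region flag that gates only the compute modules; the sketch below uses hypothetical variable and module names:

variable "provision_compute" {
  description = "Kept false in the DR region until failover; data layers replicate regardless."
  type        = bool
  default     = true
}

# Cold in DR: compute is only created when the flag is flipped during failover.
module "eks" {
  source = "./modules/eks"               # hypothetical module path
  count  = var.provision_compute ? 1 : 0
}

# Warm everywhere: data and networking are never gated by the flag.
module "aurora_global" {
  source = "./modules/aurora-global"     # hypothetical module path
}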

3. Skip DynamoDB Locking (For Now)

Controversial take: For a 2-person team, human coordination beats complex locking.

We enforced “one set of hands on infra” via agreement. If you’re scaling, revisit this.
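
In the backend config, that simply means leaving the lock table out (bucket name and region below are placeholders):

remote_state {
  backend = "s3"
  config = {
    bucket  = "example-terraform-state"   # hypothetical bucket name
    key     = "${path_relative_to_include()}/terraform.tfstate"
    region  = "eu-west-1"                 # hypothetical region
    encrypt = true
    # dynamodb_table = "terraform-locks"  # deliberately omitted while "one set of hands" holds
  }
}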

The Lessons I Wish I Knew Earlier

Don’t over-modularize too early. Use Terraform first to stop variance, then Terragrunt to stop duplication.

Don’t ignore circular dependencies in DR. S3 CRR and Aurora Global DB will break your dependency graph. You need layers.

Do test your DR regularly. We ran drills every 6 months. Untested DR is just a hypothesis.

Do design for your team size. 2 people need clarity > automation complexity.

The Full Architecture Breakdown

I wrote a detailed deep dive covering:

  • The exact Terragrunt folder structure we use
  • How root inheritance keeps configs DRY (with code examples)
  • The dependency cycle solutions for S3 CRR and Aurora Global DB
  • Multi-region DR architecture (what stays on, what spins up)
  • State management decisions (and why we skipped DynamoDB locking)
  • The CI/CD cache cleanup fix that saved us

👉 Read the full article here

Hit a Terragrunt challenge yourself? Drop it in the comments. I’ll share what worked (or didn’t) for us.

If this was useful, follow me for the next article: AWS/Azure cost optimization strategies – how discount percentages translate to actual runtime hours (spoiler: Azure licensing between D-series and B-series is a trap).
