Brex Database Disaster Recovery

Speakers: Fabiano Honorato, Michelle Koo, Stephen Brandon @ AWS FSI Meetup 2025 Q4

Introduction to Brex

  • Financial operating system platform for managing expenses, travel, credit.

  • Engineering manager and team members discuss leveraging Amazon Aurora for resiliency and international expansion

Brex services

  • Corporate cards, expense management, travel, bill pay, and banking

  • Aim to help clients spend wisely

Importance of preparing infrastructure for disaster scenarios

  • Focus on the data layer: primarily PostgreSQL behind PgBouncer, with replicas serving both application and analytical traffic

  • Merge smaller databases into a single database instance

  • Past disaster recovery process was manual and time-consuming

Goals for disaster recovery solution

  • Warm disaster recovery solution to decrease Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

  • RTO: the maximum time allowed to restore normal operations after a disaster

  • RPO: the maximum amount of data loss that can be tolerated, usually measured as a window of time
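To make the two targets concrete, here is a minimal sketch (not from the talk; the helper names are ours) expressing each as a time delta:

```python
from datetime import datetime, timedelta

def rto_met(disaster_at, restored_at, rto):
    """RTO: time until normal operations resume must not exceed the target."""
    return restored_at - disaster_at <= rto

def rpo_met(last_replicated_at, disaster_at, rpo):
    """RPO: data written after the last replicated point is lost;
    that lost window must not exceed the target."""
    return disaster_at - last_replicated_at <= rpo

t0 = datetime(2025, 1, 1, 12, 0)
# Recovered in 4 minutes against a 5-minute RTO target.
ok_rto = rto_met(t0, t0 + timedelta(minutes=4), timedelta(minutes=5))
# Replication was 30 seconds behind against a 1-minute RPO target.
ok_rpo = rpo_met(t0 - timedelta(seconds=30), t0, timedelta(minutes=1))
```

Framing both objectives as time deltas is what makes them measurable against replication-lag and recovery-drill metrics.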

Determining RPO and RTO

  • Analyze metrics, assess current capabilities, and conduct extensive testing

  • Understand how applications will handle additional latency and data loss

Choice of Amazon Aurora Global Database

  • Provides necessary features without significant changes to the current setup

  • Allows use of a secondary region when needed

Current implementation caveats

  • A custom DNS endpoint had to be created for readers, since the replicas serve both application and analytical traffic

Migration challenges and approach

  • Difficulty in migrating from PostgreSQL to Aurora due to potential application downtime

  • Focus on automation to minimize manual handling

  • Built a Temporal workflow that runs automated jobs to validate migration steps and prepare the environment

  • Performed the switchover to Aurora Global Database after the automation validated the database status
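The validation phase described above can be sketched in plain Python. Brex used Temporal for this; the stand-in below only shows the ordering guarantee (every check must pass before the switchover proceeds), and all step names are hypothetical:

```python
# Plain-Python stand-in for the Temporal migration workflow: each automated
# job validates one precondition, and the switchover runs only if all pass.

def validate_all(steps):
    results = []
    for name, check in steps:
        ok = check()
        results.append((name, ok))
        if not ok:
            return results, False  # stop before any switchover happens
    return results, True

steps = [
    ("replica lag is zero", lambda: True),
    ("database status is healthy", lambda: True),
    ("environment is prepared", lambda: True),
]
results, ready_to_switch = validate_all(steps)
```

In a real Temporal workflow each check would be an activity with retries, but the fail-fast ordering is the same.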

Downtime management during migration

  • The migration itself requires only a small downtime window (2-3 minutes)

  • Utilized this window to adjust endpoints and applications consuming the database

  • Leveraged the short downtime period for a smooth transition

Using Temporal workflows for automation

Current state before migration

  • Application connected to PgBouncer, which connected to the PostgreSQL primary instance and a replica instance

Migration process

  • Created an Aurora read replica through AWS with zero downtime

  • Workflow promoted Aurora read replica and created Aurora global cluster

  • Application connected to PgBouncer, which then connected to the Aurora global cluster using the global writer endpoint

  • Possibility to create another cluster for multi-region setup
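Assuming the standard boto3 RDS APIs for this path (`create_db_cluster` with a replication source, `promote_read_replica_db_cluster`, `create_global_cluster`), the steps above could be sketched as follows; all identifiers are hypothetical:

```python
# Hedged sketch of the migration's AWS calls; each helper builds the
# parameters for the corresponding boto3 RDS API. All names are hypothetical.

def replica_params(source_instance_arn, cluster_id):
    # Step 1: create an Aurora PostgreSQL cluster that replicates from the
    # existing RDS instance (zero downtime for the application).
    return {
        "DBClusterIdentifier": cluster_id,
        "Engine": "aurora-postgresql",
        "ReplicationSourceIdentifier": source_instance_arn,
    }

def global_cluster_params(cluster_id, global_id):
    # Step 3: after promote_read_replica_db_cluster, wrap the promoted
    # cluster in a global cluster so a secondary region can be attached.
    return {
        "GlobalClusterIdentifier": global_id,
        "SourceDBClusterIdentifier": cluster_id,
    }

# Usage (not executed here):
#   rds = boto3.client("rds")
#   rds.create_db_cluster(**replica_params(src_arn, "payments-aurora"))
#   rds.promote_read_replica_db_cluster(DBClusterIdentifier="payments-aurora")  # step 2
#   rds.create_global_cluster(**global_cluster_params("payments-aurora", "payments-global"))
```

Only the promotion in step 2 touches the downtime window; steps 1 and 3 can run while the application keeps writing to the original instance.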

Flux:

  • Tool for keeping Kubernetes clusters in sync with Git repositories

  • Workflow generated Flux git pull requests ahead of time

  • Workflow automatically merged pull requests after manual verification

  • Confirmation signal sent to workflow to proceed with downtime and promote Aurora global cluster

Automatically reviewing Flux pull requests using AI

  • Identified errors or issues in pull requests and provided comments for review

Dry run migration flag

  • Allowed testing of migration without causing destructive actions or downtime

  • Created Flux git pull requests ahead of time for review, but did not merge or promote cluster
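Assuming the dry-run flag simply gates the destructive steps, a minimal sketch (function and step names are ours):

```python
# Sketch of the dry-run gate (all step names are hypothetical): the
# non-destructive steps always run; the destructive ones are skipped
# unless dry_run is explicitly disabled.

def migrate(dry_run: bool) -> list:
    actions = ["validate database status", "create Flux pull requests"]
    if dry_run:
        actions.append("dry run complete: nothing merged or promoted")
    else:
        actions.append("merge Flux pull requests")
        actions.append("promote Aurora global cluster")
    return actions
```

Running with `dry_run=True` still exercises the validation path and produces reviewable pull requests, which is what makes it a safe rehearsal.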

Additional tools and processes used in the migration

  • Terraform: created a template for managing each migrated database as a new global cluster

  • Added Terraform configuration for each database after its migration workflow completed

  • Managed reader instances in the global cluster through Terraform
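As a hedged illustration, the Terraform AWS provider's `aws_rds_global_cluster` and `aws_rds_cluster_instance` resources could express this; all identifiers below are hypothetical:

```hcl
# Hypothetical sketch: the global cluster and a reader instance managed in Terraform.
resource "aws_rds_global_cluster" "payments" {
  global_cluster_identifier = "payments-global"
  engine                    = "aurora-postgresql"
}

resource "aws_rds_cluster_instance" "reader" {
  identifier         = "payments-reader-1"
  cluster_identifier = aws_rds_cluster.payments_primary.id # primary cluster defined elsewhere
  instance_class     = "db.r6g.large"
  engine             = "aurora-postgresql"
}
```

Templating this per database lets each migrated cluster land in state management automatically once its workflow finishes.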

Internal command line tool:

  • Added commands so teams can self-service a switchover or failover of their Aurora global clusters

  • Failover: used to recover from unplanned outages by switching to a different region if one region is down

  • Switchover: used for controlled scenarios like operational maintenance or planned procedures with no data loss
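The distinction maps onto two boto3 RDS operations, `switchover_global_cluster` and `failover_global_cluster`; here is a sketch of how a CLI might dispatch between them, with hypothetical cluster names in the usage comment:

```python
# Sketch of how a self-service CLI might choose between the two boto3 RDS
# operations; the mapping follows the definitions above.

def pick_operation(planned: bool) -> dict:
    if planned:
        # Switchover: controlled maintenance, replication fully synced, no data loss.
        return {"api": "switchover_global_cluster", "data_loss_possible": False}
    # Failover: unplanned regional outage; writes not yet replicated may be lost.
    return {"api": "failover_global_cluster", "data_loss_possible": True}

# Usage (not executed here):
#   rds = boto3.client("rds")
#   op = pick_operation(planned=True)
#   getattr(rds, op["api"])(
#       GlobalClusterIdentifier="payments-global",
#       TargetDbClusterIdentifier="payments-secondary",
#   )
```

Exposing the two paths as separate commands keeps the data-loss-possible case an explicit, deliberate choice.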

Iterative journey of improving workflow performance

  • Initial workflow took around 15 minutes for end-to-end automation

  • The downtime window included both promoting the Aurora global cluster and creating Flux Git pull requests

  • Sequential process with no parallelization

Addition of parallelization reduced workflow time to 10 minutes

  • Updated workflow to perform steps in parallel, including fetching credentials and creating Flux pull requests ahead of time

  • Introduced dry run flag for non-destructive testing of migration

Final restructuring of workflow achieved 3-minute performance time

  • Created Flux pull requests ahead of time, allowing workflow to pause until downtime window

  • Reduced Git operations to minimize overhead

  • Added signal command for controlled initiation of downtime
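The pause-until-signal shape can be illustrated with a plain-Python stand-in (Temporal models this with workflow signals; `threading.Event` substitutes here, and the log messages are ours):

```python
import threading

# Sketch: cheap preparation runs up front, then the workflow blocks until
# an operator signals that the downtime window may begin.
start_downtime = threading.Event()
log = []

def workflow():
    log.append("Flux pull requests created ahead of time")
    start_downtime.wait()  # pause until the signal arrives
    log.append("downtime: promoting Aurora global cluster")

worker = threading.Thread(target=workflow)
worker.start()
start_downtime.set()  # operator sends the signal to begin downtime
worker.join()
```

Moving everything possible ahead of the signal is what shrank the end-to-end time to roughly the unavoidable promotion window.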

Lessons learned during the process

  • Thorough testing and deliberate deployment are crucial before formal migration

  • Start with staging environment, resolve issues, and then proceed to production

  • Automation reduces human error and enables easy replication for multiple databases

  • Simulate the migration using the dry-run option to test the workflow without causing downtime (a dry run is a rehearsal that confirms the process works before the real execution)

  • Iterative improvement by migrating a few databases each week
