Brex Database Disaster Recovery

Speakers: Fabiano Honorato, Michelle Koo, Stephen Brandon @ AWS FSI Meetup 2025 Q4

Introduction to Brex

  • Financial operating system platform for managing expenses, travel, credit.

  • Engineering manager and team members discuss leveraging Amazon Aurora for resiliency and international expansion

Brex services

  • Corporate cards, expense management, travel, bill pay, and banking

  • Aim to help clients spend wisely

Importance of preparing infrastructure for disaster scenarios

  • Focus on the data layer: primarily PostgreSQL behind PgBouncer, with replicas serving both application and analytical traffic

  • Merge smaller databases into a single database instance

  • Past disaster recovery process was manual and time-consuming

Goals for disaster recovery solution

  • Warm disaster recovery solution to decrease Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

  • RTO: the maximum time allowed to restore normal operations after a disaster

  • RPO: the maximum amount of data loss that can be tolerated, usually measured as a window of time
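To make the two targets concrete, here is a minimal sketch (not from the talk; the helper names are ours) expressing each as a time delta:

```python
from datetime import datetime, timedelta

def rto_met(disaster_at, restored_at, rto):
    """RTO: time until normal operations resume must not exceed the target."""
    return restored_at - disaster_at <= rto

def rpo_met(last_replicated_at, disaster_at, rpo):
    """RPO: data written after the last replicated point is lost;
    that lost window must not exceed the target."""
    return disaster_at - last_replicated_at <= rpo

t0 = datetime(2025, 1, 1, 12, 0)
# Recovered in 4 minutes against a 5-minute RTO target.
ok_rto = rto_met(t0, t0 + timedelta(minutes=4), timedelta(minutes=5))
# Replication was 30 seconds behind against a 1-minute RPO target.
ok_rpo = rpo_met(t0 - timedelta(seconds=30), t0, timedelta(minutes=1))
```

Framing both objectives as time deltas is what makes them measurable against replication-lag and recovery-drill metrics.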

Determining RPO and RTO

  • Analyze metrics, assess current capabilities, and conduct extensive testing

  • Understand how applications will handle additional latency and data loss

Choice of Amazon Aurora Global Database

  • Provides necessary features without significant changes to the current setup

  • Allows use of a secondary region when needed

Current implementation caveats

  • A custom DNS endpoint had to be created for readers, since the replicas serve both application and analytical traffic

Migration challenges and approach

  • Difficulty in migrating from PostgreSQL to Aurora due to potential application downtime

  • Focus on automation to minimize manual handling

  • Built a Temporal workflow that runs automated jobs to validate migration steps and prepare the environment

  • Performed the switchover to Aurora Global Database after the automation validated the database status
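The validation phase described above can be sketched in plain Python. Brex used Temporal for this; the stand-in below only shows the ordering guarantee (every check must pass before the switchover proceeds), and all step names are hypothetical:

```python
# Plain-Python stand-in for the Temporal migration workflow: each automated
# job validates one precondition, and the switchover runs only if all pass.

def validate_all(steps):
    results = []
    for name, check in steps:
        ok = check()
        results.append((name, ok))
        if not ok:
            return results, False  # stop before any switchover happens
    return results, True

steps = [
    ("replica lag is zero", lambda: True),
    ("database status is healthy", lambda: True),
    ("environment is prepared", lambda: True),
]
results, ready_to_switch = validate_all(steps)
```

In a real Temporal workflow each check would be an activity with retries, but the fail-fast ordering is the same.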

Downtime management during migration

  • The migration itself requires only a small downtime window (2-3 minutes)

  • Utilized this window to adjust endpoints and applications consuming the database

  • Leveraged the short downtime period for a smooth transition

Using Temporal workflows for automation

Current state before migration

  • Application connected to PgBouncer, which connected to the PostgreSQL primary instance and a replica instance

Migration process

  • Created an Aurora read replica through AWS with zero downtime

  • Workflow promoted Aurora read replica and created Aurora global cluster

  • Application connected to PgBouncer, which then connected to the Aurora global cluster using the global writer endpoint

  • Possibility to create another cluster for multi-region setup
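Assuming the standard boto3 RDS APIs for this path (`create_db_cluster` with a replication source, `promote_read_replica_db_cluster`, `create_global_cluster`), the steps above could be sketched as follows; all identifiers are hypothetical:

```python
# Hedged sketch of the migration's AWS calls; each helper builds the
# parameters for the corresponding boto3 RDS API. All names are hypothetical.

def replica_params(source_instance_arn, cluster_id):
    # Step 1: create an Aurora PostgreSQL cluster that replicates from the
    # existing RDS instance (zero downtime for the application).
    return {
        "DBClusterIdentifier": cluster_id,
        "Engine": "aurora-postgresql",
        "ReplicationSourceIdentifier": source_instance_arn,
    }

def global_cluster_params(cluster_id, global_id):
    # Step 3: after promote_read_replica_db_cluster, wrap the promoted
    # cluster in a global cluster so a secondary region can be attached.
    return {
        "GlobalClusterIdentifier": global_id,
        "SourceDBClusterIdentifier": cluster_id,
    }

# Usage (not executed here):
#   rds = boto3.client("rds")
#   rds.create_db_cluster(**replica_params(src_arn, "payments-aurora"))
#   rds.promote_read_replica_db_cluster(DBClusterIdentifier="payments-aurora")  # step 2
#   rds.create_global_cluster(**global_cluster_params("payments-aurora", "payments-global"))
```

Only the promotion in step 2 touches the downtime window; steps 1 and 3 can run while the application keeps writing to the original instance.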

Flux:

  • Tool for keeping Kubernetes clusters in sync with Git repositories

  • Workflow generated Flux git pull requests ahead of time

  • Workflow automatically merged pull requests after manual verification

  • Confirmation signal sent to workflow to proceed with downtime and promote Aurora global cluster

Automatically reviewing Flux pull requests using AI

  • Identified errors or issues in pull requests and provided comments for review

Dry run migration flag

  • Allowed testing of migration without causing destructive actions or downtime

  • Created Flux git pull requests ahead of time for review, but did not merge or promote cluster
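Assuming the dry-run flag simply gates the destructive steps, a minimal sketch (function and step names are ours):

```python
# Sketch of the dry-run gate (all step names are hypothetical): the
# non-destructive steps always run; the destructive ones are skipped
# unless dry_run is explicitly disabled.

def migrate(dry_run: bool) -> list:
    actions = ["validate database status", "create Flux pull requests"]
    if dry_run:
        actions.append("dry run complete: nothing merged or promoted")
    else:
        actions.append("merge Flux pull requests")
        actions.append("promote Aurora global cluster")
    return actions
```

Running with `dry_run=True` still exercises the validation path and produces reviewable pull requests, which is what makes it a safe rehearsal.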

Additional tools and processes used in the migration

  • Terraform: created a template for managing each migrated database as a new global cluster

  • Added Terraform configuration for each database after its migration workflow completed

  • Managed reader instances in the global cluster through Terraform
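As a hedged illustration, the Terraform AWS provider's `aws_rds_global_cluster` and `aws_rds_cluster_instance` resources could express this; all identifiers below are hypothetical:

```hcl
# Hypothetical sketch: the global cluster and a reader instance managed in Terraform.
resource "aws_rds_global_cluster" "payments" {
  global_cluster_identifier = "payments-global"
  engine                    = "aurora-postgresql"
}

resource "aws_rds_cluster_instance" "reader" {
  identifier         = "payments-reader-1"
  cluster_identifier = aws_rds_cluster.payments_primary.id # primary cluster defined elsewhere
  instance_class     = "db.r6g.large"
  engine             = "aurora-postgresql"
}
```

Templating this per database lets each migrated cluster land in state management automatically once its workflow finishes.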

Internal command line tool:

  • Added commands so teams can self-service a switchover or failover of their Aurora global clusters

  • Failover: used to recover from unplanned outages by switching to a different region if one region is down

  • Switchover: used for controlled scenarios like operational maintenance or planned procedures with no data loss
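The distinction maps onto two boto3 RDS operations, `switchover_global_cluster` and `failover_global_cluster`; here is a sketch of how a CLI might dispatch between them, with hypothetical cluster names in the usage comment:

```python
# Sketch of how a self-service CLI might choose between the two boto3 RDS
# operations; the mapping follows the definitions above.

def pick_operation(planned: bool) -> dict:
    if planned:
        # Switchover: controlled maintenance, replication fully synced, no data loss.
        return {"api": "switchover_global_cluster", "data_loss_possible": False}
    # Failover: unplanned regional outage; writes not yet replicated may be lost.
    return {"api": "failover_global_cluster", "data_loss_possible": True}

# Usage (not executed here):
#   rds = boto3.client("rds")
#   op = pick_operation(planned=True)
#   getattr(rds, op["api"])(
#       GlobalClusterIdentifier="payments-global",
#       TargetDbClusterIdentifier="payments-secondary",
#   )
```

Exposing the two paths as separate commands keeps the data-loss-possible case an explicit, deliberate choice.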

Iterative journey of improving workflow performance

  • Initial workflow took around 15 minutes for end-to-end automation

  • The downtime window included both promoting the Aurora global cluster and creating Flux Git pull requests

  • Sequential process with no parallelization

Addition of parallelization reduced workflow time to 10 minutes

  • Updated workflow to perform steps in parallel, including fetching credentials and creating Flux pull requests ahead of time

  • Introduced dry run flag for non-destructive testing of migration

Final restructuring of workflow achieved 3-minute performance time

  • Created Flux pull requests ahead of time, allowing workflow to pause until downtime window

  • Reduced Git operations to minimize overhead

  • Added signal command for controlled initiation of downtime
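The pause-until-signal shape can be illustrated with a plain-Python stand-in (Temporal models this with workflow signals; `threading.Event` substitutes here, and the log messages are ours):

```python
import threading

# Sketch: cheap preparation runs up front, then the workflow blocks until
# an operator signals that the downtime window may begin.
start_downtime = threading.Event()
log = []

def workflow():
    log.append("Flux pull requests created ahead of time")
    start_downtime.wait()  # pause until the signal arrives
    log.append("downtime: promoting Aurora global cluster")

worker = threading.Thread(target=workflow)
worker.start()
start_downtime.set()  # operator sends the signal to begin downtime
worker.join()
```

Moving everything possible ahead of the signal is what shrank the end-to-end time to roughly the unavoidable promotion window.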

Lessons learned during the process

  • Thorough testing and deliberate deployment are crucial before formal migration

  • Start with staging environment, resolve issues, and then proceed to production

  • Automation reduces human error and enables easy replication for multiple databases

  • Simulate the migration using the dry-run option to test the workflow without causing downtime (a dry run is a rehearsal that confirms the process works before the real execution)

  • Iterative improvement by migrating a few databases each week
