Speaker: Fabiano Honorato, Michelle Koo, Stephen Brandon @ AWS FSI Meetup 2025 Q4
Introduction to Brex
- Financial operating system platform for managing expenses, travel, and credit
- Engineering manager and team members discuss leveraging Amazon Aurora for resiliency and international expansion
Brex services
- Corporate cards, expense management, travel, bill pay, and banking
- Aim to help clients spend wisely
Importance of preparing infrastructure for disaster scenarios
- Focus on the data layer: primarily PostgreSQL with PgBouncer, plus replicas for application and analytical purposes
- Merged smaller databases into a single database instance
- Past disaster recovery process was manual and time-consuming
Goals for disaster recovery solution
- Warm disaster recovery solution to decrease Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
- RTO: maximum time allowed to restore normal operations after a disaster
- RPO: maximum amount of data loss that can be tolerated, measured as a window of time
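The two objectives can be made concrete with a small check against a simulated incident. A minimal sketch (the target values below are hypothetical, not Brex's actual objectives, which the talk does not state):

```python
from datetime import datetime, timedelta

# Hypothetical targets -- the talk does not state Brex's actual RTO/RPO.
RTO = timedelta(minutes=5)   # max time to restore normal operations
RPO = timedelta(seconds=30)  # max window of data loss tolerable

def meets_objectives(outage_start: datetime,
                     service_restored: datetime,
                     last_replicated_write: datetime) -> dict:
    """Evaluate a (simulated) disaster against the recovery objectives."""
    recovery_time = service_restored - outage_start          # drives RTO
    data_loss_window = outage_start - last_replicated_write  # drives RPO
    return {
        "rto_met": recovery_time <= RTO,
        "rpo_met": data_loss_window <= RPO,
    }

t0 = datetime(2025, 1, 1, 12, 0, 0)
result = meets_objectives(
    outage_start=t0,
    service_restored=t0 + timedelta(minutes=3),      # recovered in 3 min
    last_replicated_write=t0 - timedelta(seconds=10) # replica 10 s behind
)
print(result)  # {'rto_met': True, 'rpo_met': True}
```

Measuring both quantities during game-day tests is what turns RTO/RPO from aspirations into verified numbers.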
Determining RPO and RTO
- Analyze metrics, assess current capabilities, and conduct extensive testing
- Understand how applications will handle additional latency and data loss
Choice of Amazon Aurora Global Database
- Provides the necessary features without significant changes to the current setup
- Allows use of a secondary region when needed
Current implementation caveats
- Created a custom DNS endpoint for reads, serving both application and analytical traffic
Migration challenges and approach
- Difficulty in migrating from PostgreSQL to Aurora due to potential application downtime
- Focus on automation to minimize manual handling
- Built a Temporal workflow that runs automated jobs to validate migration steps and prepare the environment
- Performed the switchover to Aurora Global Database only after the automation validated the database status
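The "validate before switching over" gate can be sketched as a polling loop that refuses to proceed until the database is demonstrably healthy. A minimal stand-in (the lag check is injected so the sketch runs without AWS; the real workflow would query Aurora replication metrics, and the thresholds here are illustrative):

```python
import time

def wait_until_healthy(check_lag_seconds, max_lag=1.0, timeout=60.0, poll=1.0):
    """Block until replica lag drops below max_lag, or raise on timeout.

    check_lag_seconds is injected so the gate can be tested without AWS;
    in the real workflow it would read Aurora replication metrics.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        lag = check_lag_seconds()
        if lag < max_lag:
            return lag  # safe to proceed with the switchover
        time.sleep(poll)
    raise TimeoutError("replica never caught up; aborting switchover")

# Simulated lag readings: the replica catches up on the third poll.
readings = iter([5.0, 2.3, 0.4])
final_lag = wait_until_healthy(lambda: next(readings), poll=0.0)
print(final_lag)  # 0.4
```

Gating the destructive step on an explicit health check is what lets the automation run unattended without risking a switchover onto a stale replica.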
Downtime management during migration
- AWS provides a small window of downtime (2-3 minutes) for the migration
- Utilized this window to adjust endpoints and the applications consuming the database
- Leveraged the short downtime period for a smooth transition
Using Temporal workflows for automation
Current state before migration
- Application connected to PgBouncer, which connected to the PostgreSQL instance and a replica instance
Migration process
- Created an Aurora read replica through AWS with zero downtime
- Workflow promoted the Aurora read replica and created an Aurora global cluster
- Application connected to PgBouncer, which then connected to the Aurora global cluster via the global writer endpoint
- Possibility to create another cluster for a multi-region setup
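The replica → promote → global-cluster sequence can be sketched as an orchestration function that takes the RDS client as a parameter. The method names below are simplified stand-ins for illustration, not the exact boto3 RDS API, and the client is stubbed so the sketch runs anywhere:

```python
def migrate_to_global(rds, db_id: str, region: str) -> list:
    """Run the replica -> promote -> global-cluster sequence.

    `rds` is any object exposing the three calls below: in production it
    would wrap a real AWS RDS client; here the method names and
    parameters are simplified for illustration. Returns the ordered
    steps performed, which is useful for auditing and dry runs.
    """
    steps = []
    # 1. Create an Aurora read replica of the existing PostgreSQL
    #    instance (zero downtime: the source keeps serving traffic).
    rds.create_aurora_read_replica(db_id, region)
    steps.append("create_replica")
    # 2. Promote the replica to a standalone Aurora cluster.
    rds.promote_read_replica(db_id)
    steps.append("promote")
    # 3. Wrap the promoted cluster in a global cluster so secondary
    #    regions can be attached later.
    rds.create_global_cluster(db_id)
    steps.append("create_global_cluster")
    return steps

class StubRDS:
    """Test double: accepts any call and does nothing."""
    def __getattr__(self, name):
        return lambda *a, **k: None

print(migrate_to_global(StubRDS(), "payments-db", "us-east-1"))
# ['create_replica', 'promote', 'create_global_cluster']
```

Injecting the client is what makes the same workflow testable in staging and replayable against dozens of databases.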
Flux:
- Tool for keeping Kubernetes clusters in sync with Git repositories
- Workflow generated Flux Git pull requests ahead of time
- Workflow automatically merged pull requests after manual verification
- Confirmation signal sent to the workflow to proceed with downtime and promote the Aurora global cluster
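The content of such a Flux pull request is essentially a rendered manifest pointing the service at the new endpoints. A minimal sketch (the ConfigMap shape and key names are hypothetical, not Brex's actual layout):

```python
# Hypothetical ConfigMap shape -- not Brex's actual manifest layout.
TEMPLATE = """\
apiVersion: v1
kind: ConfigMap
metadata:
  name: {service}-db-config
data:
  DATABASE_HOST: {writer_endpoint}
  DATABASE_READ_HOST: {reader_endpoint}
"""

def render_db_config(service, writer_endpoint, reader_endpoint):
    """Render the manifest change a Flux PR would carry."""
    return TEMPLATE.format(service=service,
                           writer_endpoint=writer_endpoint,
                           reader_endpoint=reader_endpoint)

patch = render_db_config(
    "payments",
    "payments.global-xyz.global.rds.amazonaws.com",      # global writer
    "payments-ro.cluster-xyz.us-east-1.rds.amazonaws.com",  # custom reader
)
print(patch)
```

Because Flux reconciles the cluster from Git, merging this PR is the moment the application actually repoints at Aurora, which is why the merge is held back until the downtime window.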
Automatically reviewing Flux pull requests using AI
- AI review identified errors or issues in pull requests and left comments for reviewers
Dry run migration flag
- Allowed testing the migration without causing destructive actions or downtime
- Created Flux Git pull requests ahead of time for review, but did not merge them or promote the cluster
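The dry-run flag boils down to partitioning steps into safe and destructive ones, and only planning the latter. A hedged sketch of that pattern (step names are illustrative):

```python
def run_migration(steps, dry_run=True):
    """Execute migration steps, skipping destructive ones in dry-run mode.

    Each step is (name, destructive, fn). In dry-run mode destructive
    steps are only planned, so PRs can be created and reviewed ahead of
    time without causing downtime.
    """
    executed, planned = [], []
    for name, destructive, fn in steps:
        if dry_run and destructive:
            planned.append(name)  # would run during the real window
        else:
            fn()
            executed.append(name)
    return executed, planned

steps = [
    ("open_flux_prs", False, lambda: None),          # safe: just opens PRs
    ("merge_flux_prs", True, lambda: None),          # destructive
    ("promote_global_cluster", True, lambda: None),  # destructive
]
print(run_migration(steps, dry_run=True))
# (['open_flux_prs'], ['merge_flux_prs', 'promote_global_cluster'])
```

Running the exact same step list with `dry_run=False` during the real window is what keeps the rehearsal faithful to the production run.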
Additional tools and processes used in the migration
- Terraform: created a template for managing databases as a new global cluster
- Added Terraform for each database after migration workflow completion
- Managed reader instances in the global cluster through Terraform
Internal command line tool:
- Added commands for teams to self-service switchovers or failovers of their Aurora global clusters
- Failover: used to recover from unplanned outages by switching to a different region when one region is down
- Switchover: used for controlled scenarios like operational maintenance or planned procedures, with no data loss
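The failover/switchover split maps naturally onto CLI subcommands. A hypothetical shape for such a tool (the name `dbctl` and all flags are assumptions; the talk does not describe the actual interface):

```python
import argparse

def build_parser():
    """Hypothetical internal CLI -- name and flags are illustrative only."""
    parser = argparse.ArgumentParser(prog="dbctl")
    sub = parser.add_subparsers(dest="command", required=True)

    fo = sub.add_parser(
        "failover",
        help="unplanned: recover by switching regions; "
             "un-replicated writes may be lost")
    fo.add_argument("cluster")
    fo.add_argument("--to-region", required=True)

    so = sub.add_parser(
        "switchover",
        help="planned: controlled move with no data loss "
             "(maintenance, planned procedures)")
    so.add_argument("cluster")
    so.add_argument("--to-region", required=True)
    return parser

args = build_parser().parse_args(
    ["switchover", "payments-db", "--to-region", "us-west-2"])
print(args.command, args.cluster, args.to_region)
# switchover payments-db us-west-2
```

Keeping the two operations as distinct commands makes the operator state their intent explicitly, which matters when one path can lose data and the other cannot.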
Iterative journey of improving workflow performance
- Initial workflow took around 15 minutes for end-to-end automation
- Included downtime for promoting the Aurora global cluster and creating Flux Git pull requests
- Sequential process with no parallelization
Addition of parallelization reduced workflow time to 10 minutes
- Updated workflow to perform steps in parallel, including fetching credentials and creating Flux pull requests ahead of time
- Introduced a dry-run flag for non-destructive testing of the migration
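The parallelization win comes from overlapping independent steps so total time approaches the slowest step rather than the sum. A minimal sketch with stand-in delays (the 0.2 s sleeps are placeholders for real credential fetches and PR creation):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_credentials():
    time.sleep(0.2)  # stand-in for a real secrets call
    return "creds"

def open_flux_prs():
    time.sleep(0.2)  # stand-in for Git/PR creation
    return "prs"

# Sequential: total time is the sum of both steps.
start = time.monotonic()
seq = [fetch_credentials(), open_flux_prs()]
sequential = time.monotonic() - start

# Parallel: independent steps overlap; total ~= the slowest step.
start = time.monotonic()
with ThreadPoolExecutor() as pool:
    par = [f.result() for f in (pool.submit(fetch_credentials),
                                pool.submit(open_flux_prs))]
parallel = time.monotonic() - start

assert seq == par  # same results, less wall-clock time
print(f"sequential={sequential:.1f}s parallel={parallel:.1f}s")
```

The same reasoning applied across the whole workflow is what took the end-to-end time from roughly 15 minutes down to 10.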
Final restructuring of workflow achieved 3-minute performance time
- Created Flux pull requests ahead of time, allowing the workflow to pause until the downtime window
- Reduced Git add operations to minimize costs
- Added a signal command for controlled initiation of downtime
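The pause-then-signal pattern can be sketched with a plain `threading.Event` standing in for a Temporal signal: all safe preparation happens up front, then the workflow parks until an operator explicitly starts the downtime window. (Sketch only; the real implementation would use the Temporal SDK's signal handlers.)

```python
import threading

def workflow(run_downtime_steps, go_signal, timeout=5.0):
    """Prepare everything, then pause until an operator signal arrives.

    Stand-in for a Temporal workflow blocked on a signal: it waits with
    zero production impact until someone confirms the downtime window.
    """
    prepared = "flux_prs_created"  # all non-destructive prep done up front
    if not go_signal.wait(timeout):
        raise TimeoutError("operator never confirmed downtime")
    return prepared, run_downtime_steps()

go = threading.Event()
result = {}
t = threading.Thread(
    target=lambda: result.update(out=workflow(lambda: "promoted", go)))
t.start()
go.set()   # operator sends the signal: begin the downtime window
t.join()
print(result["out"])  # ('flux_prs_created', 'promoted')
```

Decoupling preparation from execution this way is what allowed the 3-minute downtime budget to cover only the genuinely destructive steps.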
Lessons learned during the process
- Thorough testing and deliberate deployment are crucial before a formal migration
- Start with the staging environment, resolve issues, then proceed to production
- Automation reduces human error and enables easy replication across multiple databases
- Simulate the migration using the dry-run option to test the workflow without causing downtime (a dry run is a rehearsal that verifies the process works without risking severe failure)
- Improve iteratively by migrating a few databases each week