Zero-Downtime Migration: Heroku to AWS While Processing $14M/Month
A 14-week strangler-fig migration off Heroku with zero downtime: -35% infra cost, -93% deploy time, +800% deploy frequency.
The client processes roughly $14M in monthly transaction volume across 180,000+ transactions for approximately 900 SMB merchants in the US and Canada. The entire platform ran as a monolithic Ruby on Rails application on Heroku — and the architecture had become the primary constraint on the business.
The pain was quantifiable. Every deploy was a 38-minute full restart, allowed only twice a week inside a maintenance window; one failed deployment had caused a 47-minute outage during peak hours and roughly $62,000 in failed transactions. Heroku couldn't scale components independently, so month-end reporting spikes forced the whole application to scale — costs had hit $14,200/month and were climbing 8–12% monthly. And a single bug anywhere could take down everything: a formatting error in reporting once crashed the payment queue for 23 minutes, costing two enterprise prospects.
The CTO's constraint was absolute: no migration window, no merchant-visible disruption. The migration had to happen underneath live payment traffic.
A $6,000 infrastructure assessment came first: review of the ~85,000-line Rails codebase, mapping of the 47-table schema (12 tables with >1M rows), and request tracing that showed reporting queries consuming 60% of database CPU at month-end while the payment path itself ran at p99 340ms. We recommended the strangler fig pattern — extract services one at a time, run old and new in parallel, shift traffic gradually — with extraction order ranked by risk: notifications, reporting, onboarding, webhooks, and the payment core last.
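To make the pattern concrete, here is a minimal sketch of a strangler-fig routing facade in Python. It is an illustration under stated assumptions, not the client's actual router: the backend names, the path prefixes, and the `build_routes` and `route` helpers are all hypothetical. Only the extraction order comes from the assessment.

```python
# Hypothetical strangler-fig facade. Backend names and path prefixes are
# illustrative assumptions; only the extraction order comes from the assessment.
LEGACY = "heroku-monolith"

# Lowest-risk service first, payment core last.
EXTRACTION_ORDER = ["notifications", "reporting", "onboarding", "webhooks", "payments"]


def build_routes(extracted_count: int) -> dict[str, str]:
    """Map each path prefix to the backend that currently owns it."""
    routes = {}
    for index, name in enumerate(EXTRACTION_ORDER):
        owner = f"aws-{name}" if index < extracted_count else LEGACY
        routes[f"/{name}"] = owner
    return routes


def route(path: str, routes: dict[str, str]) -> str:
    """Return the backend for a request path; unknown paths stay on the monolith."""
    for prefix, backend in routes.items():
        if path == prefix or path.startswith(prefix + "/"):
            return backend
    return LEGACY


# After two extractions, reporting is served from AWS and payments is untouched.
routes = build_routes(extracted_count=2)
print(route("/reporting/monthly", routes))
print(route("/payments/charge", routes))
```

Because anything the facade does not recognize falls through to the monolith, each extraction changes a single routing entry and can be reverted the same way.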
Weeks 3–4 built the foundation before any extraction: every AWS resource in Terraform, GitHub Actions CI/CD with security scanning and blue-green deploys to EKS, and a Datadog observability stack deployed up front to catch discrepancies between old and new systems within seconds.
Low-risk extractions ran weeks 5–8. The notification service ran in parallel for 5 days with a 0.00% discrepancy rate before cutover. Moving reporting to a read replica dropped primary database CPU from 78% to 31% and improved payment p99 latency from 340ms to 180ms as a side effect. The payment core migrated last with maximum caution: 48 hours of shadow traffic across 12,000 transactions at 0.00% discrepancy, then a progressive shift from 1% to 100% of traffic over 10 days with a tested rollback plan at every step. Week 14 closed with runbooks, knowledge transfer, and a 3-hour failure-injection game day.
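The two safeguards on the payment path, comparing shadow results against the legacy system and shifting live traffic in small steps, can be sketched in Python as follows. This is a simplified illustration, not the production tooling: keying the rollout on a merchant ID, the SHA-256 bucketing scheme, and the function names are assumptions introduced here.

```python
import hashlib


def discrepancy_rate(legacy: dict[str, str], shadow: dict[str, str]) -> float:
    """Fraction of legacy transactions whose shadow result is missing or different."""
    if not legacy:
        return 0.0
    mismatches = sum(
        1 for txn_id, expected in legacy.items() if shadow.get(txn_id) != expected
    )
    return mismatches / len(legacy)


def bucket(merchant_id: str) -> float:
    """Stable value in [0, 100) derived from the merchant ID (assumed routing key)."""
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 100


def use_new_system(merchant_id: str, rollout_percent: float) -> bool:
    """True if this merchant falls inside the current rollout percentage."""
    return bucket(merchant_id) < rollout_percent


# Shadow phase: both systems process the same transactions, results are compared.
legacy_results = {"txn-1": "captured", "txn-2": "declined"}
shadow_results = {"txn-1": "captured", "txn-2": "declined"}
print(f"discrepancy: {discrepancy_rate(legacy_results, shadow_results):.2%}")
```

Hashing a stable key keeps each merchant on one side of the split for a given percentage, and raising the percentage only ever adds merchants to the new system. Rolling back is a matter of lowering the number.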
Measured over the first 90 days post-migration versus the 90 days prior: deployment time fell from 38 minutes to 2.5 minutes per service (-93%), deployment frequency rose from 2×/week to 18×/week (+800%), monthly infrastructure cost dropped from $14,200 to $9,230 (-35%), payment p99 latency improved from 340ms to 145ms (-57%), and month-end reporting queries went from 12–45 seconds to 2–8 seconds (-82%). Unplanned downtime: 70 minutes before, zero after. Downtime during the migration itself: zero.
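The percentages above follow from the before and after figures. The short Python check below reproduces them. The reporting figure is given as a range, so the sketch compares range midpoints. That choice is an assumption made here, since the case study does not state how the -82% was derived.

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change from before to after (negative means a reduction)."""
    return (after - before) / before * 100


metrics = {
    "deploy time (min)": (38, 2.5),
    "deploys per week": (2, 18),
    "monthly infra cost ($)": (14_200, 9_230),
    "payment p99 (ms)": (340, 145),
    # Assumption: compare midpoints of the 12-45 s and 2-8 s ranges.
    "month-end report (s, midpoint)": ((12 + 45) / 2, (2 + 8) / 2),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {pct_change(before, after):+.0f}%")
```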
At the company's 15% quarterly volume growth, Heroku costs would have reached $22,000/month by year end, while the AWS architecture projects to $11,500/month at the same volume. That gap is worth $120,000+ a year and widens as volume grows. Payback on the $74,000 migration is 14.8 months at the current infrastructure savings of roughly $5,000/month alone, before counting the eliminated $62K-per-incident downtime risk.
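The payback arithmetic can be checked with a few lines of Python. The sketch uses only figures quoted in this case study. The function names are introduced here for illustration, and treating monthly savings as flat over the payback period is a simplifying assumption.

```python
MIGRATION_COST = 74_000          # migration cost in USD, from the case study
HEROKU_MONTHLY = 14_200          # pre-migration infrastructure cost
AWS_MONTHLY = 9_230              # post-migration infrastructure cost


def payback_months(cost: float, monthly_saving: float) -> float:
    """Months until cumulative savings cover the cost, assuming flat savings."""
    return cost / monthly_saving


def projected_annual_savings(old_monthly: float, new_monthly: float) -> float:
    """Annualized gap between two projected monthly costs."""
    return (old_monthly - new_monthly) * 12


measured_saving = HEROKU_MONTHLY - AWS_MONTHLY  # 4,970 per month
print(f"payback at measured saving: {payback_months(MIGRATION_COST, measured_saving):.1f} months")
print(f"payback at $5,000/month:    {payback_months(MIGRATION_COST, 5_000):.1f} months")
print(f"projected annual savings:   ${projected_annual_savings(22_000, 11_500):,.0f}")
```

The quoted 14.8 months corresponds to savings rounded to $5,000/month. The measured $4,970/month gives about 14.9 months. The $22,000 versus $11,500 projection works out to $126,000 a year, consistent with the $120,000+ figure.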
The client kept a Gigabit DevOps engineer on an ongoing SRE retainer at $5,500/month, with multi-region deployment, a real-time fraud detection pipeline, and SOC 2 Type II infrastructure on the current roadmap.
"Three firms pitched us on this migration. Two proposed a 12-month 'lift and shift' with a migration weekend. Gigabit proposed a 14-week strangler fig approach with zero-downtime guarantees. They delivered exactly what they promised: not a single merchant noticed the migration happened. Our deployment speed went from twice a week to multiple times a day. That alone has changed how fast we can ship product."


