Zero-Downtime Migration: Heroku to AWS While Processing $14M/Month

Snapshot

The Challenge

Our client operates a payment processing platform serving approximately 900 SMB merchants across the US and Canada. The platform handles payment acceptance, merchant onboarding, settlement, and reporting — processing roughly $14M in monthly transaction volume across 180,000+ individual transactions.

The entire platform ran on a monolithic Ruby on Rails application deployed on Heroku. This architecture had served the company well through product-market fit and early growth, but it had become the primary constraint on the business. 

The pain was specific and quantifiable:

Deployment bottleneck

Every deployment was a full application restart affecting all merchants simultaneously. The deploy process took 38 minutes, required a maintenance window, and could only happen twice per week (Tuesday and Thursday mornings). One failed deployment in the prior quarter had caused a 47-minute outage during peak processing hours, resulting in approximately $62,000 in failed transactions and three merchant escalations.

Scaling wall

Heroku’s dyno-based architecture couldn’t scale individual components independently. When the reporting module spiked during month-end reconciliation, the entire application needed to scale up — including the payment processing, onboarding, and notification systems that had normal load. Monthly Heroku costs had reached $14,200 and were climbing 8–12% per month as transaction volume grew.

Blast radius

A single bug anywhere in the monolith could take down the entire platform. Three months before our engagement, a formatting error in the reporting module triggered an unhandled exception that crashed the payment processing queue. Merchants couldn’t accept payments for 23 minutes. The incident cost the company two enterprise prospects who were in late-stage evaluation.

The CTO’s constraint was absolute: “We process live payments. We cannot have a migration window. We cannot have merchants notice. Whatever we do has to happen underneath them while they continue to accept payments.” 

Our Approach

Infrastructure Assessment ($6,000)

We audited the existing architecture before proposing a migration plan.

Application analysis

We reviewed the Rails codebase (approximately 85,000 lines), mapped the database schema (47 tables, 12 with >1M rows), identified service boundaries, and documented every external integration (Stripe Connect for payment processing, Plaid for bank verification, SendGrid for notifications, Twilio for SMS, and a custom settlement engine).

Performance profiling

We instrumented the application with detailed request tracing. Key findings: the payment processing path was actually fast (p99: 340ms) but the reporting queries were consuming 60% of database CPU during month-end. The monolith’s shared database meant reporting load directly impacted payment processing latency.
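For readers less familiar with tail-latency metrics, here is a minimal sketch of how a p99 figure can be computed from raw request timings. It uses the nearest-rank method and made-up sample data; the actual tracing tool may use a different estimator.

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # 1-based rank, rounded up using integer arithmetic
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones
latencies = [120] * 98 + [340, 900]
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```

With this sample, the median stays at 120ms while the p99 picks up the slow tail, which is why the tail figure is the one worth watching on a payment path.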

Cost analysis

Heroku spend stood at $14,200/month. We modeled the equivalent workload on AWS using EKS with Fargate and projected $9,200/month at current volume, a saving of about $5,000/month (roughly 35%), with costs scaling more efficiently as transaction volume grows.

Migration strategy

We recommended the strangler fig pattern: extracting services from the monolith one at a time, running old and new in parallel, and gradually shifting traffic. This avoids the “big bang” risk of a full rewrite, because each extraction can be verified and rolled back on its own.

Extraction order (risk-optimized):

  1. Notifications (lowest risk — if a notification is delayed 30 seconds, no merchant impact)
  2. Reporting and analytics (read-only workload, no transaction dependency)
  3. Merchant onboarding (medium risk — not on the payment processing path)
  4. Webhook delivery (medium risk — partner integrations, but retryable)
  5. Payment processing core (highest risk — the money pipeline, extracted last with maximum preparation) 
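The core mechanic of the strangler fig pattern is a routing layer that sends extracted paths to new services and everything else to the monolith. The sketch below illustrates that idea only; the path prefixes and service names are hypothetical, not the client's actual routes.

```python
MONOLITH = "monolith"

# Hypothetical prefixes; an entry is added each time a service is extracted.
EXTRACTED_ROUTES = {
    "/notifications": "notification-service",
    "/reports": "reporting-service",
}

def route(path: str) -> str:
    """Return the backend that should handle this request path."""
    for prefix, backend in EXTRACTED_ROUTES.items():
        # Match the prefix itself or anything beneath it, not lookalikes
        if path == prefix or path.startswith(prefix + "/"):
            return backend
    return MONOLITH  # anything not yet extracted stays on the old system
```

Rolling back an extraction in this model means deleting one entry from the table, which is what keeps each step low-risk.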

Migration Execution

Phase 1 — Foundation (Week 3–4)

Before extracting any service, we built the infrastructure foundation:

Infrastructure as Code. Every AWS resource defined in Terraform — VPC, subnets, security groups, EKS cluster, RDS instances, SQS queues, S3 buckets, IAM roles. Everything version-controlled, peer-reviewed, and reproducible. Zero manual console clicks. This investment paid for itself immediately: when we needed to create an identical staging environment, it took one terraform apply command instead of a week of manual setup.

CI/CD pipeline. GitHub Actions workflows for each service: automated tests, container builds, security scanning (Trivy for container vulnerabilities, Snyk for dependency vulnerabilities), and blue-green deployment to EKS. Target: any engineer can deploy any service to production with a single merged PR. No maintenance windows required.

Observability stack. Datadog APM, infrastructure monitoring, and log aggregation deployed before any migration began. We needed to detect discrepancies between old and new systems within seconds, not hours. Custom dashboards tracked request latency by service, error rates, transaction processing volume, queue depths, database connection pools, and cost per transaction.

Phase 2 — Low-Risk Extractions (Week 5–8)

Notification service (Week 5–6). Extracted all email and SMS notification logic into a standalone Node.js microservice. Communication pattern: the monolith publishes events to an SQS queue; the notification service consumes events and dispatches through SendGrid/Twilio. We ran both systems in parallel for 5 days, comparing outputs. Discrepancy rate: 0.00%. Cutover: monolith stopped sending notifications directly; queue-based delivery became the sole path.
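The publish-and-consume pattern described above can be sketched as follows. This is an illustration, not the production service (which was written in Node.js): an in-memory queue stands in for SQS, and the event fields and sender callables are assumptions. Because SQS standard queues deliver at least once, the consumer skips event IDs it has already handled.

```python
import json
import queue

events = queue.Queue()  # in-memory stand-in for the SQS queue

def publish(event_id, channel, recipient, template):
    """Monolith side: publish an event instead of sending directly."""
    events.put(json.dumps({
        "id": event_id, "channel": channel,
        "recipient": recipient, "template": template,
    }))

def consume(senders, seen_ids):
    """Service side: drain the queue, skipping duplicate deliveries."""
    sent = []
    while not events.empty():
        event = json.loads(events.get())
        if event["id"] in seen_ids:
            continue  # at-least-once queues can redeliver the same event
        seen_ids.add(event["id"])
        senders[event["channel"]](event["recipient"], event["template"])
        sent.append(event["id"])
    return sent

# Hypothetical senders that record calls instead of hitting SendGrid/Twilio
outbox = []
senders = {
    "email": lambda to, template: outbox.append(("email", to, template)),
    "sms": lambda to, template: outbox.append(("sms", to, template)),
}
publish("evt-1", "email", "owner@example.com", "settlement_ready")
publish("evt-1", "email", "owner@example.com", "settlement_ready")  # duplicate
publish("evt-2", "sms", "+15555550100", "payout_sent")
sent = consume(senders, set())
```

Recording what each path would have sent, as the stub senders do here, is also one plausible way to compare old and new outputs during a parallel run.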

Reporting service (Week 6–7). This was the strategic extraction — moving the heaviest database workload off the shared database. We created a read replica of the production PostgreSQL database on a dedicated RDS instance, built the reporting service to query the replica instead of the primary, and migrated all reporting API endpoints. Result: primary database CPU utilization dropped from 78% (month-end peak) to 31%. Payment processing p99 latency improved from 340ms to 180ms as a side effect — the database was no longer contending with reporting queries.

Merchant onboarding (Week 7–8). Extracted the onboarding workflow (merchant application, KYC verification via Plaid, Stripe Connect account creation, welcome email sequence) into a dedicated service. This was the first extraction that touched the write path: new merchants were now created in the new system. We implemented dual-write during the transition. The new service wrote to both the new database and the legacy database, so the monolith kept access to merchant data until all dependent services were migrated.
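A dual-write can be sketched roughly as below. This is a simplified illustration under stated assumptions: two in-memory SQLite databases stand in for the new and legacy PostgreSQL databases, the table layout is invented, and a failed legacy write is recorded for later repair rather than failing the request. The case study does not say how the team handled that failure case.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a hypothetical merchants table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE merchants (id TEXT PRIMARY KEY, name TEXT)")
    return db

new_db, legacy_db = make_db(), make_db()
pending_repairs = []  # merchant ids whose legacy write failed

def create_merchant(merchant_id, name):
    """Write to the new database first, then mirror to the legacy one."""
    with new_db:  # commits on success, rolls back on error
        new_db.execute(
            "INSERT INTO merchants VALUES (?, ?)", (merchant_id, name))
    try:
        with legacy_db:
            legacy_db.execute(
                "INSERT INTO merchants VALUES (?, ?)", (merchant_id, name))
    except sqlite3.Error:
        # The new system is the source of truth; queue a reconciliation
        pending_repairs.append(merchant_id)

create_merchant("m_1", "Example Bakery")
```

The two writes are not atomic, so any real dual-write needs some reconciliation step like the repair list above to catch drift between the databases.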

Phase 3 — High-Risk Extractions (Week 9–12)

Webhook delivery (Week 9–10). Partner integrations depended on reliable webhook delivery for settlement notifications, dispute alerts, and transaction events. We built an isolated webhook service with guaranteed delivery: events queued in SQS with a dead-letter queue for failures, escalating backoff retries (at 1min, 5min, 30min, 2hr, 12hr), and a dashboard showing delivery status per partner endpoint. Improvement over the monolith: the old system had no retry logic, so a single failed webhook was silently lost. Partners had complained about missing notifications for months.
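The retry schedule above translates into a small decision function. This sketch assumes one initial attempt followed by the five listed retries, after which the event moves to the dead-letter queue; the function name and return shape are illustrative.

```python
# Delays from the schedule above: 1min, 5min, 30min, 2hr, 12hr
RETRY_DELAYS_SECONDS = [60, 300, 1800, 7200, 43200]

def next_action(failed_attempts: int):
    """Given how many delivery attempts have failed, decide what happens next.

    Returns ("retry", delay_seconds) or ("dead_letter", None).
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts <= len(RETRY_DELAYS_SECONDS):
        return ("retry", RETRY_DELAYS_SECONDS[failed_attempts - 1])
    return ("dead_letter", None)  # retries exhausted; park for inspection
```

For example, `next_action(1)` schedules a retry 60 seconds out, and `next_action(6)` sends the event to the dead-letter queue.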

Payment processing core (Week 10–12). The critical migration. We approached this with maximum caution:

Week 10: Built the new payment service and deployed alongside the monolith. No traffic routed to it yet.

Week 11: Shadow traffic. We duplicated every incoming payment request to both systems simultaneously. The monolith processed and returned the response to the merchant. The new service processed in parallel and logged its response. We then compared the results. After 48 hours and 12,000 shadowed transactions, the discrepancy rate was 0.00%.
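A shadow comparison of this kind boils down to diffing paired responses while ignoring fields that legitimately differ between systems. The sketch below is illustrative: the response fields and the ignore list are assumptions, and it does not cover how the shadow path avoids real side effects such as duplicate charges, which the case study does not describe.

```python
def compare(primary: dict, shadow: dict,
            ignore=("processed_at", "request_id")):
    """Return the set of fields whose values differ, ignoring volatile ones."""
    keys = (primary.keys() | shadow.keys()) - set(ignore)
    return {k for k in keys if primary.get(k) != shadow.get(k)}

def discrepancy_rate(pairs):
    """Fraction of (primary, shadow) response pairs with any differing field."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    mismatched = sum(1 for primary, shadow in pairs if compare(primary, shadow))
    return mismatched / len(pairs)

# Hypothetical responses from the monolith and the new payment service
matching = (
    {"status": "approved", "amount_cents": 4200, "processed_at": "t1"},
    {"status": "approved", "amount_cents": 4200, "processed_at": "t2"},
)
differing = (
    {"status": "approved", "amount_cents": 4200},
    {"status": "declined", "amount_cents": 4200},
)
rate = discrepancy_rate([matching, matching, matching, differing])
```

Here one of four pairs disagrees on `status`, so `rate` is 0.25; a run like the one described above would need every pair to match to report 0.00%.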

Week 12: Progressive traffic shift. Day 1: 1% of transactions routed to new service. Day 2: 5%. Day 3: 10%. Day 5: 25%. Day 7: 50%. Day 10: 100%. At each step, we monitored latency, error rates, and settlement accuracy. Rollback plan tested and ready at every stage.
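One common way to implement a percentage-based shift is stable hash bucketing, sketched below. The schedule comes from the text; the choice of routing key and the hashing scheme are assumptions, not a description of the actual router.

```python
import hashlib

# Day -> percent of transactions routed to the new service (from the text)
ROLLOUT_SCHEDULE = {1: 1, 2: 5, 3: 10, 5: 25, 7: 50, 10: 100}

def percent_for_day(day: int) -> int:
    """Most recent scheduled percentage at or before the given day."""
    reached = [pct for d, pct in ROLLOUT_SCHEDULE.items() if d <= day]
    return max(reached, default=0)

def use_new_service(routing_key: str, percent: int) -> bool:
    """Stable bucketing: the same key always lands in the same bucket."""
    digest = hashlib.sha256(routing_key.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return bucket < percent
```

Because a key's bucket never changes, any key routed to the new service at 10% stays there at 25% and beyond, and rolling back is a matter of lowering the percentage.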

Week 13: Monolith decommission. With all services extracted and running on EKS, we decommissioned the Heroku dynos. The Rails monolith, which had been reduced to a thin routing layer during the migration, was shut down completely. Heroku subscription cancelled.

Week 14: Hardening and handoff. Final week dedicated to documentation, runbooks, on-call playbooks, and knowledge transfer to the client’s engineering team. We conducted a 3-hour “game day” in which we intentionally injected failures (killed pods, saturated queues, database failover) and verified that monitoring, alerting, and recovery procedures worked correctly.

The Results

Measured over the first 90 days post-migration versus the 90 days prior:

Cost savings projection

At the current growth rate (15% quarterly transaction volume increase), the Heroku infrastructure would have reached $22,000/month by end of year. The AWS architecture is projected to reach $11,500/month at the same volume, for annual savings of $120,000+ that grow with volume.
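The savings figures follow directly from the monthly costs quoted in this case study. A quick check of the arithmetic, using only those numbers:

```python
# Monthly costs in USD, as stated in the case study
heroku_now, aws_now = 14_200, 9_200
heroku_end_of_year, aws_end_of_year = 22_000, 11_500

annual_savings_now = (heroku_now - aws_now) * 12
annual_savings_projected = (heroku_end_of_year - aws_end_of_year) * 12
```

At current volume the gap is $60,000 per year; at the projected end-of-year volume it is $126,000 per year, consistent with the "$120,000+" figure.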

Client Quote

“Three firms pitched us on this migration. Two proposed a 12-month ‘lift and shift’ with a migration weekend. Gigabit proposed a 14-week strangler fig approach with zero-downtime guarantees. They delivered exactly what they promised — not a single merchant noticed the migration happened. Our deployment speed went from twice a week to multiple times a day. That alone has changed how fast we can ship product.”

What’s Next

The client retained one Gigabit DevOps engineer on an ongoing SRE retainer ($5,500/month) for infrastructure monitoring, cost optimization, and capacity planning. Current projects include:

  • Multi-region deployment (US-East + US-West) for latency optimization and disaster recovery
  • Implementation of a real-time fraud detection pipeline using event-driven architecture on the new microservices platform
  • SOC 2 Type II compliance infrastructure (audit logging, access controls, evidence collection automation) 

Investment Summary

Annual infrastructure savings: $60,000+ (growing with volume)

Payback period on migration investment: 14.8 months on infrastructure savings alone — but the eliminated downtime risk ($62K single incident), 93% faster deployments, and 800% increase in deployment frequency are the real strategic value.
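Payback period is simply the investment divided by monthly savings. The sketch below works backward from the stated 14.8 months and the $5,000/month savings at current volume ($14,200 minus $9,200). The resulting investment figure is an inference from those two numbers, not a value stated in this text.

```python
def payback_months(investment: float, monthly_savings: float) -> float:
    """Months needed for cumulative savings to cover the investment."""
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    return investment / monthly_savings

monthly_savings = 14_200 - 9_200  # USD per month at current volume

# Inferred, not stated: the investment implied by a 14.8-month payback
implied_investment = round(14.8 * monthly_savings)
check = payback_months(implied_investment, monthly_savings)
```

On those assumptions the implied investment is about $74,000. Since the savings gap widens as volume grows, the realized payback would likely come somewhat sooner than a flat $5,000/month suggests.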

Running a critical production system on infrastructure that worries you?

We help fintech, SaaS, and e-commerce companies migrate to cloud-native architectures without downtime, without risk, and without the 12-month timeline other firms quote.
