Cloud & GCP · 4 min read

Zero-Downtime Migration: What the Case Studies Don't Tell You

Every cloud migration case study ends with success. None of them tell you about the Tuesday at 2am when the database connection pool hit zero during peak trading.

“Zero downtime” is a promise that sounds clean in architecture diagrams and complicated in the middle of a live migration. Here’s what actually happened when we moved Thailand’s largest retail platform off on-premise Adobe Commerce and onto a hybrid Google Cloud architecture — with full trading continuity throughout.

The Constraint That Changed Everything

Central Online handles peak loads during campaign events that would stress any e-commerce infrastructure. Campaigns don’t pause for infrastructure migrations. The business constraint was non-negotiable: no customer-facing downtime, no degraded performance during peak periods, no rollback that requires a 4am call.

This constraint ruled out the most common migration pattern: lift-and-shift followed by a cutover window. In a retail environment with 24/7 trading expectations, there is no acceptable cutover window.

What it required was something more architecturally demanding: a running system that exists in two states simultaneously, with traffic routing between them based on feature readiness — not a single migration event.
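
As a rough illustration of readiness-based routing, the sketch below keeps a table of path prefixes and sends a request to the new platform only when that feature is marked ready; everything else defaults to the monolith. The prefixes, backend labels, and the `choose_backend` helper are hypothetical, not the routing layer used in this migration.

```python
from dataclasses import dataclass

MONOLITH = "monolith"   # hypothetical backend labels
NEW_PLATFORM = "gcp"

@dataclass(frozen=True)
class Route:
    prefix: str   # URL path prefix owned by one feature
    ready: bool   # True once the migrated service is live and verified

# Hypothetical routing table; order does not matter, the longest prefix wins.
ROUTES = [
    Route("/media/", ready=True),
    Route("/catalogue/", ready=True),
    Route("/search/", ready=False),
    Route("/checkout/", ready=False),
]

def choose_backend(path: str, routes=ROUTES) -> str:
    """Return the backend for a request path, defaulting to the monolith."""
    matches = [r for r in routes if path.startswith(r.prefix)]
    if not matches:
        return MONOLITH
    best = max(matches, key=lambda r: len(r.prefix))
    return NEW_PLATFORM if best.ready else MONOLITH
```

Flipping a single `ready` flag moves one feature at a time, which is what lets the system exist in two states without a cutover event.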

The Strangler Fig, Applied

The strangler fig pattern — replacing a legacy system piece by piece until nothing of the original remains — is well-documented in theory. In practice, the difficulty is choosing which vine to grow first.

Our sequencing logic was straightforward: start with the systems that are most decoupled from the Adobe Commerce core, and work inward. This gave us:

  1. Static content and media → migrated to GCP Cloud CDN first. Zero application risk.
  2. Product catalogue API → extracted as an independent service, fronted by a backend-for-frontend (BFF) layer.
  3. Search and navigation → decoupled from the monolith using an event-driven sync pattern.
  4. Cart and checkout → last, because this is where the complexity concentrates.

Each step produced a running service in production before the next was started. The monolith ran in parallel throughout, handling the portions not yet migrated.
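
Step 3 names an event-driven sync pattern without detailing it. One common shape, shown here as a minimal in-memory sketch, is for the monolith to publish product-change events and for the new search service to apply them idempotently, ignoring anything older than what it already holds. The event fields (`sku`, `version`, `name`) and the `SearchIndex` class are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ProductChanged:
    sku: str
    version: int   # assumed to increase with every change to a SKU
    name: str

@dataclass
class SearchIndex:
    docs: dict = field(default_factory=dict)   # sku -> (version, name)

    def apply(self, event: ProductChanged) -> bool:
        """Apply an event; return False if it is stale or a duplicate."""
        current = self.docs.get(event.sku)
        if current is not None and current[0] >= event.version:
            return False
        self.docs[event.sku] = (event.version, event.name)
        return True

index = SearchIndex()
index.apply(ProductChanged("A1", 1, "Kettle"))
index.apply(ProductChanged("A1", 2, "Electric kettle"))
index.apply(ProductChanged("A1", 1, "Kettle"))   # late redelivery, ignored
```

The version check is what makes redelivered or out-of-order events safe, so the index can be rebuilt by replaying events while the monolith keeps serving traffic.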

The Tuesday at 2am

No migration account is complete without the incident that doesn’t make the slide deck.

Seven weeks into the migration, during a Tuesday flash sale, the database connection pool for the newly migrated catalogue service saturated. The service degraded. The BFF layer fell back to the monolith automatically, which is why customers saw nothing. But the alert fired, and the post-mortem was educational.
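
The automatic fallback can be pictured with a small helper like the one below. It is a sketch under assumptions, not the BFF code from this project: `with_fallback`, `catalogue_service`, and `monolith_catalogue` are hypothetical names, and a real implementation would also need timeouts and probably a circuit breaker.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("bff")

def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Call the new service; on any failure, log it and call the monolith instead."""
    try:
        return primary()
    except Exception as exc:   # broad on purpose: any failure should trigger fallback
        log.warning("primary backend failed, falling back: %s", exc)
        return fallback()

# Hypothetical backends, used only to exercise the helper.
def catalogue_service() -> dict:
    raise TimeoutError("connection pool exhausted")

def monolith_catalogue() -> dict:
    return {"source": "monolith", "items": []}

result = with_fallback(catalogue_service, monolith_catalogue)
```

Logging the failure matters as much as the fallback itself: a silent fallback would have protected customers while hiding the undersized pool from the team.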

The connection pool configuration had been sized for average load, not campaign load. The monolith had been tuned for this over years. The new service hadn’t inherited that institutional knowledge — it inherited a clean config from a staging environment.

The fix was straightforward; the lesson was less so. Every new service inherits the traffic behaviour of the system it replaces, not the traffic behaviour of a test environment. Sizing decisions need to come from production data, not architectural assumptions.
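
One way to turn production data into a pool size is Little's law: the number of connections in use is roughly the query arrival rate multiplied by the time each query holds a connection. The sketch below applies that estimate with some headroom. The function name, the headroom default, and the traffic figures are all hypothetical, not measurements from this migration.

```python
import math

def required_connections(peak_queries_per_sec: float,
                         mean_hold_seconds: float,
                         headroom: float = 0.3) -> int:
    """Little's law estimate: connections in use = arrival rate x hold time, plus headroom."""
    in_use = peak_queries_per_sec * mean_hold_seconds
    return math.ceil(in_use * (1 + headroom))

# Hypothetical figures: same query cost, very different arrival rates.
average_load = required_connections(400, 0.02)     # ordinary trading
campaign_load = required_connections(3000, 0.02)   # flash-sale peak
```

With these made-up numbers the campaign estimate is several times the average one, which is the gap a staging-derived config would miss.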

The Hybrid State as a Feature

The most counterintuitive outcome of this migration was that the hybrid state — monolith and microservices running in parallel — became a deliberate feature, not a temporary compromise.

Some services remained on-premise by design. The ERP integration, for example, depends on network latency that a cloud-hosted deployment could not match at acceptable cost. The target architecture is hybrid on purpose, not because the migration stalled.

This is a distinction that matters in architectural conversations. “Hybrid cloud” is sometimes used as a euphemism for “we didn’t finish migrating.” In our case, it was a deliberate cost-performance decision backed by latency data.

What Zero Downtime Actually Cost

The migration took longer than a cutover approach would have. Running the two systems in parallel meant maintaining two versions of integration contracts during transition periods. Testing effort doubled.
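
To make the contract overhead concrete, here is a minimal sketch of a consumer that has to accept two versions of the same payload during a transition. The field names, the `contract_version` key, and the `normalise_order` function are invented for illustration; the real contracts are not described in this article.

```python
def normalise_order(payload: dict) -> dict:
    """Accept either contract version and return one internal shape."""
    # Assumption: v1 payloads predate versioning and carry no version field.
    version = payload.get("contract_version", 1)
    if version == 1:
        return {"order_id": payload["id"], "total": payload["amount"]}
    if version == 2:
        return {"order_id": payload["order_id"], "total": payload["total"]}
    raise ValueError(f"unsupported contract version: {version}")

# Both shapes must be exercised in tests for as long as both are live.
from_monolith = normalise_order({"id": "A100", "amount": 1250})
from_new_service = normalise_order(
    {"contract_version": 2, "order_id": "A100", "total": 1250}
)
```

Every branch like this is a second test path, which is one reason the testing effort grows while both systems are live.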

The trade was worth it. But anyone promising zero downtime at zero overhead is selling you a diagram, not a delivery.


The case studies are right that zero-downtime migration is achievable. They’re just incomplete about what it demands in architectural discipline, team coordination, and the unglamorous work of keeping two systems honest with each other while the transition completes.

The vine grows slowly. That’s the point.