Cloud Migration: A Practical Guide to Moving Without Breaking Things


I’ve watched three cloud migration projects fall apart. Not because the technology failed — the tech almost never fails. They failed because nobody planned for the human and process side of the equation. The databases migrated fine. The services cut over cleanly. And then the on-call team had no idea how to debug a production issue in a system they’d never touched.

This post is about the things that actually go wrong during a cloud migration, and how to plan around them. Not the theory — the reality.

Why Most Migrations Go Sideways

The pitch always sounds clean: lift-and-shift your VMs, modernize later, prove value fast. And that can work. But the migrations that crater usually do so because of three things nobody puts on the slide deck:

Knowledge debt. Your senior engineers know your on-prem systems deeply. They know which Oracle query runs slow at 3am on the last day of the month. They know that one app that was never documented but somehow keeps the billing system running. Cloud doesn’t have those tribal memories. Your team walks into a new environment and loses all that context instantly.

Dependency black holes. Everything is connected. I once spent two weeks debugging a migration failure because the downstream legacy system was doing a SOAP call to an internal IP address hardcoded in a config file that nobody knew existed. Nobody documented it because it “would never change.”

Cutover window pressure. When the migration is scheduled, the business expects downtime to be minimal. And then you discover that your rollback plan takes 4 hours, but the window is 2 hours. You’re now making decisions under pressure that should have been made weeks ago.

Phase 1: Assessment That Actually Helps

Here’s what most assessment phases look like: a consultant shows up, runs some scanning tools, builds a dependency map from static analysis, and delivers a 200-slide deck that lists your applications and classifies them as “lift-and-shift,” “re-platform,” or “re-architect.” That’s not assessment. That’s categorization.

What you actually need:

Asset Inventory With Runtime Context

Static inventory tells you what’s deployed. You need to know what actually talks to what during actual business hours. Instrument your network taps or use traffic analysis tools to capture real traffic patterns for at least two weeks — including a month-end or peak period if your business has one.

# Example: capturing flow data on Linux for traffic analysis.
# Rotate hourly files so the capture can run for the full two weeks;
# -s 96 keeps only headers, which is all flow analysis needs
sudo tcpdump -i eth0 -s 96 -w '/tmp/capture_%Y%m%d_%H%M.pcap' -G 3600 -W 336
# (-G 3600 rotates hourly; -W 336 caps it at two weeks of hourly files)
# Then analyze with something like argus or SiLK

The goal is to find the undocumented dependencies. You’ll find at least one service doing DNS lookups against an internal IP that nobody remembers configuring.

Blast Radius Mapping

For each application, map out what breaks if it goes down. Not just “the service is unavailable” — I mean the actual business impact. What’s the user-visible effect? What’s the revenue impact? How long can the business tolerate this being down?

This sounds like a business exercise, but your job is to make the engineers understand it so they can make intelligent tradeoffs during migration.

The Discovery You Won’t Want to Find

Look for:

  • Hardcoded IPs in application configs (yes, this still happens; a quick grep for them is sketched after this list)
  • Legacy authentication systems that use NTLM or Kerberos ticket passing
  • Services that depend on local filesystem paths across machines
  • Database links between systems (Oracle Data Guard, SQL Server Linked Servers)
  • Batch jobs that run at specific times and depend on other batch jobs completing
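
A quick way to hunt for the first item: a minimal sketch, assuming your configs live under /etc and /opt/app/config (adjust the paths and globs for your layout), and expect false positives like version strings and netmasks:

# Grep config trees for IPv4 literals (paths and globs are examples)
grep -rnoE '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b' \
  /etc /opt/app/config \
  --include='*.conf' --include='*.properties' \
  --include='*.xml' --include='*.yml' 2>/dev/null | sort -u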

I found a medical records system last year where the backup job literally SCP’d a tar file to a different server and then ran a shell script from inside the tarball. Nobody on the current team knew this existed. The person who set it up retired in 2018.

AWS Migration Hub Discovery (2024-2026)

AWS Migration Hub’s discovery tooling matured significantly in 2024: the Application Discovery Agent feeds live workload data into Migration Hub, and Strategy Recommendations uses that profile to suggest a migration strategy per application:

# 1. Download and install the Application Discovery Agent on the source
#    server (the download URL is region-specific; check the AWS docs)
curl -o ./aws-discovery-agent.tar.gz \
  https://s3-us-west-2.amazonaws.com/aws-discovery-agent.us-west-2/linux/latest/aws-discovery-agent.tar.gz
tar -xzf aws-discovery-agent.tar.gz
sudo bash install -r us-west-2 -k <ACCESS_KEY_ID> -s <SECRET_ACCESS_KEY>

# 2. Confirm the agent is reporting, then list discovered servers
aws discovery describe-agents
aws discovery list-configurations --configuration-type SERVER

# 3. Pull observed network connections for one discovered server
aws discovery list-server-neighbors \
  --configuration-id <SERVER_CONFIGURATION_ID> \
  --port-information-needed

This gives you a dependency map based on live network traffic rather than static analysis, which is far more accurate than anything a consultant’s scan tool finds.

Phase 2: Migration Strategy — Choosing Your Approach

There are now seven strategies (AWS added “relocate” as the seventh R), and the right one depends on your constraints:

Rehost (Lift-and-Shift)

Move VMs as-is to cloud instances. Fast. Cheap in the short term. Painful in the long term because you haven’t fixed any of the problems that made your on-prem environment hard to manage.

Good for: migrations where the timeline is the primary constraint and you have the engineering capacity to refactor after the cutover.

AWS Application Migration Service (MGN) has replaced CloudEndure as the preferred tool for this:

# 1. Install the MGN replication agent on the source server; it registers
#    the server and starts block-level replication (the installer URL is
#    region-specific; check the AWS docs)
curl -o ./aws-replication-installer-init.py \
  https://aws-application-migration-service-us-east-1.s3.us-east-1.amazonaws.com/latest/linux/aws-replication-installer-init.py
sudo python3 aws-replication-installer-init.py --region us-east-1

# 2. Wait for initial sync, then confirm the server is ready for testing
aws mgn describe-source-servers

# 3. Launch a test instance (always test before cutover)
aws mgn start-test --source-server-ids s-0123456789abcdef0

# 4. Final cutover once the test instance checks out
aws mgn start-cutover --source-server-ids s-0123456789abcdef0

# 5. Finalize after validating the cutover instance
aws mgn finalize-cutover --source-server-id s-0123456789abcdef0

Watch out for: licensing traps (Windows licenses don’t automatically move), storage throughput assumptions (cloud instance store vs. EBS behave differently), and network topology surprises (your 10Gbps on-prem network doesn’t exist in the cloud by default).

Replatform

Make targeted changes to run on cloud infrastructure without full re-architecting. Move to RDS instead of managing your own MySQL. Move to managed Kubernetes instead of managing your own cluster. Keep your application code largely intact.

For database migrations, AWS Database Migration Service (DMS) with CDC (Change Data Capture) handles the replication:

# Create a replication instance
aws dms create-replication-instance \
  --replication-instance-identifier my-dms-instance \
  --replication-instance-class dms.t3.medium \
  --vpc-security-group-ids sg-0123456789abcdef0 \
  --availability-zone us-east-1a

# Create source and target endpoints
aws dms create-endpoint \
  --endpoint-identifier source-oracle \
  --endpoint-type source \
  --engine-name oracle \
  --server-name oracle-source.internal \
  --port 1521 \
  --database-name ORCL

# For a heterogeneous Oracle -> Aurora PostgreSQL move, convert the schema
# first with the AWS Schema Conversion Tool (SCT); DMS moves the data
aws dms create-endpoint \
  --endpoint-identifier target-aurora \
  --endpoint-type target \
  --engine-name aurora-postgresql \
  --server-name my-cluster.cluster-xxx.us-east-1.rds.amazonaws.com \
  --port 5432 \
  --database-name mydb

# Start full-load and CDC replication
aws dms create-replication-task \
  --replication-task-identifier my-task \
  --source-endpoint-arn arn:aws:dms:... \
  --target-endpoint-arn arn:aws:dms:... \
  --replication-instance-arn arn:aws:dms:... \
  --migration-type full-load-and-cdc \
  --table-mappings file://table-mappings.json
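
The table-mappings.json referenced above controls which schemas and tables replicate. A minimal sketch (the APP schema name is a placeholder):

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-app-schema",
      "object-locator": {
        "schema-name": "APP",
        "table-name": "%"
      },
      "rule-action": "include"
    }
  ]
}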

This is where most mid-size migrations end up. You’re making smart compromises between speed and long-term benefit.

Good for: teams that need to migrate within 6-12 months and have enough cloud familiarity to make good platform choices.

Refactor / Re-architect

Rewrite significant portions of the application to use cloud-native patterns. Move to serverless, adopt event-driven architecture, decompose the monolith.

This is the expensive, slow, high-risk option that often delivers the most long-term value. I’ve seen re-architect projects run 18 months and cost 3x the original estimate.

Good for: applications that are strategic, have high maintenance costs, and where the team has strong cloud engineering skills.

Repurchase / Retire

Move to SaaS instead of migrating. Or just turn it off.

This is the option nobody talks about but often makes the most financial sense. Before you plan a migration, ask: does this application still serve the business? When was the last time someone actually used it? I once saved a company $400K in migration costs by suggesting they just cancel a legacy ERP module that nobody had logged into in 14 months.

Phase 3: Cutover Planning — The Part Nobody Does Well

Here’s where migrations fail: the cutover plan assumes everything works the first time.

The Rollback Test

Before you migrate anything, test your rollback plan. Actually execute it. In staging, simulate a migration, then execute the rollback. Measure how long it takes. I guarantee you’ll find something that breaks your rollback window assumption.

#!/usr/bin/env bash
# Example rollback test script structure -- test in staging first!
set -euo pipefail

rollback_test() {
    local start_time
    start_time=$(date +%s)
    echo "Starting rollback test at $(date)"
    # 1. Stop application
    # 2. Snapshot cloud resources
    # 3. Execute migration
    # 4. Verify application state
    # 5. Execute rollback
    # 6. Verify return to original state

    # Measure elapsed time against the agreed cutover window
    local rollback_time=$(( $(date +%s) - start_time ))
    echo "Rollback took ${rollback_time}s"
    if [ "$rollback_time" -gt 7200 ]; then
        echo "ERROR: rollback time exceeds the 2-hour window" >&2
        exit 1
    fi
}

rollback_test

The Dual-Write Problem

If you’re migrating a system that writes data, you have to decide how to handle the cutover. Do you:

  1. Stop the source, migrate, start the target (simple, but downtime equals migration time)
  2. Set up replication and cut over to read traffic first, then write traffic (complex, but lower downtime)

Option 2 sounds better until you try to implement it and discover that your database replication has a lag of 30 seconds that your application absolutely does not tolerate.
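
If you do attempt option 2 with DMS, measure that lag before trusting it. A sketch using the CDC latency metric DMS publishes to CloudWatch (the instance and task identifiers reuse the earlier example; GNU date syntax assumed):

# Max CDC latency in seconds on the target over the last 15 minutes
aws cloudwatch get-metric-statistics \
  --namespace AWS/DMS \
  --metric-name CDCLatencyTarget \
  --dimensions Name=ReplicationInstanceIdentifier,Value=my-dms-instance \
               Name=ReplicationTaskIdentifier,Value=my-task \
  --start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Maximum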

For most systems, a planned maintenance window with a stopped-source migration is more reliable than a complex dual-write setup. If you can’t tolerate downtime, you need to re-architect the application for dual-write first — that’s a separate project.

DNS Cutover Strategy

Most migrations use DNS changes to redirect traffic. Here’s what goes wrong:

  • Local DNS caches don’t respect your TTL (users’ routers cache for days)
  • Corporate DNS resolvers ignore TTLs
  • Some applications hardcode the old IP and don’t check DNS again

Fix: reduce your DNS TTL to 60 seconds at least 48 hours before cutover. But understand that this doesn’t guarantee instant propagation. For critical systems, implement a migration mode where the old system temporarily proxies to the new one, giving you a full rollback path even after DNS propagates.
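
If the records live in Route 53, the TTL drop is a single call. A minimal sketch (hosted zone ID, record name, and IP are placeholders):

# Drop the record's TTL to 60 seconds ahead of cutover
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789ABCDEFGHIJ \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.10"}]
      }
    }]
  }'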

Phase 4: Post-Migration — The Hidden Phase

The cutover is not the end. Plan for at least two weeks of intensive monitoring and quick iteration. Things that will break:

Logging and monitoring gaps. Your on-prem monitoring doesn’t translate. Cloud services emit different metrics. Set up CloudWatch or your preferred monitoring before cutover, not after.

Cost surprises. Cloud costs are non-linear and often surprising. A t3.medium looks cheap until you’re running 500 of them 24/7. Set up cost alerts before migration.
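
A minimal sketch of such an alert with the Budgets CLI (account ID, amount, and email are placeholders):

# Email when actual spend crosses 80% of a monthly budget
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "migration-monthly",
    "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "team@example.com"}]
  }]'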

IAM permission hell. Your application will need cloud permissions it never needed on-prem. The “it worked in staging” phenomenon is real — staging has different IAM policies than production.

Connectivity to on-prem dependencies. If you still have on-prem systems (and you will, for a while), your VPN or Direct Connect setup needs to be validated before cutover, not after.
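
A bare-bones reachability check you can run from a migrated cloud instance before cutover; the hostnames and ports are examples, and bash's /dev/tcp does the probing:

# Verify each on-prem dependency is reachable over the VPN / Direct Connect
for dep in oracle-source.internal:1521 ldap.internal:636 smtp.internal:25; do
  host=${dep%%:*}; port=${dep##*:}
  if timeout 3 bash -c "</dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OK   ${dep}"
  else
    echo "FAIL ${dep}"
  fi
done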

The Change Management Piece (The Human Side)

I’m not going to give you a 5-step change management framework. Here’s what actually works:

Document the operational runbook before migration, not after. Your on-call engineers need to know how to debug this system in the cloud. Write the runbook while the on-prem engineers are still available. That’s the knowledge transfer window — it’s short and it closes fast.

Shadow shifts. After migration, have someone who knew the on-prem system shadow the cloud team for at least a week. Not to hand-hold — to catch the undocumented dependencies that only surface under specific conditions.

Accept that you’ll break things. A migration with zero breakage is either trivial or hasn’t hit production yet. Plan for incidents. Define the escalation path. Pre-assign the incident commander role so nobody is figuring out who owns what at 2am.

What Changed Recently (2024-2026)

The cloud migration landscape shifted significantly between 2024 and 2026:

AWS MGN replaced CloudEndure. While CloudEndure (acquired by AWS in 2019) was the standard lift-and-shift tool for years, AWS MGN (Application Migration Service) became the recommended approach, with better integration into the broader AWS migration ecosystem and simplified licensing workflows.

FinOps became mandatory, not optional. Migrations that don’t include cost modeling are now considered incomplete. AWS Cost Explorer, budget alerts, and resource tagging must be set up before cutover, not after. A t3.medium at $33/month doesn’t look cheap when you’re running 200 of them plus NAT Gateway charges, data transfer costs, and EBS storage.

Lift-and-shift got a second look. The industry realized “lift-and-shift doesn’t work” was overstated. Rehosting with subsequent optimization is often the fastest path to cloud value. Organizations that spent 6 months re-architecting a legacy app could have been running in AWS for 4 months and refactoring in parallel.

Mainframe migrations became viable. AWS Mainframe Modernization (M2), with its Blu Age refactoring and Micro Focus replatforming engines, made COBOL/DB2-to-cloud migrations practical for the first time. If you have IBM mainframe workloads, it’s worth evaluating.

Landing Zone setup matured. AWS Control Tower became the standard for establishing a well-architected multi-account environment before migrating workloads, with automated guardrails for security and compliance.

Common Migration Gotchas (2024-2026)

  • Egress costs bite harder than expected. Moving large datasets to cloud incurs significant data transfer costs. Use Snowball Edge for petabyte-scale transfers rather than trying to push data over the internet.
  • Licensing complexity is brutal for Oracle, SQL Server, SAP. These vendors have complex licensing rules in cloud environments. Use AWS License Manager to track and optimize — and budget for surprises.
  • Tagging inconsistency breaks everything. Cost allocation, security audits, and operations all depend on consistent tagging. Enforce tags via AWS Control Tower SCPs before migration begins, not after (a sample SCP is sketched after this list).
  • “We can migrate the database in 2 weeks” is almost always wrong. Database migration is 40-60% of total migration effort for most applications. Budget accordingly.
  • Security doesn’t migrate with the workloads. Cloud doesn’t mean secure by default. IAM policies, Security Groups, and encryption must be applied during migration, not after.
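
A minimal sketch of such a guardrail: a service control policy that denies launching EC2 instances without a CostCenter tag (the tag key is an example; broaden the action list for real use):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyRunInstancesWithoutCostCenterTag",
    "Effect": "Deny",
    "Action": "ec2:RunInstances",
    "Resource": "arn:aws:ec2:*:*:instance/*",
    "Condition": {
      "Null": { "aws:RequestTag/CostCenter": "true" }
    }
  }]
}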

What This Actually Costs

A realistic cloud migration for a mid-size application portfolio (call it 20-40 applications):

  • Assessment: 4-8 weeks
  • Migration execution: 3-9 months depending on strategy and team size
  • Post-migration stabilization: 2-3 months
  • Total: 6-18 months and significant cost

The teams that succeed treat cloud migration as a program, not a project. They have dedicated program management, engineering teams with allocated capacity, and executive sponsorship that survives the inevitable delays and scope changes.

The teams that fail treat it as an engineering task, throw it over the wall to the infrastructure team, and are surprised when it takes twice as long and costs twice as much.

Start with the assessment. Know what you’re actually dealing with. Then make informed decisions about strategy, timeline, and budget. Everything else follows from that.


If you’re mid-migration and running into specific problems, the posts on high availability in AWS and spot instances for cost optimization might help with some of the architectural decisions. For database-level decisions during migration, Aurora vs RDS covers managed database options, and AWS NAT Gateway covers VPC networking patterns for migrated workloads.
