Terraform and Ansible: The Integration That Actually Works (And the Parts That Will Ruin Your Weekend)

Bits Lovers
Written by Bits Lovers on
Terraform and Ansible: The Integration That Actually Works (And the Parts That Will Ruin Your Weekend)

Here’s the setup: you need to provision infrastructure and then configure it. Terraform does the first part beautifully. Ansible does the second part beautifully. The moment you try to make them talk to each other, you enter a zone of configuration pain that the documentation hand-waves away.

I’ve done this integration wrong more times than I care to admit. Here’s what actually works, what breaks in practice, and the specific failure modes that will ruin your Friday night deployment if you don’t know to watch for them.

Why Combining These Tools Is Worth the Headache

Let me make the case first, because there’s a real question of whether it’s worth it.

Terraform’s sweet spot is creating things: VPCs, EC2 instances, RDS databases, S3 buckets, IAM roles. It manages state, handles dependencies between resources, and plans changes before applying them. That’s solid infrastructure provisioning.

What Terraform is bad at: configuring the inside of things. Installing packages, editing config files, restarting services, deploying application code. You can shoehorn it into doing these things with user_data scripts, null_resource, and local-exec provisioners, but you’re fighting the tool. Terraform is not a configuration management tool. Pretending it is leads to unmaintainable hacks.

Ansible’s sweet spot is configuring the inside of things: installing Nginx, copying application configs, deploying from git, running database migrations. It uses SSH, so it works the same way everywhere. The playbook syntax is readable. The idempotency is real.

What Ansible is bad at: creating cloud infrastructure. You can technically create AWS resources with Ansible’s ec2 modules, but it doesn’t manage state well, it doesn’t have a planning phase, and it doesn’t handle cross-resource dependencies cleanly.

So: Terraform for infrastructure creation, Ansible for post-provisioning configuration. The tools complement each other. The integration is worth doing.

The Fundamental Problem: How Does Ansible Reach the New Server?

This is the question that trips up everyone. Terraform creates a server. Ansible needs to connect to that server to configure it. How does Ansible know the server exists, what its IP address is, and how to authenticate to it?

There are two mental models for solving this: pull and push. The choice between them has real operational implications.

The Pull Model: Server Pulls Its Own Configuration

With pull, you don’t connect to the new server from outside. The server boots up, runs a bootstrap script from user data, and that script pulls the Ansible playbook from somewhere accessible — S3, a Git repo, an object store — and runs it locally on the machine.

This approach is elegant in a few ways:

  • You don’t need any inbound access to the new server. It only needs outbound internet or VPC-internal access to wherever you store the playbook.
  • CI/CD runners don’t need SSH access to production servers. The runner just uploads the playbook to S3.
  • The bootstrap process is self-contained. If Terraform creates 20 servers simultaneously, there’s no bottleneck waiting for a central Ansible runner to SSH into each one.

The catch: the bootstrap has to work. If the user data script fails for any reason — wrong IAM permissions, network misconfiguration, a bug in the bootstrap script — the server sits there unconfigured and nobody tells you until someone SSHes in manually and checks.

Here’s a real failure I hit: the bootstrap script downloaded the Ansible playbook from S3 using an IAM role. The IAM role had a policy that restricted which S3 prefix it could read. The policy had a typo — s3://bucket/ansible/ instead of s3://bucket/ansible*. The script failed silently because Ansible exits with code 0 when the playbook directory is empty. Three servers came up completely unconfigured and nobody noticed for two days.

The Push Model: Ansible Connects From Outside

With push, a central Ansible runner — your laptop, a CI runner, a dedicated automation server — SSHs into the newly created server and runs the playbook directly.

This gives you immediate feedback. Ansible tells you if the playbook fails, which step failed, and what the error was. You can retry easily. Debugging is straightforward.

The cost: you need SSH connectivity from wherever you’re running Ansible to the target servers. In AWS, this typically means:

  • The CI runner in a public subnet with a public IP, or
  • A VPN/bastion setup, or
  • Systems Manager Session Manager with Ansible over SSM

None of these are free to set up. And if your network topology is complex — and in regulated environments it usually is — the SSH path from your runner to the new server can be a genuine security concern.

Which Model to Pick

I’ve used both extensively. My rule of thumb:

Use pull for immutable infrastructure patterns — when servers are created as part of an auto-scaling event or a blue-green deployment, and you don’t want any external access. Use push for mutable infrastructure patterns — when you’re doing ad-hoc configuration changes to existing servers and want immediate feedback.

For a CI/CD pipeline that creates servers and configures them: pull is cleaner as long as your bootstrap is battle-tested. Push is safer if you’re still iterating on your configuration.

The Integration Architecture That Actually Works

Here’s the pattern I’ve landed on after several painful iterations.

Directory Structure

infra/
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── user_data/
│       └── bootstrap.sh
├── ansible/
│   ├── playbook.yml
│   ├── roles/
│   │   └── webserver/
│   │       ├── tasks/
│   │       │   └── main.yml
│   │       └── templates/
│   │           └── nginx.conf.j2
│   └── inventory/
│       └── dynamic_inventory.py
├── scripts/
│   └── package_and_upload.sh
└── Makefile

Separation of concerns matters. Terraform owns the infrastructure code. Ansible owns the configuration code. Scripts handle the packaging and handoff between them.

Packaging the Ansible Playbook

Terraform can’t directly read your local Ansible files. The new server can’t either. The simplest handoff is to zip the Ansible directory and upload it to S3.

#!/bin/bash
set -e

TERRAFORM_DIR="terraform"
ANSIBLE_DIR="ansible"
S3_BUCKET="${S3_BUCKET:-my-infra-playbooks}"
S3_PREFIX="ansible"

# Package the playbook
cd "$(dirname "$0")/.."
PACKAGE_DIR="ansible_package"
rm -rf "$PACKAGE_DIR" && mkdir -p "$PACKAGE_DIR"

# Copy only what the server needs
cp -r ansible/playbook.yml "$PACKAGE_DIR/"
cp -r ansible/roles "$PACKAGE_DIR/"

# Calculate content hash for versioning
CONTENT_HASH=$(find "$PACKAGE_DIR" -type f -exec md5sum {} \; \
    | awk '{print $1}' | sort | md5sum | awk '{print $1}')
ARCHIVE_NAME="playbook-${CONTENT_HASH:0:12}.zip"
cd "$PACKAGE_DIR" && zip -r "$ARCHIVE_NAME" . && cd ..

# Upload to S3
aws s3 cp "$PACKAGE_DIR/$ARCHIVE_NAME" \
    "s3://${S3_BUCKET}/${S3_PREFIX}/${ARCHIVE_NAME}"

echo " playbook uploaded: s3://${S3_BUCKET}/${S3_PREFIX}/${ARCHIVE_NAME}"
echo " content_hash=${CONTENT_HASH}"

The content hash is the key idea. Every time you upload, you get a new file with a different name. The user data script downloads a specific file by name, so you can do blue-green deployments with zero overlap: version N deploys, you upload version N+1, version N+1 deploys. No race conditions, no partial updates.

The Bootstrap Script That Doesn’t Lie to You

Here’s the bootstrap script I use. Critically, it handles errors properly and reports failures back to CloudWatch:

#!/bin/bash
set -e

# Redirect stdout and stderr to CloudWatch
exec > >(tee /var/log/user-data.log | logger -t user-data -s 2>&1)
set -x

# Configuration passed from Terraform via template
BUCKET=""
PREFIX=""
HASH=""
ENVIRONMENT=""

echo "Starting bootstrap at $(date)"
echo "S3 bucket: $BUCKET"
echo "Content hash: $HASH"
echo "Environment: $ENVIRONMENT"

# Verify IAM permissions before starting
if ! aws s3 ls "s3://${BUCKET}/${PREFIX}/" > /dev/null 2>&1; then
    echo "ERROR: Cannot list S3 bucket. Check IAM role."
    exit 1
fi

# Download and extract playbook
PLAYBOOK_ZIP="/tmp/playbook.zip"
PLAYBOOK_DIR="/opt/ansible"

echo "Downloading playbook from S3..."
aws s3 cp "s3://${BUCKET}/${PREFIX}/playbook-${HASH}.zip" "$PLAYBOOK_ZIP"

if [ ! -f "$PLAYBOOK_ZIP" ]; then
    echo "ERROR: Failed to download playbook zip"
    exit 1
fi

echo "Extracting playbook..."
rm -rf "$PLAYBOOK_DIR" && mkdir -p "$PLAYBOOK_DIR"
unzip -q "$PLAYBOOK_ZIP" -d "$PLAYBOOK_DIR"

# Install Ansible (Amazon Linux 2023)
echo "Installing Ansible..."
amazon-linux-extras install ansible2 -y

# Export environment variables for Ansible to consume
export ENVIRONMENT="$ENVIRONMENT"
export DEPLOYMENT_HASH="$HASH"
export AWS_REGION=""

# Run the playbook
echo "Running Ansible playbook..."
cd "$PLAYBOOK_DIR"
ansible-playbook \
    -i "localhost," \
    --connection=local \
    -e "environment=${ENVIRONMENT}" \
    playbook.yml

echo "Bootstrap completed at $(date)"

# Mark instance as configured in S3 (optional, useful for auto-scaling)
aws s3 cp /dev/null "s3://${BUCKET}/${PREFIX}/instances/${INSTANCE_ID}"

# Notify completion (if SNS topic is configured)
if [ -n "" ]; then
    aws sns publish \
        --topic-arn "" \
        --message "Instance ${INSTANCE_ID} configured successfully" \
        --subject "Bootstrap Complete: ${INSTANCE_ID}"
fi

The set -e at the top means the script exits immediately on any failure. The CloudWatch logging means you can see what failed even if the instance terminates. The S3 verification step catches IAM permission problems before they cascade into confusing Ansible failures.

Passing Variables from Terraform to Ansible

The `` syntax comes from Terraform’s template rendering. In your Terraform code:

data "archive_file" "playbook" {
  type        = "zip"
  source_dir  = "${path.module}/../ansible"
  output_path = "${path.module}/playbook.zip"
}

resource "aws_s3_bucket_object" "playbook" {
  bucket = var.s3_bucket
  key    = "ansible/playbook-${md5(data.archive_file.playbook.output_path)}.zip"
  source = data.archive_file.playbook.output_path
  etag   = filemd5(data.archive_file.playbook.output_path)
}

data "template_file" "user_data" {
  template = file("${path.module}/user_data/bootstrap.sh")
  vars = {
    s3_bucket   = var.s3_bucket
    s3_prefix   = "ansible"
    content_hash = md5(data.archive_file.playbook.output_path)
    environment = var.environment
    aws_region  = var.aws_region
    sns_topic   = var.sns_topic_arn
    instance_id = "$${INSTANCE_ID:-unknown}"
  }
}

Notice $${INSTANCE_ID} — the double dollar sign escapes Terraform’s interpolation so the shell receives a literal ${INSTANCE_ID} which is set by AWS when the instance boots. The variable substitution happens on the instance, not in Terraform’s template rendering.

The Failure Modes That Will Ruin Your Weekend

Failure Mode 1: Terraform State Drift Between Plan and Apply

Terraform’s planning phase reads your current state and computes a diff. But between plan and apply — especially in a CI pipeline — another Terraform run might execute. Or someone might manually create a resource that Terraform then doesn’t know about.

The result: Terraform destroys something you didn’t expect because it thinks it needs to reconcile state.

Fix: use remote state with state locking (S3 backend with DynamoDB). Never use local state in a team environment. Ever.

Failure Mode 2: Ansible Host Key Verification Failures

When Ansible connects via SSH for the first time to a new server, it asks “Are you sure you want to continue connecting?” The known_hosts prompt will hang the playbook in non-interactive mode.

Fix: add the instance’s public IP to known_hosts, or disable host key checking in ansible.cfg:

[defaults]
host_key_checking = False

In production, I’d rather add the host key properly. In a pinch, the above works.

Failure Mode 3: Ansible and Terraform State Getting Out of Sync

If Terraform creates a server, Ansible fails to configure it, and someone manually SSHes in to fix it, your Terraform state says the server is configured. It isn’t. Terraform will not re-run your user data unless you taint the resource:

terraform taint aws_instance.web
terraform plan  # Now shows the instance will be recreated/reconfigured
terraform apply

This is why pull-based bootstrap needs good error reporting. If the bootstrap fails silently, you won’t know until someone manually checks.

Failure Mode 4: The “Works on My Machine” Ansible Failure

Ansible tasks have a habit of working when run locally on your machine (where you have all your SSH keys, environment variables, and local config files) and failing on the target server because it doesn’t have those things.

Common culprits:

  • A task that references ~/.ssh/id_rsa but the key is not present on the target
  • A task that assumes python3 is available (Amazon Linux 2023 uses python3.11, older AMIs use python)
  • A task that assumes a specific directory exists (always use ansible.builtin.file to create directories first)

Failure Mode 5: Slow Boot + Fast Ansible

Ansible tries to connect immediately after the server is created. But the server takes 60 seconds to boot and reach the network. Ansible times out.

Fix: add a wait in the Ansible runner or in the CI pipeline:

# Wait for SSH to be available before running Ansible
until ssh -o StrictHostKeyChecking=no -o ConnectTimeout=5 \
    -i "${SSH_KEY}" "ec2-user@${SERVER_IP}" exit 0; do
    echo "Waiting for SSH..."
    sleep 10
done

# Now run Ansible
ansible-playbook -i "${SERVER_IP}," ...

A Real Alternative: The remote-exec Provisioner

Before you commit to the Ansible integration, consider Terraform’s remote-exec and local-exec provisioners. They’re not as powerful as Ansible, but they eliminate an entire integration surface.

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = var.instance_type
  key_name      = var.ssh_key_name

  provisioner "remote-exec" {
    inline = [
      "amazon-linux-extras install ansible2 -y",
      "aws s3 cp s3://${var.s3_bucket}/playbook.zip /tmp/playbook.zip",
      "unzip -q /tmp/playbook.zip -d /opt/ansible",
      "cd /opt/ansible && ansible-playbook -i 'localhost,' --connection=local playbook.yml"
    ]

    connection {
      type        = "ssh"
      user        = "ec2-user"
      host        = self.public_ip
      private_key = file(var.ssh_private_key_path)
      timeout     = "10m"
    }
  }

  # Tag the instance for cost tracking
  tags = {
    Name        = "webserver-${var.environment}"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

The remote-exec runs on the instance. The local-exec runs on the machine running Terraform. You can combine them: local-exec to validate the Terraform run succeeded, remote-exec to do basic bootstrap. For anything beyond basic bootstrap, use Ansible.

When to NOT Combine These Tools

There are legitimate cases where you should pick one:

  • Serverless: If your entire infrastructure is Lambda, API Gateway, and managed services, Terraform covers everything. Ansible has nothing to configure on the server because there are no servers.

  • Container infrastructure: If you’re running ECS, EKS, or Fargate, Terraform manages the cluster definition, and application deployment is handled by CI/CD tools (GitHub Actions, ArgoCD). Ansible is unnecessary.

  • Simple environments: If you have 5 servers and they’re all identical, Terraform + a base AMI might be all you need. Ansible’s power shows when you have many servers with different configurations.

The combination earns its complexity when you have: multiple server types, complex post-provisioning steps, team environments where Terraform and Ansible are owned by different people, and a need for audit trails on configuration changes.

The Practical Takeaway

Terraform and Ansible solve different problems. Don’t fight that. Let Terraform own infrastructure creation with state management. Let Ansible own post-provisioning configuration with its agentless SSH model. Bridge them with a simple artifact handoff — zip to S3, bootstrap from user data.

The failure modes are real. Bootstrap failures are the biggest one. Test your bootstrap script in isolation before you trust it in production. Run it on a single test instance, watch the CloudWatch logs, and verify the final state matches what Ansible would have done.

Once that bootstrap is reliable, the rest falls into place.

For more on Terraform, the Terraform debug post covers state inspection and troubleshooting, and the Terraform lookup function covers HCL map access patterns used in configuration management. The GitLab CI Terraform pipeline guide covers running Terraform from CI/CD alongside Ansible configuration.

Bits Lovers

Bits Lovers

Professional writer and blogger. Focus on Cloud Computing.

Comments

comments powered by Disqus