What no one tells you about AWS Auto Scaling Group!

Written by Bits Lovers

Most people know that Auto Scaling Groups monitor your servers and adjust capacity based on traffic. That’s the basic pitch, anyway.

But there’s a feature most tutorials skip over: Lifecycle Hooks. I’ve used these to solve some real problems with production instances, and I want to show you how they work.

What Lifecycle Hooks Actually Do

Lifecycle Hooks let you tap into instance state changes in your Auto Scaling Group. When an instance launches or terminates, the hook holds it in a wait state so you can run custom actions before AWS carries on with the transition.

Save Logs from Your EC2 Before It’s Gone

Here’s the problem I kept running into: instances would fail health checks, get marked unhealthy, and then get terminated by the ASG. When that happened, I’d lose all my logs. No trace of what went wrong. It’s frustrating.

The fix is a Lifecycle Hook that fires when termination is imminent. The hook triggers a Lambda function, which runs commands on the instance via SSM to package the logs and upload them to S3 before the instance disappears.

When This Saved My Bacon

I had an application that would randomly fail ELB health checks. Every few days, an instance would get terminated and I’d have no idea why. After I set up Lifecycle Hooks, I could actually download the logs afterward and see the real error messages. Previously, they were just gone.

What You Need

The setup uses a few AWS services together:

  1. Auto Scaling Group (you probably already have this)
  2. IAM Role for the EC2 instance
  3. EventBridge Rule (EventBridge is the current name for CloudWatch Events, in case you were confused)
  4. Lambda Function
  5. IAM Role for Lambda
  6. SSM Document
  7. Run Command via Systems Manager
  8. Optional: SNS for notifications
  9. Optional: Load Balancer (if you’re not already using one)

It’s simpler than it looks at first. Most of these you might already have.

How It Works

Here’s the flow:

  1. Instance fails health check and gets marked unhealthy
  2. ASG initiates termination but pauses thanks to the Lifecycle Hook
  3. EventBridge catches the termination event
  4. EventBridge triggers a Lambda function
  5. Lambda runs a script on the instance via SSM Run Command
  6. The script tars up the logs and uploads to S3
  7. Script sends optional SNS notification
  8. Script calls CompleteLifecycleAction to tell ASG to finish termination

The Lambda function is written in Python using boto3. It checks if the SSM document exists, sends the command to the instance, then monitors the command status. If something fails, it still completes the lifecycle action so you’re not stuck with orphaned instances.
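To make that flow concrete, here is a minimal sketch of such a handler. It is not the code from the repository. The SSM parameter names (S3Bucket, AsgName, HookName, InstanceId, Region) are hypothetical and have to match whatever your document declares, and the extra ssm, autoscaling and env keyword arguments are only there so the function can be exercised without AWS. Status monitoring is left out to keep it short.

```python
import os


def _client(name):
    """Create a real AWS client lazily; boto3 ships with the Lambda Python runtimes."""
    import boto3
    return boto3.client(name)


def complete_action(autoscaling, detail, result="CONTINUE"):
    """Tell the ASG to stop waiting and finish the termination."""
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult=result,
        InstanceId=detail["EC2InstanceId"],
    )


def handler(event, context, ssm=None, autoscaling=None, env=None):
    ssm = ssm if ssm is not None else _client("ssm")
    autoscaling = autoscaling if autoscaling is not None else _client("autoscaling")
    env = os.environ if env is None else env
    detail = event["detail"]  # lifecycle fields sent by Auto Scaling via EventBridge
    try:
        # Raises if the document is missing or not accessible.
        ssm.describe_document(Name=env["SSM_DOCUMENT_NAME"])
        response = ssm.send_command(
            InstanceIds=[detail["EC2InstanceId"]],
            DocumentName=env["SSM_DOCUMENT_NAME"],
            # Hypothetical parameter names; SSM expects each value as a list of strings.
            Parameters={
                "S3Bucket": [env["S3BUCKET"]],
                "AsgName": [detail["AutoScalingGroupName"]],
                "HookName": [detail["LifecycleHookName"]],
                "InstanceId": [detail["EC2InstanceId"]],
                "Region": [event["region"]],
            },
        )
        return {"status": "sent", "command_id": response["Command"]["CommandId"]}
    except Exception as exc:
        # Nothing will run on the instance, so release it from the wait state here.
        complete_action(autoscaling, detail)
        return {"status": "failed", "error": str(exc)}
```

On the happy path the script on the instance completes the lifecycle action. The except branch only covers the case where the command never got sent.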

Setting Up the IAM Role for EC2

You’ll need a role that allows the instance to:

  • Call autoscaling:CompleteLifecycleAction
  • Publish to SNS (if using notifications)
  • Upload to S3
{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Action": [
       "autoscaling:CompleteLifecycleAction",
       "sns:Publish",
       "s3:*"
     ],
     "Effect": "Allow",
     "Resource": "*"
   }
 ]
}

Also attach AmazonSSMManagedInstanceCore so the instance works with Systems Manager (AWS recommends it over the older AmazonEC2RoleforSSM policy). And make sure the instance can actually write to your S3 bucket, or your logs won’t go anywhere.
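The policy above grants s3:* on every resource, which is more than a log upload needs. If you want something tighter, the sketch below builds a narrower version. The bucket name, key prefix and topic ARN are placeholders, and I am assuming a plain upload only needs s3:PutObject; adjust if your script does more.

```python
import json


def instance_policy(bucket, prefix="", topic_arn=None):
    """Build a narrower instance policy than the wildcard version."""
    key_prefix = prefix.strip("/")
    path = f"{key_prefix}/*" if key_prefix else "*"
    statements = [
        {
            "Effect": "Allow",
            "Action": "autoscaling:CompleteLifecycleAction",
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            # Only objects under the chosen prefix of the log bucket.
            "Resource": f"arn:aws:s3:::{bucket}/{path}",
        },
    ]
    if topic_arn:
        # Only needed when the script sends SNS notifications.
        statements.append(
            {"Effect": "Allow", "Action": "sns:Publish", "Resource": topic_arn}
        )
    return {"Version": "2012-10-17", "Statement": statements}


# Placeholder bucket and prefix, for illustration only.
print(json.dumps(instance_policy("my-log-bucket", "asg-logs"), indent=2))
```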

Setting Up the Lambda Role

Create a role for Lambda with these policies:

  • AWSLambdaBasicExecutionRole
  • AmazonSSMFullAccess

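If you’re using SNS notifications, add an inline policy allowing sns:Publish. Because the function completes the lifecycle action itself when the command fails, the role also needs autoscaling:CompleteLifecycleAction, which neither managed policy above includes.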
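If AmazonSSMFullAccess feels too broad, a scoped inline policy is an option. The sketch below is my guess at the minimum for the flow described in this post (check the document, send the command, read its status, complete the lifecycle action on failure), so treat the action list as an assumption rather than a verified minimum. The topic ARN is a placeholder. You still want AWSLambdaBasicExecutionRole for logging.

```python
import json

# Trust policy: lets the Lambda service assume the role.
LAMBDA_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def lambda_inline_policy(topic_arn=None):
    """Permissions the function itself uses, instead of full SSM access."""
    statements = [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:DescribeDocument",
                "ssm:SendCommand",
                "ssm:GetCommandInvocation",
                "autoscaling:CompleteLifecycleAction",
            ],
            "Resource": "*",  # kept wide for brevity; narrow it to your resources
        }
    ]
    if topic_arn:
        statements.append(
            {"Effect": "Allow", "Action": "sns:Publish", "Resource": topic_arn}
        )
    return {"Version": "2012-10-17", "Statement": statements}


# Placeholder topic ARN, for illustration only.
print(json.dumps(lambda_inline_policy("arn:aws:sns:us-east-1:123456789012:asg-alerts"), indent=2))
```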

Creating the Lifecycle Hook

In the AWS Console, find your Auto Scaling Group and go to the “Instance Management” tab. Click “Create lifecycle hook.”

Give it a name. For “Lifecycle transition,” pick “Instance terminate” to catch termination events. Set “Default result” to “CONTINUE” so termination goes ahead when the timeout expires, even if your script never reports back.

Heartbeat Timeout Matters

This is where people mess up. The heartbeat timeout determines how long the instance stays in the wait state while your script runs, and it can be set anywhere from 30 to 7200 seconds. If you’re backing up large log files, you need more time. Set it too low and the instance terminates before your upload finishes.

I learned this the hard way.
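If you would rather script the hook than click through the console, this is roughly what it looks like with boto3. The function names, the 900-second default and the hook and group names in the usage comment are my own choices for illustration. The second function shows how a long-running backup can ask for more time instead of relying on one big timeout.

```python
def create_termination_hook(autoscaling, asg_name, hook_name, heartbeat_timeout=900):
    """Create (or update) a hook that pauses instances before termination."""
    if not 30 <= heartbeat_timeout <= 7200:
        raise ValueError("heartbeat timeout must be between 30 and 7200 seconds")
    autoscaling.put_lifecycle_hook(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleTransition="autoscaling:EC2_INSTANCE_TERMINATING",
        HeartbeatTimeout=heartbeat_timeout,
        DefaultResult="CONTINUE",
    )


def extend_wait(autoscaling, asg_name, hook_name, instance_id):
    """Restart the timeout clock for one instance that is still uploading."""
    autoscaling.record_lifecycle_action_heartbeat(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        InstanceId=instance_id,
    )


# Usage (needs AWS credentials):
#   import boto3
#   create_termination_hook(boto3.client("autoscaling"), "your-asg-name", "save-logs")
```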

Creating the SSM Document

Systems Manager Documents define what commands run on your instances. AWS provides pre-built documents, but you’ll create your own for this task.

Go to Systems Manager > Documents > Create document. Give it a name and paste the document content. The document should accept parameters like your S3 bucket name and the log paths you want to back up.

Our document defines the steps: find the logs, create a tar archive, upload to S3, optionally notify via SNS, then signal completion.
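For a rough idea of the shape, here is a sketch that builds a Command document as a Python dict and registers it. It is not the document from the repository. The parameter names, the default log path and the archive location are all made up for illustration, the SNS step is left out, and it assumes the AWS CLI is installed on the instance.

```python
import json


def build_log_backup_document():
    """Return a schema 2.2 Command document as a plain dict."""
    commands = [
        "#!/bin/bash",
        "# No 'set -e': the lifecycle action is completed even if tar or the upload fails.",
        'ARCHIVE="/tmp/{{ InstanceId }}-logs.tar.gz"',
        'tar -czf "$ARCHIVE" {{ LogPaths }}',
        'aws s3 cp "$ARCHIVE" "s3://{{ S3Bucket }}/{{ InstanceId }}/" --region {{ Region }}',
        "aws autoscaling complete-lifecycle-action"
        " --lifecycle-hook-name {{ HookName }}"
        " --auto-scaling-group-name {{ AsgName }}"
        " --instance-id {{ InstanceId }}"
        " --lifecycle-action-result CONTINUE"
        " --region {{ Region }}",
    ]
    required = ["S3Bucket", "AsgName", "HookName", "InstanceId", "Region"]
    parameters = {name: {"type": "String", "description": name} for name in required}
    # Hypothetical default; point this at the logs you care about.
    parameters["LogPaths"] = {"type": "String", "default": "/var/log"}
    return {
        "schemaVersion": "2.2",
        "description": "Archive logs to S3, then complete the lifecycle action",
        "parameters": parameters,
        "mainSteps": [
            {
                "action": "aws:runShellScript",
                "name": "backupLogs",
                "inputs": {"runCommand": commands},
            }
        ],
    }


def register_document(ssm, name):
    """Create the document in Systems Manager; `ssm` is a boto3 SSM client."""
    return ssm.create_document(
        Content=json.dumps(build_log_backup_document()),
        Name=name,
        DocumentType="Command",
        DocumentFormat="JSON",
    )
```

The completion call comes last and runs regardless of earlier failures, so a broken upload costs you the logs but not an hour of waiting on the timeout.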

Creating the Lambda Function

Create a new Lambda function with a Python 3.9 or later runtime. Paste the handler code from the GitHub repository.

Set these environment variables:

  • S3BUCKET: Your bucket name
  • SNSTARGET: ARN of your SNS topic (optional)
  • SSM_DOCUMENT_NAME: The document you created above

Creating the EventBridge Rule

Go to EventBridge > Rules > Create rule.

Choose “Rule with an event pattern.” For the event source, select “Other.” Use this pattern:

{
  "source": ["aws.autoscaling"],
  "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
  "detail": {
    "AutoScalingGroupName": ["your-asg-name"]
  }
}

Attach your Lambda function as the target.
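The same rule can be created from code. One thing the console does quietly for you is grant EventBridge permission to invoke the function; when scripting it you have to add that yourself. In this sketch the rule name, target Id and statement Id are arbitrary values I picked.

```python
import json


def termination_pattern(asg_name):
    """The event pattern from this section, for one Auto Scaling Group."""
    return {
        "source": ["aws.autoscaling"],
        "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
        "detail": {"AutoScalingGroupName": [asg_name]},
    }


def create_rule(events, lambda_client, rule_name, asg_name, function_arn):
    """Create the rule, point it at the function, and allow the invocation."""
    rule_arn = events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(termination_pattern(asg_name)),
        State="ENABLED",
    )["RuleArn"]
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "log-backup-lambda", "Arn": function_arn}],
    )
    # Without this resource-based permission the rule matches but Lambda never runs.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    return rule_arn


# Usage (needs AWS credentials):
#   import boto3
#   create_rule(boto3.client("events"), boto3.client("lambda"),
#               "asg-terminate-logs", "your-asg-name", "<your function ARN>")
```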

Testing It

Set your ASG’s Desired Capacity and Minimum Capacity to 0 (on a test group, not production). This forces termination of all instances, which triggers the lifecycle hook. Watch the “Instance Management” tab: you’ll see instances go into the “Terminating:Wait” state.

Check CloudWatch Logs for Lambda output. Then verify your S3 bucket has the uploaded log files. You can also check the Run Command history in the Systems Manager console to see if the commands executed properly.
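The same checks can be run from a script, which is handy if you test this more than once. A minimal sketch, with helper names of my own; each function takes an already-created boto3 client. Note that list_objects_v2 returns at most 1000 keys per call, which is plenty for a smoke test.

```python
def scale_to_zero(autoscaling, asg_name):
    """Force every instance in the group to terminate. Use a test group."""
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name, MinSize=0, DesiredCapacity=0
    )


def instances_waiting(autoscaling, asg_name):
    """Instance IDs currently paused by the termination hook."""
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    return [
        instance["InstanceId"]
        for group in groups
        for instance in group["Instances"]
        if instance["LifecycleState"] == "Terminating:Wait"
    ]


def uploaded_keys(s3, bucket, prefix=""):
    """Object keys under a prefix, to confirm the archives arrived."""
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in response.get("Contents", [])]
```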

Troubleshooting

“Not authorized to perform: autoscaling:CompleteLifecycleAction”

This means your EC2 instance role is missing permissions. Add the IAM policy from earlier to the instance’s role.

Run Command failing

Check CloudWatch Logs from the Lambda function. Also look at SSM Command History in Systems Manager—you can see the exact parameters that were passed and any error messages.
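If you prefer a terminal over the console, the same details are available through the API. This is a small sketch with hypothetical helper names; the command ID is the one send_command returned, which the Lambda would need to log for this to be useful.

```python
def command_report(ssm, command_id, instance_id):
    """Status, exit code and output of one Run Command invocation."""
    invocation = ssm.get_command_invocation(
        CommandId=command_id, InstanceId=instance_id
    )
    return {
        "status": invocation["Status"],
        "exit_code": invocation.get("ResponseCode"),
        "stdout": invocation.get("StandardOutputContent", ""),
        "stderr": invocation.get("StandardErrorContent", ""),
    }


def recent_commands(ssm, instance_id, limit=5):
    """(command ID, document name, status) for commands sent to an instance."""
    response = ssm.list_commands(InstanceId=instance_id, MaxResults=limit)
    return [
        (command["CommandId"], command["DocumentName"], command["Status"])
        for command in response["Commands"]
    ]
```

Keep in mind the instance is gone shortly after termination completes, so the stderr captured here is often the only copy of the error you will get.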

Lifecycle hook not firing

Verify your EventBridge rule is attached to the right event bus and that the ASG name in your filter matches exactly.

What Else You Can Do

Lifecycle Hooks aren’t just for logs. I’ve used them to:

  • Gracefully drain GitLab Runners before termination
  • Deregister instances from service registries
  • Snapshot databases before shutdown
  • Move persistent data to another instance

The possibilities are wider than most articles suggest. Once you understand the pattern—pause, do something, resume—you can adapt it to all kinds of scenarios.

The Terraform Shortcut

If you don’t want to click through all these steps manually, there’s a Terraform module that deploys this entire setup. You just provide the ASG name and S3 bucket. It takes about 30 seconds to provision instead of an hour of console work. Check the GitHub repo linked in the post for details.

Conclusion

Auto Scaling Lifecycle Hooks, Lambda, and Run Command together let you react to scaling events in ways AWS doesn’t support out of the box. It’s not glamorous, but it’s saved me more than once when instances died unexpectedly and I needed to figure out why.

The setup takes a bit of effort, but once it’s running, you stop losing logs to terminations. That’s worth it in my book.
