AWS Macie: Find PII in S3 Before Regulators Do

Bits Lovers
Written by Bits Lovers on

When a fintech company discovered in late 2023 that 14 months of customer transaction exports — including names, account numbers, and partial SSNs — had been sitting in a public S3 bucket, the bucket wasn’t new. It had been created by a data team for a one-time reporting job and never cleaned up. Nobody knew it existed. The regulatory notification cost significantly more than the Macie subscription would have. That story repeats across industries with uncomfortable regularity.

AWS Macie uses machine learning and pattern matching to scan S3 buckets for sensitive data. It finds PII, financial data, health records, API credentials, and whatever custom data formats you define. It also continuously monitors bucket configurations and flags misconfigurations like public access or missing encryption. This guide covers what Macie actually detects, how to structure discovery jobs, custom identifiers for proprietary data, and the cost model.

What Macie Detects

Macie has two finding categories. Policy findings come from continuous monitoring of your S3 bucket configurations — no active scanning required. Sensitive data findings come from discovery jobs that inspect object content.

Policy finding types:

  • Policy:IAMUser/S3BlockPublicAccessDisabled — Block Public Access was turned off on a bucket
  • Policy:IAMUser/S3BucketEncryptionDisabled — default encryption removed from a bucket
  • Policy:IAMUser/S3BucketPublic — bucket ACL or policy makes it publicly readable/writable
  • Policy:IAMUser/S3BucketSharedExternally — bucket policy grants access to an external account
  • Policy:IAMUser/S3BucketReplicatedExternally — replication sends objects to an external account

These fire within minutes of a configuration change. You don’t need to run a discovery job to catch someone accidentally making a bucket public.
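
Because policy findings exist without any discovery job, they can be pulled programmatically as soon as Macie is enabled. The following is a minimal sketch using the boto3 `list_findings` and `get_findings` calls; the function name `recent_policy_findings` and the limit of 25 are illustrative choices, and the client is passed in so the function can be exercised without AWS access.

```python
def recent_policy_findings(macie_client, limit=25):
    """Return (finding type, bucket name) pairs for the newest policy findings.

    macie_client is expected to be boto3.client("macie2") or a stand-in
    exposing list_findings and get_findings.
    """
    listed = macie_client.list_findings(
        # Restrict results to policy findings (as opposed to CLASSIFICATION)
        findingCriteria={"criterion": {"category": {"eq": ["POLICY"]}}},
        sortCriteria={"attributeName": "updatedAt", "orderBy": "DESC"},
        maxResults=limit,
    )
    finding_ids = listed.get("findingIds", [])
    if not finding_ids:
        return []

    details = macie_client.get_findings(findingIds=finding_ids)
    return [
        (f["type"], f["resourcesAffected"]["s3Bucket"]["name"])
        for f in details.get("findings", [])
    ]


# Usage (needs boto3 and Macie read permissions):
#   import boto3
#   for finding_type, bucket in recent_policy_findings(boto3.client("macie2")):
#       print(finding_type, bucket)
```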

Sensitive data finding types identify categories of sensitive information found in object content:

  • SensitiveData:S3Object/Credentials — API keys, secret keys, access tokens, passwords
  • SensitiveData:S3Object/Financial — credit card numbers, bank account numbers, routing numbers
  • SensitiveData:S3Object/Personal — names + SSNs, passport numbers, driver’s licenses, health info
  • SensitiveData:S3Object/Multiple — multiple categories found in the same object
  • SensitiveData:S3Object/CustomIdentifier — matches a custom data identifier you defined
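
The type strings in both lists share one shape: a family, a resource, and a specific name. A small parser makes it easy to route findings by family in downstream code. This is a sketch; `parse_finding_type` is a hypothetical helper, not part of any AWS SDK.

```python
def parse_finding_type(finding_type):
    """Split a Macie finding type such as 'Policy:IAMUser/S3BucketPublic'."""
    family, _, rest = finding_type.partition(":")
    resource, _, name = rest.partition("/")
    return {
        "family": family,        # "Policy" or "SensitiveData"
        "resource": resource,    # "IAMUser" or "S3Object"
        "name": name,            # for example "S3BucketPublic" or "Financial"
        # Only sensitive data findings come from discovery jobs
        "needs_discovery_job": family == "SensitiveData",
    }


for example in ("Policy:IAMUser/S3BucketPublic", "SensitiveData:S3Object/Financial"):
    print(parse_finding_type(example))
```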

Macie ships with over 100 managed data identifiers covering data types for 30+ countries. US SSNs, UK National Insurance numbers, EU passport numbers, Australian TFNs — all detected out of the box without configuration.

Enabling Macie

# Enable Macie in your account
aws macie2 enable-macie \
  --finding-publishing-frequency FIFTEEN_MINUTES

# Designate a delegated Macie administrator for an AWS Organization
# (run from the organization's management account)
aws macie2 enable-organization-admin-account \
  --admin-account-id 999999999999

# Auto-enable Macie for new organization member accounts
# (run from the delegated administrator account)
aws macie2 update-organization-configuration \
  --auto-enable

# Check status
aws macie2 get-macie-session

Once enabled, Macie immediately starts monitoring bucket configurations and generates policy findings within minutes. It builds an inventory of all S3 buckets in your account — bucket names, access controls, encryption settings, replication configuration, object counts, storage size.
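
The inventory can be queried directly, which is a quick way to spot exposed buckets before any discovery job runs. The sketch below assumes the boto3 `describe_buckets` paginator and the `publicAccess.effectivePermission` field in its response; `find_public_buckets` is an illustrative name, and the client is injected so the logic can be tested offline.

```python
def find_public_buckets(macie_client):
    """Return (bucket name, object count) for buckets Macie reports as public."""
    public = []
    paginator = macie_client.get_paginator("describe_buckets")
    for page in paginator.paginate():
        for bucket in page.get("buckets", []):
            permission = bucket.get("publicAccess", {}).get("effectivePermission")
            if permission == "PUBLIC":
                public.append((bucket["bucketName"], bucket.get("objectCount", 0)))
    return public


# Usage (needs boto3 and permission for macie2:DescribeBuckets):
#   import boto3
#   print(find_public_buckets(boto3.client("macie2")))
```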

Running Sensitive Data Discovery Jobs

The bucket inventory is passive. Sensitive data discovery requires a job that actively reads object content. You can run one-time jobs or scheduled jobs:

# One-time job scanning specific buckets
aws macie2 create-classification-job \
  --job-type ONE_TIME \
  --name "CustomerDataAudit-2026-06" \
  --s3-job-definition '{
    "bucketDefinitions": [
      {
        "accountId": "123456789012",
        "buckets": ["customer-exports", "data-warehouse-exports", "analytics-output"]
      }
    ],
    "scoping": {
      "includes": {
        "and": [
          {
            "simpleScopeTerm": {
              "comparator": "GT",
              "key": "OBJECT_SIZE",
              "values": ["0"]
            }
          }
        ]
      }
    }
  }' \
  --managed-data-identifier-selector ALL

For ongoing discovery, use SCHEDULED jobs that run daily or weekly on your most sensitive buckets. Scanning everything continuously is expensive — prioritize buckets that receive customer data, application logs, database exports, or audit data:

# Scheduled job running weekly on sensitive buckets
aws macie2 create-classification-job \
  --job-type SCHEDULED \
  --schedule-frequency '{"weeklySchedule":{"dayOfWeek":"MONDAY"}}' \
  --name "WeeklySensitiveBucketScan" \
  --s3-job-definition '{
    "bucketDefinitions": [
      {
        "accountId": "123456789012",
        "buckets": ["prod-customer-data", "audit-logs", "compliance-exports"]
      }
    ]
  }' \
  --managed-data-identifier-selector ALL

The --managed-data-identifier-selector option accepts ALL (every managed identifier), RECOMMENDED (the set AWS recommends for jobs), NONE (custom identifiers only), or INCLUDE/EXCLUDE. The last two take a list of specific identifier IDs, passed through --managed-data-identifier-ids.
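
For the INCLUDE case, the request has to carry the identifier IDs alongside the selector. Here is a sketch of the equivalent boto3 request, built as a plain dictionary so it can be inspected before anything is sent. `build_job_request` and the job name are hypothetical, and the three identifier IDs are examples that should be checked against the managed data identifier reference.

```python
def build_job_request(name, account_id, buckets, identifier_ids, day_of_week="MONDAY"):
    """Build keyword arguments for macie2.create_classification_job.

    Produces a weekly SCHEDULED job limited to the listed managed identifiers.
    """
    return {
        "jobType": "SCHEDULED",
        "name": name,
        "scheduleFrequency": {"weeklySchedule": {"dayOfWeek": day_of_week}},
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(buckets)}
            ]
        },
        "managedDataIdentifierSelector": "INCLUDE",
        "managedDataIdentifierIds": list(identifier_ids),
    }


request = build_job_request(
    name="WeeklyFinancialScan",  # illustrative job name
    account_id="123456789012",
    buckets=["prod-customer-data"],
    identifier_ids=["CREDIT_CARD_NUMBER", "USA_SOCIAL_SECURITY_NUMBER", "AWS_CREDENTIALS"],
)
print(request["managedDataIdentifierSelector"], request["managedDataIdentifierIds"])

# To submit (needs boto3 and Macie permissions):
#   import boto3
#   boto3.client("macie2").create_classification_job(**request)
```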

Custom Data Identifiers

Managed identifiers cover standard PII formats. Custom data identifiers let you define patterns for proprietary data — internal employee IDs, policy numbers, account formats specific to your business:

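# Create a custom identifier for an internal account format: ACC-XXXXXXXX
aws macie2 create-custom-data-identifier \
  --name "InternalAccountNumber" \
  --description "Matches internal account format ACC-XXXXXXXX" \
  --regex "ACC-[0-9]{8}" \
  --keywords '["account", "customer"]' \
  --maximum-match-distance 50 \
  --ignore-words '["ACC-00000000", "ACC-99999999"]'

# --keywords: a regex match counts only if one of these keywords appears
# within 50 characters of it, which reduces false positives.
# "ACC" is left out on purpose: it is part of every match, so as a
# keyword it would add no filtering.
# --ignore-words: matches that contain any of these strings are dropped,
# so list values the regex can actually match (example placeholder IDs here).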

The keywords field is the most important tuning mechanism. Without it, a regex for an 8-digit number format would match far too broadly. Requiring a nearby keyword like “account” or “customer” dramatically reduces false positives.

Test your custom identifier before running a full job:

# Test against sample text
aws macie2 test-custom-data-identifier \
  --regex "ACC-[0-9]{8}" \
  --keywords '["account"]' \
  --maximum-match-distance 50 \
  --sample-text "Customer account ACC-12345678 was created on 2026-01-15"

The response shows how many matches Macie would find in that sample text. Iterate on the regex and keyword list until it matches correctly before running a discovery job.
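
A rough local check can catch obvious regex and keyword mistakes before any API call. The sketch below approximates the keyword proximity rule with Python's `re` module. It is not Macie's matching engine: it assumes a keyword must end at or before the end of the match, within the maximum distance, and Python regex syntax may differ from what Macie accepts. `count_local_matches` is a hypothetical helper.

```python
import re


def _keyword_nearby(lowered_text, match_end, keywords, max_distance):
    """True if any keyword ends within max_distance characters before match_end."""
    for keyword in keywords:
        needle = keyword.lower()
        start = lowered_text.find(needle)
        while start != -1:
            gap = match_end - (start + len(needle))
            if 0 <= gap <= max_distance:
                return True
            start = lowered_text.find(needle, start + 1)
    return False


def count_local_matches(text, regex, keywords=(), max_distance=50, ignore_words=()):
    """Count regex matches that pass the ignore-word and keyword checks."""
    lowered = text.lower()
    count = 0
    for match in re.finditer(regex, text):
        matched_text = match.group(0).lower()
        # Drop matches that contain an ignore word
        if any(word.lower() in matched_text for word in ignore_words):
            continue
        # With no keywords configured, every remaining match counts
        if not keywords or _keyword_nearby(lowered, match.end(), keywords, max_distance):
            count += 1
    return count


sample = "Customer account ACC-12345678 was created on 2026-01-15"
print(count_local_matches(sample, r"ACC-[0-9]{8}", keywords=["account"]))
```

Treat the result as a sanity check only, and confirm the final pattern with test-custom-data-identifier.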

Reviewing Findings

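Findings appear in the Macie console and are published to EventBridge. New findings are published as soon as Macie finishes processing them; the publishing frequency (15 minutes in the setup above, configurable) governs how often updates to existing policy findings are sent. Each sensitive data finding includes: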

  • The S3 bucket and object path where sensitive data was found
  • The category and types of sensitive data detected
  • Occurrence locations (Macie records where in the object the data appears, for up to 15 occurrences)
  • Severity based on the number of occurrences

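Retrieving the sample values themselves requires an additional API call. They are not included in the finding, which avoids copying sensitive data into logs and downstream tools. The call takes a finding ID, not a job ID, and retrieval must first be enabled in the account's reveal configuration:

# Get sample occurrences for a finding
FINDING_ID="abc123def456"

aws macie2 get-sensitive-data-occurrences \
  --finding-id "$FINDING_ID"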

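The finding itself records only where the data sits (line ranges, character offsets, or cell references), which is enough to locate it without storing the sensitive content in the finding. The retrieved samples are short excerpts of the actual values, so restrict who is allowed to run this command. The call is asynchronous: the response reports a status of PROCESSING until the samples are ready.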
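
In boto3, the retrieval call returns a `status` field that moves from PROCESSING to SUCCESS or ERROR, so scripted use needs a small polling loop. The following is a sketch under that assumption; `fetch_samples`, the attempt count, and the delay are illustrative, and the client is passed in so the loop can be tested with a stand-in.

```python
import time


def fetch_samples(macie_client, finding_id, attempts=10, delay_seconds=2.0):
    """Poll get_sensitive_data_occurrences until samples are ready.

    Returns the sensitiveDataOccurrences mapping on success.
    """
    for _ in range(attempts):
        response = macie_client.get_sensitive_data_occurrences(findingId=finding_id)
        status = response.get("status")
        if status == "SUCCESS":
            return response.get("sensitiveDataOccurrences", {})
        if status == "ERROR":
            raise RuntimeError(response.get("error", "Macie could not retrieve samples"))
        # Still PROCESSING: wait and ask again
        time.sleep(delay_seconds)
    raise TimeoutError(f"Samples for {finding_id} not ready after {attempts} attempts")


# Usage (needs boto3, reveal configuration enabled, and suitable permissions):
#   import boto3
#   samples = fetch_samples(boto3.client("macie2"), "abc123def456")
```

The returned values are real excerpts of sensitive data, so avoid printing or logging them in shared environments.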

Automating Response to Sensitive Data Findings

Wire findings to EventBridge for automated notification or remediation:

import json

import boto3

# Create clients once per execution environment, not on every invocation
s3 = boto3.client('s3')
sns = boto3.client('sns')


def handler(event, context):
    finding = event['detail']
    finding_type = finding.get('type', '')
    bucket_name = finding['resourcesAffected']['s3Bucket']['name']

    # Handle public bucket policy findings
    if 'BucketPublic' in finding_type or 'BlockPublicAccessDisabled' in finding_type:
        s3.put_public_access_block(
            Bucket=bucket_name,
            PublicAccessBlockConfiguration={
                'BlockPublicAcls': True,
                'IgnorePublicAcls': True,
                'BlockPublicPolicy': True,
                'RestrictPublicBuckets': True
            }
        )
        print(f"Auto-remediated: blocked public access on {bucket_name}")

    # Notify on sensitive data in unexpected buckets
    if finding_type.startswith('SensitiveData'):
        severity = finding['severity']['description']
        sensitive_data = (
            finding.get('classificationDetails', {})
                   .get('result', {})
                   .get('sensitiveData', [])
        )
        # Each entry is keyed by 'category' (for example FINANCIAL_INFORMATION)
        categories = [item.get('category', 'UNKNOWN') for item in sensitive_data]

        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:123456789012:security-alerts',
            # SNS subjects are limited to 100 characters
            Subject=f"Macie [{severity}] Sensitive Data in {bucket_name}"[:100],
            Message=f"Categories: {categories}\nFinding: {json.dumps(finding, indent=2, default=str)}"
        )

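For the public bucket case, automatic remediation makes sense: a publicly accessible bucket containing sensitive data is an emergency that shouldn’t wait for human review. Buckets that are public by design, such as static website hosting, need an exemption (a tag or allowlist check before the call), or the handler will lock them down too. For sensitive data findings in internal buckets, notification plus a ticket is usually the right response.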
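
The handler only runs if an EventBridge rule routes Macie findings to it. Below is a sketch of an event pattern builder, assuming the `aws.macie` source and the `Macie Finding` detail type that Macie uses for its events; `macie_event_pattern` and the rule name in the usage comment are hypothetical.

```python
import json


def macie_event_pattern(severities=None, type_prefixes=None):
    """Build an EventBridge event pattern that matches Macie findings.

    severities: severity descriptions to match, for example ["High"].
    type_prefixes: finding type prefixes, for example ["SensitiveData:"].
    """
    pattern = {"source": ["aws.macie"], "detail-type": ["Macie Finding"]}
    detail = {}
    if severities:
        detail["severity"] = {"description": list(severities)}
    if type_prefixes:
        # EventBridge prefix matching on the finding type string
        detail["type"] = [{"prefix": prefix} for prefix in type_prefixes]
    if detail:
        pattern["detail"] = detail
    return pattern


pattern = macie_event_pattern(severities=["High"], type_prefixes=["SensitiveData:", "Policy:"])
print(json.dumps(pattern, indent=2))

# To create the rule (needs boto3 and events:PutRule):
#   import boto3
#   boto3.client("events").put_rule(
#       Name="macie-findings", EventPattern=json.dumps(pattern)
#   )
```

Filtering on severity in the rule keeps low-severity findings from invoking the function at all, which is cheaper than discarding them inside the handler.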

Cost Model

Macie pricing has two main components, shown here at US East list prices; check the pricing page for your Region. Bucket inventory and monitoring: $0.10 per bucket per month, covering continuous policy finding evaluation. Sensitive data discovery: $1.00 per GB of object data scanned, with lower per-GB rates at very high monthly volumes.

A typical mid-sized account with 50 buckets costs $5/month for the monitoring component. Running discovery jobs against 100 GB of data adds $100. Total: roughly $105/month for continuous monitoring plus reasonable monthly scan coverage. Discovery volume, not bucket count, dominates the bill.

The cost scales directly with the number of buckets and data volume you scan. To manage costs:

  • Use bucket tags or criteria in job definitions to exclude low-risk buckets (logs-only, temp data, public content)
  • Run discovery jobs on a sampling basis for large buckets — Macie’s sampling option scans a percentage of objects rather than everything
  • Use RECOMMENDED managed identifiers instead of ALL to reduce false positives and keep finding volume manageable
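
The effect of sampling on the discovery bill can be estimated with simple arithmetic. This sketch assumes the $1.00 per GB rate quoted above and that scanned volume scales linearly with the sampling percentage. That is an approximation, since sampling selects objects rather than bytes. `estimate_scan_cost` is a hypothetical helper.

```python
def estimate_scan_cost(total_gb, sampling_percentage=100, rate_per_gb=1.00):
    """Estimate discovery cost in dollars for one scan of a bucket.

    sampling_percentage mirrors the job's sampling option (1 to 100).
    """
    if not 1 <= sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be between 1 and 100")
    scanned_gb = total_gb * sampling_percentage / 100
    return round(scanned_gb * rate_per_gb, 2)


# Compare full and sampled scans of a 2,000 GB bucket
for percentage in (100, 25, 10):
    print(f"{percentage}% sample: ${estimate_scan_cost(2000, percentage):,.2f}")
```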

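The 30-day free trial covers bucket inventory and policy monitoring. Discovery coverage during the trial is narrower, and AWS documentation indicates that discovery jobs are billed from the start, so confirm the current terms on the pricing page before launching a large scan. Run the trial, look at the cost estimate in the Macie console, and scope your ongoing configuration from there.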

Macie works best as part of a broader security posture. The AWS Security Hub guide covers how Macie findings flow into Security Hub for unified visibility alongside findings from GuardDuty and Inspector. For the S3 bucket policies and encryption settings that Macie monitors for drift, the AWS IAM roles and policies guide covers the IAM side of locking down S3 access.

The specific buckets worth scanning first: application log buckets (often accidentally contain request bodies with PII), database export buckets (CSV exports from production databases), analytics staging buckets (where raw event data lands before processing), and any bucket with “backup” or “archive” in the name.
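
That shortlist can be approximated from bucket names alone as a first pass. The sketch below ranks names by how many risk hints they contain; the hint list is taken from the categories above, and `prioritize_buckets` is a hypothetical helper. Name matching is a heuristic, so review the result against what the buckets actually hold.

```python
PRIORITY_HINTS = ("backup", "archive", "export", "log", "staging")


def prioritize_buckets(bucket_names, hints=PRIORITY_HINTS):
    """Return bucket names containing a hint, most hints first, then alphabetical."""
    def score(name):
        lowered = name.lower()
        return sum(hint in lowered for hint in hints)

    flagged = [name for name in bucket_names if score(name) > 0]
    return sorted(flagged, key=lambda name: (-score(name), name))


names = ["static-assets", "db-export-backup", "app-logs", "analytics-staging"]
print(prioritize_buckets(names))
```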
