AWS Glue ETL Guide: Data Catalog, Crawlers, and Spark-Based Transformations

Written by Bits Lovers

AWS Glue is a managed ETL service built on Apache Spark, combined with a metadata catalog that integrates across Athena, EMR, Redshift Spectrum, and Lake Formation. The two parts are separable: you can use the Glue Data Catalog as your metadata layer without running a single Glue ETL job (Athena and Redshift Spectrum use it as their table catalog by default), or you can run Glue ETL jobs that write to S3 without using the catalog at all. Understanding which part solves your problem prevents the most common Glue mistake: running heavy Spark jobs on data small enough to process in Lambda with Pandas in 2 seconds.

This guide covers the Data Catalog and crawlers, Glue ETL job patterns in PySpark, job bookmarks for incremental processing, and a framework for deciding when Glue is the right tool.

The Glue Data Catalog

The Data Catalog stores table definitions (schema, location, partition information) in a Hive-compatible format. Athena queries it to understand your S3 data structure. Redshift Spectrum reads external table definitions from it. EMR can use it as the Hive metastore replacement.

# Create a Glue database (logical namespace for tables)
aws glue create-database \
  --database-input '{
    "Name": "analytics",
    "Description": "Analytics data lake tables"
  }'

# Create a table manually (or let a crawler do this)
aws glue create-table \
  --database-name analytics \
  --table-input '{
    "Name": "events",
    "StorageDescriptor": {
      "Columns": [
        {"Name": "event_id", "Type": "string"},
        {"Name": "user_id", "Type": "string"},
        {"Name": "event_type", "Type": "string"},
        {"Name": "timestamp", "Type": "bigint"},
        {"Name": "properties", "Type": "map<string,string>"}
      ],
      "Location": "s3://my-data-lake/events/",
      "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "SerdeInfo": {
        "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
      }
    },
    "PartitionKeys": [
      {"Name": "year", "Type": "string"},
      {"Name": "month", "Type": "string"},
      {"Name": "day", "Type": "string"}
    ],
    "TableType": "EXTERNAL_TABLE"
  }'

Once a table exists in the catalog, Athena can query it immediately:

-- Athena reads schema from Glue catalog
SELECT event_type, count(*) as count
FROM analytics.events
WHERE year = '2026' AND month = '06'
GROUP BY event_type
ORDER BY count DESC;

The catalog is region-scoped but can be shared across AWS accounts using Lake Formation resource sharing.

Crawlers: Schema Discovery

Crawlers inspect your data sources, infer schemas, and create or update tables in the catalog. They handle S3 data in JSON, CSV, Parquet, ORC, and Avro formats; JDBC connections to relational databases; and DynamoDB tables.

# Create a crawler IAM role first
CRAWLER_ROLE_ARN="arn:aws:iam::123456789012:role/GlueCrawlerRole"

# Create crawler for S3 data
aws glue create-crawler \
  --name events-crawler \
  --role $CRAWLER_ROLE_ARN \
  --database-name analytics \
  --targets '{
    "S3Targets": [{
      "Path": "s3://my-data-lake/events/",
      "Exclusions": ["**/_temporary/**", "**/.spark-staging/**"]
    }]
  }' \
  --schedule '{"ScheduleExpression": "cron(0 6 * * ? *)"}' \
  --schema-change-policy '{
    "UpdateBehavior": "UPDATE_IN_DATABASE",
    "DeleteBehavior": "LOG"
  }' \
  --configuration '{
    "Version": 1.0,
    "CrawlerOutput": {
      "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
    },
    "Grouping": {
      "TableGroupingPolicy": "CombineCompatibleSchemas"
    }
  }'

# Run the crawler immediately
aws glue start-crawler --name events-crawler

# Wait for it to complete
aws glue get-crawler \
  --name events-crawler \
  --query 'Crawler.{State:State,LastCrawl:LastCrawl}'

UpdateBehavior: UPDATE_IN_DATABASE applies schema changes (new columns, type changes) to the catalog automatically. DeleteBehavior: LOG records removed tables in the crawler log without deleting them from the catalog, which prevents the crawler from destroying table definitions when an S3 path is temporarily empty or a partition is missing.

CombineCompatibleSchemas merges tables that share compatible schemas into a single catalog table. Without this, a crawler on s3://bucket/events/year=2026/ creates separate tables per partition prefix. With it, all compatible partitions become one table with partition columns.
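After a crawl, it is worth verifying what the crawler actually created before pointing Athena at it. A minimal boto3 sketch using get_table; the helper name table_summary is mine:

```python
def table_summary(glue, database, table):
    """Summarize a catalog table: columns, partition keys, and S3 location."""
    t = glue.get_table(DatabaseName=database, Name=table)["Table"]
    sd = t["StorageDescriptor"]
    return {
        "columns": [(c["Name"], c["Type"]) for c in sd["Columns"]],
        "partition_keys": [k["Name"] for k in t.get("PartitionKeys", [])],
        "location": sd["Location"],
    }

if __name__ == "__main__":
    import boto3
    glue = boto3.client("glue")
    print(table_summary(glue, "analytics", "events"))
```

If the crawler split your data into many tables instead of one partitioned table, the grouping policy above is usually the fix.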

Crawlers add cost: $0.44 per DPU-hour, billed with a 10-minute minimum per run. A crawler that finishes in 3 minutes therefore costs the same as one that runs for 10 minutes: roughly $0.15 per run at the default 2 DPUs (2 × 10/60 × $0.44). For small, static schemas, define tables manually rather than running daily crawlers; it's cheaper and more predictable.
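If you skip crawlers, you still have to register new partitions yourself as data lands, or Athena will not see them. A sketch using boto3's batch_create_partition; the storage descriptor mirrors the events table defined earlier, and the helper name partition_input is mine:

```python
def partition_input(base_location, year, month, day):
    """Build a PartitionInput for one year/month/day prefix of the events table."""
    return {
        "Values": [year, month, day],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "event_id", "Type": "string"},
                {"Name": "user_id", "Type": "string"},
                {"Name": "event_type", "Type": "string"},
                {"Name": "timestamp", "Type": "bigint"},
            ],
            "Location": f"{base_location}year={year}/month={month}/day={day}/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
        },
    }

if __name__ == "__main__":
    import boto3
    glue = boto3.client("glue")
    glue.batch_create_partition(
        DatabaseName="analytics",
        TableName="events",
        PartitionInputList=[
            partition_input("s3://my-data-lake/events/", "2026", "06", "15")
        ],
    )
```

Athena's ALTER TABLE ... ADD PARTITION and partition projection are alternatives that avoid both crawlers and explicit API calls.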

Glue ETL Jobs

Glue Spark jobs run on managed, ephemeral Spark clusters. You write a Python (PySpark) or Scala script; Glue provisions the cluster, runs the script, and tears it down.

# glue_etl_job.py
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import functions as F

# Get job parameters
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'source_database', 'source_table', 'output_path'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from Glue Data Catalog (returns a DynamicFrame)
source_dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database=args['source_database'],
    table_name=args['source_table'],
    transformation_ctx="source"  # Used by job bookmarks
)

# Convert to Spark DataFrame for richer transformation options
df = source_dynamic_frame.toDF()

# Transformations
df_transformed = df \
    .filter(F.col('event_type').isin(['purchase', 'refund'])) \
    .withColumn('date', F.from_unixtime(F.col('timestamp') / 1000, 'yyyy-MM-dd')) \
    .withColumn('amount_usd', F.col('amount').cast('double')) \
    .groupBy('date', 'event_type') \
    .agg(
        F.count('*').alias('event_count'),
        F.sum('amount_usd').alias('total_amount'),
        F.avg('amount_usd').alias('avg_amount')
    )

# Write output in Parquet format with partitioning
df_transformed.write \
    .mode('overwrite') \
    .partitionBy('date') \
    .parquet(args['output_path'])

# Commit the job (required for bookmarks to record progress)
job.commit()

Create the job with AWS CLI:

# Upload the script to S3 first
aws s3 cp glue_etl_job.py s3://my-glue-scripts/glue_etl_job.py

# Create the job
aws glue create-job \
  --name process-events \
  --role arn:aws:iam::123456789012:role/GlueJobRole \
  --command '{
    "Name": "glueetl",
    "ScriptLocation": "s3://my-glue-scripts/glue_etl_job.py",
    "PythonVersion": "3"
  }' \
  --default-arguments '{
    "--job-bookmark-option": "job-bookmark-enable",
    "--source_database": "analytics",
    "--source_table": "events",
    "--output_path": "s3://my-data-lake/aggregated/",
    "--TempDir": "s3://my-glue-temp/temporary/",
    "--enable-metrics": "",
    "--enable-continuous-cloudwatch-log": "true",
    "--enable-spark-ui": "true",
    "--spark-event-logs-path": "s3://my-glue-temp/spark-logs/"
  }' \
  --glue-version "4.0" \
  --number-of-workers 10 \
  --worker-type G.1X \
  --execution-property '{"MaxConcurrentRuns": 1}'

# Run the job
RUN_ID=$(aws glue start-job-run \
  --job-name process-events \
  --query 'JobRunId' \
  --output text)

# Monitor the run
aws glue get-job-run \
  --job-name process-events \
  --run-id $RUN_ID \
  --query 'JobRun.{State:JobRunState,Duration:ExecutionTime,DPU:MaxCapacity}'
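In a scripted pipeline, the one-shot get-job-run call above becomes a polling loop. A sketch with the Glue client passed in so the logic can be exercised without AWS credentials; the helper name wait_for_job_run is mine:

```python
import time

# Terminal JobRunState values per the Glue API
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "ERROR", "TIMEOUT", "STOPPED"}

def wait_for_job_run(glue, job_name, run_id, poll_seconds=30):
    """Poll get_job_run until the run reaches a terminal state; return that state."""
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        if run["JobRunState"] in TERMINAL_STATES:
            return run["JobRunState"]
        time.sleep(poll_seconds)

if __name__ == "__main__":
    import boto3
    glue = boto3.client("glue")
    run_id = glue.start_job_run(JobName="process-events")["JobRunId"]
    print(wait_for_job_run(glue, "process-events", run_id))
```

Raise an exception on any terminal state other than SUCCEEDED if a downstream step depends on the output.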

G.1X workers provide 4 vCPUs and 16 GB RAM each. G.2X workers provide 8 vCPUs and 32 GB; use them for memory-intensive transformations. For lower-priority batch jobs, Flex execution runs G.1X or G.2X workers on spare capacity at a reduced DPU rate (roughly a third cheaper) but may take longer to start. The small G.025X worker type (2 vCPUs, 4 GB) is available only for low-volume streaming jobs.

Job Bookmarks

Job bookmarks track which data has already been processed. Without bookmarks, every job run reads all data from the source. With bookmarks enabled, subsequent runs read only new data since the last successful run.

# Bookmark-aware source read — only reads new data
source_dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="analytics",
    table_name="raw_events",
    transformation_ctx="source"  # ctx name is how the bookmark tracks state
)

# The transformation_ctx must be consistent across runs
# If you change it, the bookmark resets

# For S3 sources with date-partitioned data, combine bookmarks with partition filters
source_dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="analytics",
    table_name="raw_events",
    push_down_predicate="(year == '2026' and month == '06')",  # Reduce data scan
    transformation_ctx="source_2026_06"
)

Bookmarks work by tracking the last-modified timestamps of S3 objects, or the values of bookmark key columns (typically a primary key or timestamp) for JDBC sources. They're not perfect: if upstream data is backdated or reprocessed in place, bookmarks won't catch it. For critical pipelines where idempotency matters, track processed state explicitly (for example, in DynamoDB) rather than relying solely on bookmarks.
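One way to track processed state explicitly: key a DynamoDB table by S3 object key and filter the input list before each run. A sketch assuming a table named glue-processed-objects with partition key object_key (both names are hypothetical):

```python
def filter_unprocessed(table, object_keys):
    """Return only the S3 keys that have no entry in the processed-state table."""
    return [k for k in object_keys
            if "Item" not in table.get_item(Key={"object_key": k})]

def mark_processed(table, object_key, run_id):
    """Record a key after a successful run so re-runs skip it."""
    table.put_item(Item={"object_key": object_key, "run_id": run_id})

if __name__ == "__main__":
    import boto3
    table = boto3.resource("dynamodb").Table("glue-processed-objects")
    todo = filter_unprocessed(table, ["events/2026/06/15/part-0.json"])
    # ... process `todo`, then record each key:
    for key in todo:
        mark_processed(table, key, run_id="jr_example")
```

Unlike bookmarks, this catches backdated files, because the check is per object rather than per modification time.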

Glue Workflows for Orchestration

Workflows chain crawlers and jobs with triggers:

# Create a workflow
aws glue create-workflow --name daily-etl-workflow

# Add an on-demand trigger to start the crawler
aws glue create-trigger \
  --name start-crawler \
  --workflow-name daily-etl-workflow \
  --type ON_DEMAND \
  --actions '[{"CrawlerName": "events-crawler"}]'

# Add a conditional trigger: run ETL job when crawler succeeds
aws glue create-trigger \
  --name run-etl-after-crawl \
  --workflow-name daily-etl-workflow \
  --type CONDITIONAL \
  --start-on-creation \
  --predicate '{
    "Logical": "ANY",
    "Conditions": [{
      "LogicalOperator": "EQUALS",
      "CrawlerName": "events-crawler",
      "CrawlState": "SUCCEEDED"
    }]
  }' \
  --actions '[{"JobName": "process-events"}]'

# Start the workflow
aws glue start-workflow-run --name daily-etl-workflow

For complex orchestration beyond simple chains, Step Functions is a better choice. Glue Workflows work well for linear crawler → job sequences but become unwieldy for branching logic, retries, or error handling.

Python Shell Jobs for Small Data

Not every ETL task needs Spark. For data under a few hundred MB, Python shell jobs run as a single Lambda-like process without Spark overhead. Startup is faster and cost is lower.

# python_shell_job.py — runs without Spark
import boto3
import pandas as pd
import io

s3 = boto3.client('s3')

# Read CSV from S3
response = s3.get_object(Bucket='my-bucket', Key='daily-report.csv')
df = pd.read_csv(io.BytesIO(response['Body'].read()))

# Transform
df['revenue'] = df['quantity'] * df['unit_price']
summary = df.groupby('product_category')['revenue'].sum().reset_index()

# Write back
buffer = io.StringIO()
summary.to_csv(buffer, index=False)
s3.put_object(Bucket='my-bucket', Key='summary/daily-summary.csv', Body=buffer.getvalue())

Create the Python shell job:

aws glue create-job \
  --name small-data-job \
  --role arn:aws:iam::123456789012:role/GlueJobRole \
  --command '{
    "Name": "pythonshell",
    "ScriptLocation": "s3://my-glue-scripts/python_shell_job.py",
    "PythonVersion": "3.9"
  }' \
  --default-arguments '{
    "--additional-python-modules": "pandas==2.1.0"
  }' \
  --max-capacity 0.0625

MaxCapacity: 0.0625 (1/16 DPU) is the minimum for Python shell jobs. At $0.44/DPU-hour, a job running for 5 minutes costs 0.0625 × (5/60) × $0.44 = $0.002. Substantially cheaper than Spark for small data.

Pricing and When to Use Glue

Glue ETL pricing: $0.44 per DPU-hour for standard jobs. A 10-worker G.1X job (10 DPUs) running for 30 minutes costs 10 × 0.5 × $0.44 = $2.20 per run. Minimum billed duration is 1 minute for Glue 4.0 jobs.
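The arithmetic generalizes to a small estimator, using the rates quoted in this article ($0.44 per DPU-hour, 1-minute minimum for Glue 4.0):

```python
def glue_job_cost(num_workers, dpu_per_worker, runtime_minutes,
                  rate_per_dpu_hour=0.44, min_billed_minutes=1):
    """Estimate the cost of one Glue Spark job run."""
    billed_minutes = max(runtime_minutes, min_billed_minutes)
    return num_workers * dpu_per_worker * (billed_minutes / 60) * rate_per_dpu_hour

# 10 G.1X workers (1 DPU each) for 30 minutes
print(round(glue_job_cost(10, 1, 30), 2))  # 2.2
```

Multiply by runs per day to compare against a persistent EMR cluster for the same workload.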

Use Glue when:

  • Data exceeds what Lambda can handle (more than its 10 GB memory limit or 15-minute timeout)
  • You need Spark’s distributed processing for parallelism across partitions
  • Your data is in the Glue Data Catalog and you want native catalog integration
  • You’re building data lake pipelines where Parquet/ORC output goes to Athena

Use Lambda + Pandas instead when:

  • Data is under 200 MB per run
  • Processing completes in under 5 minutes
  • You need sub-minute invocation latency
  • Startup overhead of Spark cluster (2-3 minutes) would dominate total runtime

Use EMR when:

  • You need full Spark control (custom Spark configurations, specific versions)
  • Your job runs for hours on large clusters where managed startup overhead matters
  • You’re already running a persistent EMR cluster for other workloads

For streaming ETL rather than batch, the Kinesis guide covers Firehose’s built-in Lambda transformation that handles many streaming ETL patterns without a full Glue job. For querying the data Glue processes, AWS CloudWatch deep dive covers monitoring Glue job metrics including DPU utilization, memory pressure, and stage-level performance.
