Amazon Athena: The Things That Will Surprise You in Production (And How to Deal With Them)

Written by Bits Lovers

Amazon Athena is one of those services that looks magical in demos and then hands you a surprise $3,000 bill on the first month you actually use it seriously.

I’ve run Athena in production across multiple companies. Here’s what nobody puts in the getting-started guides: the cost model, the Glue metastore quirks, the CTAS limitations that will break your data pipeline, and the query patterns that determine whether Athena is a bargain or a budget nightmare.

How Athena Actually Works (The Simplified Version Is a Lie)

The Athena marketing line: “Run SQL queries on data in S3.” That makes it sound like a regular database with S3 as storage. It’s not.

Athena is a query engine that sits on top of your S3 data. When you run a query, Athena:

  1. Computes a query plan
  2. Sends the plan to the execution engine (Presto-based; engine version 3 is built on Trino, with AWS modifications)
  3. Presto reads data directly from S3
  4. Athena returns results

There is no persistent compute layer. No running servers (unless you pay for Athena Provisioned Capacity, which reserves dedicated processing capacity for a workgroup). Every query spins up compute, runs, and shuts down. This is why Athena scales so well, and why costs are unpredictable if your queries aren't controlled.

The consequence: Athena doesn’t cache your data. Every query reads from S3. If you run the same aggregation query every hour, you’re scanning the same data every hour. This is the foundation for everything below.

The Cost Model That Surprises People

Athena has two cost dimensions:

Query cost: $5 per TB of data scanned. Not per query — per TB scanned. A query that scans 500GB costs $2.50. A query that scans 10TB costs $50.

Storage cost: Your data lives in S3. S3 pricing applies. This is usually not the surprise.

The surprises come from data scanned, not data stored. Here’s how costs spiral:

Scenario 1: Unpartitioned large tables. You have a 50TB dataset with no partitions. A SELECT COUNT(*) scans all 50TB. Cost: $250. Run that twice a day in a dashboard refresh and you're spending $500 a day, roughly $15,000 a month, just to count rows.

Scenario 2: Regex SerDe for semi-structured data. If your table uses RegexSerDe, Athena can still prune partitions on partition columns, but within every partition it does scan it must read and regex-parse every row of every file: there is no column pruning and no predicate pushdown into the data. On large text datasets you pay for the full partition contents no matter how selective the query is. This one has burned several teams I've worked with.

Scenario 3: Wide tables with narrow queries. You have 200 columns. You write SELECT 5_columns FROM table WHERE .... Athena scans all 200 columns because most file formats (CSV, JSON) are row-based. Only Parquet and ORC support column pruning. With a 200-column CSV table and a 5-column query, you're scanning roughly 40x more data than you need.

The lesson: in Athena, query design is cost management. Every query you write should be cost-conscious.
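The arithmetic is worth automating as a pre-flight sanity check. A minimal sketch, assuming the published $5 per (decimal) TB price and Athena's 10MB per-query billing minimum; verify both against current AWS pricing:

```python
PRICE_PER_TB = 5.00            # assumed on-demand price per decimal TB scanned
MIN_BYTES_BILLED = 10 * 10**6  # Athena bills at least 10MB per query

def query_cost(bytes_scanned: int, runs: int = 1) -> float:
    """Estimate dollar cost for `runs` executions of a query that scans
    `bytes_scanned` bytes per run."""
    billed = max(bytes_scanned, MIN_BYTES_BILLED)
    return round(billed / 10**12 * PRICE_PER_TB * runs, 2)

print(query_cost(50 * 10**12))   # Scenario 1's full 50TB scan: 250.0
print(query_cost(500 * 10**9))   # a single 500GB scan: 2.5
```

Costs scale linearly with repetition, which is exactly why an unpartitioned dashboard query run on a schedule gets expensive so quickly.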

Glue Data Catalog: The Metadata Layer That Causes Problems

Athena uses the AWS Glue Data Catalog as its metastore. Your table definitions — schema, partitions, location, SerDe — live in Glue. This integration is mostly seamless, but there are edge cases that cause real production pain.

Schema Evolution: The Silent Breaker

AWS Glue supports schema evolution — the ability to add columns, change column order in certain cases, and evolve your schema over time. It mostly works.

The gotcha: with Parquet and ORC, Athena can resolve columns by name or by position. If your tables rely on index-based column access, adding columns in the middle of the schema shifts the positions, and queries can silently return values from the wrong columns. This is a known limitation.

The fix: always add columns at the end of your schema, never in the middle. If you need to reorder columns, recreate the table from scratch.

Partition Column Type Mismatches

This one is insidious. Say you have an S3 path like s3://bucket/table/year=2024/month=01/day=15/. In Glue, you define year, month, and day as partition columns.

If Glue has month defined as a string and your data has month=01 vs month=1, queries will silently return zero rows. No error. No warning. Just empty results.

The debugging process: check Glue table definition, check actual S3 paths, check your MSCK REPAIR TABLE status. This usually takes 20 minutes of head-scratching before you find the mismatch.

The fix: be consistent with partition column types and values from day one. Pad single-digit values with zeros. Define partition columns in Glue as string or bigint, not a mix.
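A small helper in the writer path keeps partition values padded and consistent from day one. hive_partition_path is a hypothetical name for illustration, not an AWS API:

```python
from datetime import date

def hive_partition_path(base: str, d: date) -> str:
    """Build a Hive-style partition path with zero-padded string values,
    so Glue string partition columns match exactly (month=01, never month=1)."""
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

print(hive_partition_path("s3://bucket/table", date(2024, 1, 15)))
# s3://bucket/table/year=2024/month=01/day=15/
```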

The MSCK REPAIR TABLE Problem

When you add new partitions to S3 by uploading files directly (not through a Glue crawler), Athena doesn’t know about them until you run MSCK REPAIR TABLE table_name.

MSCK REPAIR TABLE works for partitions that follow the Hive-style naming convention (column=value). If your partition paths don’t follow this exactly, it won’t pick them up.

-- Works: detects partitions matching year=YYYY/month=MM/day=DD
MSCK REPAIR TABLE events;

-- Alternative for non-standard partition schemes:
ALTER TABLE events ADD PARTITION (year='2024', month='01')
LOCATION 's3://bucket/events/year=2024/month=01/';

The practical problem: in ETL pipelines that write new partitions directly to S3 (bypassing Glue), you need to either run MSCK REPAIR TABLE on a schedule or use ALTER TABLE ADD PARTITION to register each new partition. For high-frequency writes, this becomes a bottleneck.
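One pattern that avoids scheduled MSCK runs entirely: have the ETL job register each partition explicitly right after writing it. A sketch of the statement-building side (add_partition_sql is a hypothetical helper; the rendered SQL could be submitted through the Athena query API):

```python
from datetime import date

def add_partition_sql(table: str, base: str, d: date) -> str:
    """Render the ALTER TABLE ADD PARTITION statement an ETL job could
    submit right after writing a new S3 prefix. IF NOT EXISTS makes the
    registration safe to retry."""
    y, m = str(d.year), f"{d.month:02d}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{y}', month='{m}') "
        f"LOCATION '{base}/year={y}/month={m}/';"
    )

print(add_partition_sql("events", "s3://bucket/events", date(2024, 1, 15)))
```

Unlike MSCK REPAIR TABLE, this touches only the one new partition, so it stays cheap even at high write frequency.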

Glue Crawlers: Convenient Until They’re Not

Glue crawlers automatically discover schema from your data. They’re great for initial setup. They’re terrible for production governance.

Crawlers infer types from a sample of data. If column 5 in your CSV has mixed types (integers in some rows, strings in others), the crawler picks one type and silently coerces the others. Or it marks the column as a string. Either way, your downstream queries break.

The crawler also doesn’t know your business logic. If column A is a user ID and column B is an event count, a crawler sees them as string and bigint and has no opinion about which columns should be required vs. nullable.

Recommendation: use crawlers for initial exploration and prototyping. Use manually defined Glue tables for production data. Define your schema, types, and partition columns explicitly. This is more work upfront but prevents silent type coercion and schema drift.

CTAS: The Limitations That Break Pipelines

CTAS (CREATE TABLE AS SELECT) is Athena’s way of doing ETL. You run a query and write the results to a new table. It’s powerful. It’s also limited in ways that bite you regularly.

CTAS Limitations in Athena

  • CTAS never appends. CTAS always creates a new table and fails if the table already exists. To append, write to a staging table and then run a separate INSERT INTO existing_table SELECT ... FROM staging (INSERT INTO is supported from engine version 2 onward), or rebuild the table with a UNION ALL of existing plus new data in your CTAS.

  • CTAS output format restrictions. If you specify format='Parquet', Athena generates Parquet files. If you specify format='ORC', it generates ORC. What you cannot do is generate multiple output formats in a single CTAS.

  • CTAS compression. The codec is controlled with the write_compression table property (older engine versions used parquet_compression / orc_compression). If you omit it, Athena picks a default (GZIP for Parquet), so set it explicitly when downstream consumers require a specific codec such as SNAPPY.

  • CTAS and encryption. If your source data is encrypted with a customer managed KMS key (SSE-KMS), your CTAS output needs the same or compatible encryption settings. If you forget, you get a permissions error that doesn’t clearly indicate it’s a KMS issue.

  • CTAS file sizing. Athena aims for files in the 128-256MB range, but each writer emits at least one file per output partition. Partition or bucket by a high-cardinality column and you get thousands of tiny files instead (the “small files problem”). Small files kill query performance because each file requires a separate read operation.

The Small Files Problem

This is Athena’s most common performance issue. If your source data is partitioned by a high-cardinality column (like user_id) and you write many small output files per partition, you end up with thousands of tiny files. Querying this data means Athena opens thousands of files per query, which adds overhead that dwarfs the actual data scanning time.

A query that should take 10 seconds takes 10 minutes because you’re spending 9 minutes opening and closing files.
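A toy latency model shows why file count, not just total bytes, drives runtime. Every figure here is an assumption for illustration (the 30ms per-file overhead and 1GB/s throughput are made-up numbers, not measured Athena internals):

```python
def scan_seconds(total_gb: float, n_files: int,
                 throughput_gb_s: float = 1.0, per_file_s: float = 0.03) -> float:
    """Model scan time as streaming throughput plus a fixed per-file
    open/list cost. The per-file term dominates once files get small."""
    return total_gb / throughput_gb_s + n_files * per_file_s

print(round(scan_seconds(10, 100), 1))      # 100 x 100MB files: 13.0
print(round(scan_seconds(10, 100_000), 1))  # 100k x 100KB files: 3010.0
```

Same 10GB of data; the only variable is how many files it is split across.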

Solutions:

-- Option 1: Use bucketing in CTAS to control file count per partition
CREATE TABLE events_bucketed
WITH (
    format = 'Parquet',
    partitioned_by = ARRAY['year', 'month'],
    bucketed_by = ARRAY['event_type'],
    bucket_count = 50
)
AS
SELECT event_id, event_type, timestamp, year, month
FROM events_raw
WHERE year = '2024';

-- Option 2: Compact periodically. Athena has no built-in compaction for
-- plain external tables (engine v3's OPTIMIZE applies only to Iceberg), so:
-- 1. CTAS the data into a new location (rewrites it into fewer, larger files)
-- 2. Repoint consumers, then drop the old table and recreate the table
--    name over the compacted location

AWS introduced Athena engine version 3 in 2022, which handles some small-file issues better, but it doesn’t fully solve the problem. If you’re writing data frequently with fine-grained partitioning, budget time for periodic compaction.

Query Patterns That Save Money

Pattern 1: Always Filter Before Aggregation

-- WRONG: Full table scan, then aggregation
SELECT event_type, COUNT(*)
FROM events
GROUP BY event_type

-- RIGHT: Filter first (Athena pushes predicates down to storage)
SELECT event_type, COUNT(*)
FROM events
WHERE year = '2024' AND month = '01'
GROUP BY event_type

The right version might scan 1% of the data. The wrong version scans everything. At 100TB scale, this is $500 vs $5.

Pattern 2: Use Approximate Aggregations for Large Datasets

-- Exact count (scans all rows)
SELECT COUNT(DISTINCT user_id) FROM events

-- Approximate (uses HyperLogLog; ~2.3% standard error, far faster and lighter on memory)
SELECT APPROX_DISTINCT(user_id) FROM events

For dashboards and analytics where ~2% error is acceptable, APPROX_DISTINCT is dramatically faster than COUNT(DISTINCT) and far less likely to exhaust memory on high-cardinality columns. Note that bytes scanned, and therefore dollar cost, are roughly the same; the win is runtime and reliability.

Pattern 3: Partition and Cluster Wisely

When you create a table, partition by columns you filter on frequently. Then cluster by high-cardinality columns within partitions.

CREATE EXTERNAL TABLE events (
    event_id string,
    event_type string,
    user_id string,
    payload string
)
PARTITIONED BY (year string, month string)
CLUSTERED BY (event_type) INTO 16 BUCKETS
STORED AS PARQUET
LOCATION 's3://bucket/events/';

With bucketing: within each partition, rows are hashed into 16 buckets by event_type, so every row with the same value lands in the same bucket files. Queries that filter by event_type within a partition read significantly less data because Athena can skip the other buckets.
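The bucket assignment itself is just hash-and-mod. A sketch of the idea, with CRC32 standing in for the engine's actual hash function (Athena/Hive use their own; CRC32 here is only to illustrate the mechanism deterministically):

```python
import zlib

def bucket_for(value: str, bucket_count: int = 16) -> int:
    """Illustrative bucket assignment: hash the clustering column value
    and take it modulo the bucket count. Deterministic, so all rows with
    the same value land in the same bucket file."""
    return zlib.crc32(value.encode("utf-8")) % bucket_count

b = bucket_for("page_view")
print(0 <= b < 16)                          # True: always a valid bucket
print(b == bucket_for("page_view"))         # True: same value, same bucket
```

Because the mapping is deterministic, an equality filter on the clustering column lets the engine read one bucket out of sixteen instead of all of them.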

Pattern 4: UNION ALL Instead of UNION

-- UNION (Athena deduplicates, scans all rows from both)
SELECT id FROM table_a UNION SELECT id FROM table_b

-- UNION ALL (No deduplication, reads both tables, returns all rows)
SELECT id FROM table_a UNION ALL SELECT id FROM table_b

If you don’t need deduplication, UNION ALL is faster because Athena doesn’t need to compare rows across tables.

Glue Schema Registry: When Your Serialized Data Changes

If you’re storing JSON or Avro data and using Schema Registry to manage schema evolution, there are a few Athena-specific behaviors to know.

Athena reads the schema from the table definition, not from Schema Registry. If you evolve your schema in Schema Registry (adding optional fields, renaming columns with aliases), Athena doesn’t automatically pick up the changes.

You need to:

  1. Update the Glue table definition to reflect the new schema
  2. Recreate or ALTER the table to add new columns
  3. Backfill historical data if the new fields need to be queryable for old partitions

This is where schema-on-read sounds great (“just read whatever’s there”) but actually creates complexity. If you have optional fields in your JSON, old partitions might not have those fields. SELECT new_column FROM old_partition returns nulls, not errors. This is correct behavior but can lead to subtle bugs in analytics if analysts don’t know to check for nulls.

Athena Workgroups: Cost Control Infrastructure

For anything beyond experimentation, set up Athena workgroups with controls:

{
  "Name": "analytics-workgroup",
  "Description": "Production analytics queries",
  "Configuration": {
    "ResultConfiguration": {
      "OutputLocation": "s3://athena-results-bucket/analytics/",
      "EncryptionConfiguration": {
        "EncryptionOption": "SSE_S3"
      }
    },
    "EnforceWorkGroupConfiguration": true,
    "PublishCloudWatchMetricsEnabled": true,
    "BytesScannedCutoffPerQuery": 107374182400
  }
}

Key settings:

  • EnforceWorkGroupConfiguration: true — Blocks queries that would exceed configured limits
  • PublishCloudWatchMetricsEnabled: true — Makes query-level cost data visible in CloudWatch
  • BytesScannedCutoffPerQuery — Hard limit per query. Prevents runaway scans.

Without workgroup controls, a single badly-written query can scan petabytes and generate a $5,000 bill. With workgroup limits, the query fails but you don’t get the bill.

Set the cutoff to something your business can absorb as an anomaly. If your largest legitimate query scans 500GB, a 1TB cutoff leaves headroom without exposing you to runaway scans.

Real Production Scenarios

Scenario 1: Log Analysis Pipeline

You’re analyzing CloudTrail logs across 200 AWS accounts. Raw logs in S3: 50TB, 2 years of data. Partitioned by account_id/year/month/day.

The pain points:

  • Cross-account queries are slow because each account has its own partition subtree
  • CloudTrail delivers enormous numbers of small gzipped log files (plus digest files), and listing them across thousands of prefixes adds overhead to every query
  • Columnar format (Parquet) conversion of raw CloudTrail JSON took 40 hours of Glue ETL

The fix: partition aggressively, convert to Parquet once, and set up an Athena view over the raw and Parquet tables so you can query both during the migration period. After 6 months, deprecate the raw table.

Scenario 2: Financial Analytics

You’re running financial queries on transactional data. Accuracy matters. Cost matters more.

The approach: daily CTAS jobs that aggregate from a transaction table into a summary table. The summary table has 1/100th the row count. All dashboards query the summary table. Full-detail queries are rare and budgeted separately.

This pattern — pre-aggregate expensive queries into materialized summaries — is the most reliable way to keep Athena costs predictable.

Scenario 3: Ad-Hoc Exploration Without Blowing the Budget

Data scientists want to explore raw data without reading everything. The approach: enable Athena workgroup query result encryption (already standard), set a per-query bytes cutoff at 10GB, and create a separate workgroup for heavy analytical queries with a higher limit but required approval.

The result: data scientists can explore freely within a budget. Heavy queries require explicit authorization.

The Bottom Line

Athena is a genuinely useful service for the right use case: ad-hoc SQL queries on large S3 datasets without managing infrastructure. It’s not a replacement for a data warehouse. It’s not a replacement for a database. It’s a query engine, and it behaves like one.

The rules that keep it running smoothly:

  1. Partition everything by time. Always. No exceptions.
  2. Convert to Parquet or ORC for any table larger than 10GB.
  3. Set up workgroup controls before going to production.
  4. Monitor bytes scanned per query in CloudWatch.
  5. Pre-aggregate expensive queries into summary tables.
  6. Never query raw JSON or CSV in production. Ever.
  7. Glue schema changes require backfill planning.

Get these right and Athena is cheap, fast, and reliable. Skip them and you’ll spend more time debugging cost overruns than running queries.

Bits Lovers

Professional writer and blogger. Focus on Cloud Computing.
