Customize Amazon Nova Models with Bedrock Fine-Tuning
Training a custom Nova model on Bedrock costs less than you probably expect. A supervised fine-tuning job on roughly 5,000 conversation examples — enough data to meaningfully shift model behavior — runs around $2.18 in training compute. The bottleneck isn’t money; it’s data quality and the patience to iterate on evaluation. This post covers the full workflow from data format to invocation, including the constraints that will catch you off guard.
Which Nova Models Support Fine-Tuning
Amazon Nova Micro, Nova Lite, and Nova Pro all support supervised fine-tuning (SFT) on Bedrock. Nova Premier does not — it’s a closed-weight model without a customization path. If your use case requires Premier-tier reasoning, prompt engineering and RAG are your only tuning options.
The three tunable models have different tradeoffs:
- Nova Micro — text-only, fastest inference (~60ms), cheapest per token. Best for high-volume classification, extraction, and structured output tasks where latency matters.
- Nova Lite — multimodal (text + images), moderate speed. Good when your training data includes image-text pairs, such as product catalog entries or document understanding tasks.
- Nova Pro — multimodal, largest context (300K tokens), highest capability ceiling. Use when the base model struggles with your task even after extensive prompting.
Fine-tuning Nova Micro or Lite makes more economic sense than fine-tuning Pro, because base Pro is already capable enough for most tasks. Fine-tune Micro to make a fast, cheap model behave like Pro on your specific domain — that’s the cost optimization play.
One hard constraint: tool use is not supported in fine-tuned Nova models. If your application relies on function calling (weather lookups, database queries, API calls), fine-tuning breaks that capability. You’ll need to use the base model for tool-calling workflows and consider fine-tuning only for the pure language generation parts of your pipeline.
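One way to live with this constraint is a routing layer: requests that need function calling go to the base model, and pure generation goes to the fine-tuned model. A minimal sketch (the routing predicate and model identifiers are illustrative, not part of any Bedrock API):

```python
# Fine-tuned Nova models do not support tool use, so route any request
# that needs function calling to the base model instead. Substitute your
# own custom model ARN; these identifiers are illustrative.
BASE_MODEL_ID = 'amazon.nova-micro-v1:0'  # base model keeps toolConfig support
CUSTOM_MODEL_ID = ('arn:aws:bedrock:us-east-1:123456789012:'
                   'custom-model/amazon.nova-micro-v1:0/nova-micro-support-bot-v1')

def pick_model(needs_tools: bool) -> str:
    """Return the model identifier to pass to the Converse API."""
    return BASE_MODEL_ID if needs_tools else CUSTOM_MODEL_ID
```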
Training Data Format
Bedrock Nova fine-tuning requires JSONL with a specific schema. Each line is one training example:
```json
{"schemaVersion": "bedrock-conversation-2024", "system": [{"text": "You are a technical support assistant for AWS services. Answer questions accurately and concisely."}], "messages": [{"role": "user", "content": [{"text": "How do I check the status of an ECS service?"}]}, {"role": "assistant", "content": [{"text": "Use the describe-services command: aws ecs describe-services --cluster your-cluster --services your-service --query 'services[0].{status:status,running:runningCount,desired:desiredCount}'"}]}]}
```
Key points about the format:
- `schemaVersion` must be exactly `"bedrock-conversation-2024"` — the field name changed between preview and GA, so older examples you find online may use a different key
- `system` is a list with a single text object, not a plain string
- `messages` follows the same turn structure as the Bedrock Converse API — alternating user/assistant turns
- Multi-turn conversations are supported: you can have [user, assistant, user, assistant] in a single training example, which teaches the model how to handle follow-up questions
- Images in training data use the same base64 format as the Converse API, with `image` content blocks alongside `text` blocks
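If your examples come from existing support logs, it's straightforward to emit this schema from Python. A minimal sketch (the helper name and inputs are my own, not part of any AWS SDK):

```python
import json

def make_example(system_text, turns):
    """Build one bedrock-conversation-2024 training example from a list
    of (role, text) tuples alternating user/assistant."""
    return {
        "schemaVersion": "bedrock-conversation-2024",
        "system": [{"text": system_text}],
        "messages": [
            {"role": role, "content": [{"text": text}]}
            for role, text in turns
        ],
    }

# One JSONL line per example
line = json.dumps(make_example(
    "You are a technical support assistant for AWS services.",
    [("user", "How do I check the status of an ECS service?"),
     ("assistant", "Use the aws ecs describe-services command.")],
))
```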
Validation script to check your dataset before uploading:
```python
import json

def validate_nova_dataset(filepath):
    errors = []
    line_count = 0
    with open(filepath, 'r') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                example = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"Line {line_num}: Invalid JSON — {e}")
                continue
            # Check required fields
            if example.get('schemaVersion') != 'bedrock-conversation-2024':
                errors.append(f"Line {line_num}: Missing or wrong schemaVersion")
            if 'messages' not in example:
                errors.append(f"Line {line_num}: Missing messages field")
                continue
            messages = example['messages']
            # Check alternating turns
            for i, msg in enumerate(messages):
                expected_role = 'user' if i % 2 == 0 else 'assistant'
                if msg.get('role') != expected_role:
                    errors.append(f"Line {line_num}: Message {i} has role "
                                  f"'{msg.get('role')}', expected '{expected_role}'")
            # Last message must be from assistant (supervised learning target)
            if messages and messages[-1].get('role') != 'assistant':
                errors.append(f"Line {line_num}: Last message must be from assistant")
            line_count += 1
    print(f"Validated {line_count} examples")
    if errors:
        print(f"Found {len(errors)} errors:")
        for e in errors[:20]:  # Show first 20 errors
            print(f"  {e}")
        return False
    print("Dataset valid.")
    return True

validate_nova_dataset('training_data.jsonl')
```
AWS recommends at least 100 examples for SFT to have any measurable effect, and at least 1,000 for reliable behavior change. Around 5,000 high-quality examples is the sweet spot for most domain adaptation tasks. Beyond 10,000, returns diminish unless you’re doing style transfer or teaching a very narrow, high-precision task.
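Before uploading, it's worth knowing where your dataset sits relative to these thresholds. A rough sketch that counts examples and approximates token volume (a word count scaled by 1.3 is a crude stand-in for real tokenizer output):

```python
import json

def dataset_stats(filepath):
    """Count training examples and roughly approximate total tokens.
    The 1.3 words-to-tokens factor is a heuristic for English text;
    actual tokenizer counts will differ."""
    examples, approx_tokens = 0, 0
    with open(filepath) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            examples += 1
            for msg in record.get('messages', []):
                for block in msg.get('content', []):
                    approx_tokens += int(len(block.get('text', '').split()) * 1.3)
    return examples, approx_tokens
```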
Uploading Data and Creating the Job
```bash
# 1. Upload training data to S3
aws s3 cp training_data.jsonl s3://your-bedrock-training-bucket/nova-finetune/training_data.jsonl

# Optional: upload validation data (10-15% of training size recommended)
aws s3 cp validation_data.jsonl s3://your-bedrock-training-bucket/nova-finetune/validation_data.jsonl
```
```python
import boto3
import time

bedrock = boto3.client('bedrock', region_name='us-east-1')

# Create fine-tuning job
response = bedrock.create_model_customization_job(
    jobName='nova-micro-support-bot-v1',
    customModelName='nova-micro-support-bot-v1',
    roleArn='arn:aws:iam::123456789012:role/BedrockCustomizationRole',
    baseModelIdentifier='amazon.nova-micro-v1:0',
    customizationType='FINE_TUNING',
    trainingDataConfig={
        's3Uri': 's3://your-bedrock-training-bucket/nova-finetune/training_data.jsonl'
    },
    validationDataConfig={
        'validators': [{
            's3Uri': 's3://your-bedrock-training-bucket/nova-finetune/validation_data.jsonl'
        }]
    },
    outputDataConfig={
        's3Uri': 's3://your-bedrock-training-bucket/nova-finetune/output/'
    },
    hyperParameters={
        'epochCount': '3',
        'learningRateMultiplier': '1.0',
        'batchSize': '8',
    }
)
job_arn = response['jobArn']
print(f"Job started: {job_arn}")

# Poll for completion
while True:
    status = bedrock.get_model_customization_job(jobIdentifier=job_arn)
    state = status['status']
    print(f"Status: {state}")
    if state in ('Completed', 'Failed', 'Stopped'):
        break
    time.sleep(60)

if state == 'Completed':
    custom_model_arn = status['outputModelArn']
    print(f"Custom model ARN: {custom_model_arn}")
```
Training time varies by dataset size and model. Nova Micro with 5,000 examples over 3 epochs completes in about 20-30 minutes; Nova Pro with the same dataset takes 45-90 minutes. The ~$2.18 base figure for a 5K-example Micro job covers the training compute; per-token training charges come on top, as the cost breakdown later in this post shows.
Hyperparameter Guidance
Three hyperparameters control the fine-tuning job:
epochCount (1-5): How many times the model sees your full dataset. Start with 3. Lower if your model overfits (high training accuracy, poor generalization); higher if training loss is still decreasing at epoch 3. More than 5 epochs on small datasets almost always overfits.
learningRateMultiplier (0.1-2.0): Scales the base learning rate. Default of 1.0 works for most cases. If the model loses general capabilities (it can only answer training-domain questions), reduce to 0.3-0.5. If convergence is slow and validation loss is still high after 3 epochs, increase to 1.5.
batchSize (1, 2, 4, 8): Larger batches train faster but use more memory and smooth out some of the gradient noise that can aid generalization. 8 is the recommended default. Drop to 4 if the job fails with out-of-memory errors (rare, but it happens with very long conversations in your training data).
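One gotcha worth calling out: Bedrock expects hyperparameter values as strings, not numbers. A conservative starting configuration for a small dataset, following the guidance above, might look like this (starting points, not optima):

```python
# Conservative settings for a small (~1,000 example) dataset: fewer
# epochs to limit overfitting, a reduced learning-rate multiplier to
# preserve general capabilities.
hyper_parameters = {
    'epochCount': '2',
    'learningRateMultiplier': '0.5',
    'batchSize': '8',
}

# Every value must be a string, or boto3 raises a parameter validation error
assert all(isinstance(v, str) for v in hyper_parameters.values())
```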
Invoking a Fine-Tuned Model
Fine-tuned Nova models on Bedrock support on-demand inference: for models customized after July 2025, you no longer need to purchase Provisioned Throughput in advance. Call the custom model ARN directly through the standard Converse API:
```python
import boto3

bedrock_rt = boto3.client('bedrock-runtime', region_name='us-east-1')

# Use the custom model ARN from the completed training job
CUSTOM_MODEL_ARN = 'arn:aws:bedrock:us-east-1:123456789012:custom-model/amazon.nova-micro-v1:0/nova-micro-support-bot-v1'

def invoke_fine_tuned_model(user_message, system_prompt=None):
    messages = [{'role': 'user', 'content': [{'text': user_message}]}]
    request = {
        'modelId': CUSTOM_MODEL_ARN,
        'messages': messages,
        'inferenceConfig': {
            'maxTokens': 512,
            'temperature': 0.3,  # Lower temperature for more consistent outputs
        }
    }
    if system_prompt:
        request['system'] = [{'text': system_prompt}]
    response = bedrock_rt.converse(**request)
    return response['output']['message']['content'][0]['text']

# Test
result = invoke_fine_tuned_model(
    "My ECS tasks keep stopping with exit code 137",
    system_prompt="You are a technical support assistant for AWS services."
)
print(result)
```
The custom model ARN format is `arn:aws:bedrock:{region}:{account}:custom-model/{base-model-id}/{custom-model-name}`. You can also retrieve it from `list_custom_models`:
```bash
aws bedrock list-custom-models \
  --query 'modelSummaries[?modelName==`nova-micro-support-bot-v1`].modelArn' \
  --output text
```
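Since the ARN format is deterministic, you can also assemble it without a lookup call. A small helper (the function name is my own):

```python
def custom_model_arn(region, account, base_model_id, custom_model_name):
    """Assemble a Bedrock custom-model ARN following the format
    arn:aws:bedrock:{region}:{account}:custom-model/{base-model-id}/{custom-model-name}."""
    return (f"arn:aws:bedrock:{region}:{account}:"
            f"custom-model/{base_model_id}/{custom_model_name}")

arn = custom_model_arn('us-east-1', '123456789012',
                       'amazon.nova-micro-v1:0', 'nova-micro-support-bot-v1')
```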
Evaluating After Fine-Tuning
Don’t deploy without evaluation. The training loss curve tells you whether the model learned your dataset, not whether it learned what you actually wanted.
A minimal evaluation setup: take 50-100 examples not in your training set, run both the base model and your fine-tuned model on the same prompts, and compare outputs on your actual quality criteria (correct format, factual accuracy, tone, brevity — whatever matters for your use case). Human review on this sample set is more reliable than automated metrics for most practical fine-tuning tasks.
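That comparison loop can be sketched with the model call abstracted behind a callable, which keeps the harness testable offline. In production the invoke callables would wrap Converse API calls; the `score` function is a placeholder for whatever quality criteria you defined:

```python
def compare_models(prompts, invoke_base, invoke_tuned, score):
    """Run identical prompts through both models and score each output.
    score(prompt, output) -> float: higher is better, encoding your
    format/accuracy/tone/brevity criteria."""
    results = []
    for prompt in prompts:
        results.append({
            'prompt': prompt,
            'base_score': score(prompt, invoke_base(prompt)),
            'tuned_score': score(prompt, invoke_tuned(prompt)),
        })
    tuned_wins = sum(r['tuned_score'] > r['base_score'] for r in results)
    return results, tuned_wins
```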
Bedrock provides training and validation loss metrics in the output S3 prefix after the job completes. Check for overfitting — if training loss drops sharply while validation loss plateaus or rises after epoch 2, you need either more data or fewer epochs.
```bash
# Check training metrics from the output
aws s3 ls s3://your-bedrock-training-bucket/nova-finetune/output/

# Download and examine the training_metrics.json file
aws s3 cp s3://your-bedrock-training-bucket/nova-finetune/output/training_metrics.json /tmp/

python3 -c "
import json
with open('/tmp/training_metrics.json') as f:
    metrics = json.load(f)
for epoch in metrics.get('epoch_metrics', []):
    print(f\"Epoch {epoch['epoch']}: train_loss={epoch['trainingLoss']:.4f}, val_loss={epoch.get('validationLoss', 'N/A')}\")
"
```
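Assuming the per-epoch metrics have the shape used in the snippet above, the overfitting check itself can be automated (verify the key names against your own metrics file):

```python
def looks_overfit(epoch_metrics, tolerance=0.0):
    """Return True when training loss is still falling but validation
    loss has plateaued or risen by the final epoch: the classic
    overfitting signature. Expects dicts with 'trainingLoss' and
    'validationLoss' keys."""
    if len(epoch_metrics) < 2:
        return False
    prev, last = epoch_metrics[-2], epoch_metrics[-1]
    train_improving = last['trainingLoss'] < prev['trainingLoss']
    val_stalled = last['validationLoss'] >= prev['validationLoss'] - tolerance
    return train_improving and val_stalled
```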
Cost Breakdown
Training cost for SFT on Nova Micro is charged per token processed during training. At roughly $0.0004 per 1,000 training tokens, 5,000 examples averaging ~900 tokens each works out to 900 tokens × 5,000 examples × 3 epochs ≈ 13.5 million training tokens — around $5.40 plus the $2.18 GPU-hour base fee. The actual total depends on your conversation lengths and epoch count.
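The same arithmetic as a function, so you can plug in your own conversation lengths (the default rate mirrors the figure quoted above; check current Bedrock pricing before trusting it):

```python
def training_token_cost(examples, tokens_per_example, epochs,
                        rate_per_1k_tokens=0.0004):
    """Estimate the per-token portion of an SFT job's cost.
    The default rate mirrors the ~$0.0004/1K figure above; it excludes
    any base fee and may not match current pricing."""
    total_tokens = examples * tokens_per_example * epochs
    return total_tokens / 1000 * rate_per_1k_tokens

cost = training_token_cost(5000, 900, 3)  # the 13.5M-token example above
```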
Storage of the custom model weights costs $1.95/month while the model is stored. There’s no inference-specific additional fee beyond standard Nova Micro per-token pricing — on-demand custom model invocations bill at the same rate as the base model.
The economics work like this: if fine-tuning reduces your average output tokens by 40% (because a task-specific model is less verbose than a general model), and you run 1 million Micro invocations per month, you’re saving about $64/month on output tokens. The fine-tuning job pays for itself in the first month.
For more context on how Nova models fit into larger architectures, the Bedrock Agents and MCP DevOps guide covers orchestrating multiple Nova models in a pipeline. If you’re building applications where the fine-tuned model queries a database, the Aurora Serverless v2 with Bedrock AI queries post shows how to wire up the retrieval layer that feeds into Bedrock invocations.