Building Production-Ready Systems
Definition
A production-ready system is one you'd trust to:
- Handle millions of requests per day
- Survive infrastructure failures
- Recover quickly from bugs
- Scale without manual intervention
- Cost less than the value it creates
Simple Explanation
What it is
Production-ready means a system you can trust with real users, real money, and real downtime risks.
Why we need it
If you launch without reliability and recovery plans, small incidents turn into major outages.
Benefits
- Higher uptime with fewer surprises.
- Faster recovery when failures happen.
- Predictable costs as traffic grows.
Tradeoffs
- More engineering work up front.
- Stricter processes around deployments and monitoring.
Real-world examples (architecture only)
- Canary deploy -> Automatic rollback -> Stable release.
- Multi-AZ setup -> Service continues during outage.
The Reliability Hierarchy
Tier 1: Can it handle normal traffic? (Availability)
Tier 2: Can it survive failures? (Resilience)
Tier 3: Can it recover quickly? (Recoverability)
Tier 4: Is it optimized? (Performance)
Tier 5: Is it profitable? (Cost)
Don't build Tier 5 without Tier 1.
Availability (99.9%)
3 nines = 99.9% uptime = 43 minutes downtime/month (maximum).
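The downtime figure above follows directly from the availability target. A minimal sketch of the arithmetic, assuming a 30-day month (the function name is illustrative, not from any library):

```python
def downtime_budget_minutes(availability_pct, period_days=30):
    """Maximum downtime, in minutes, that an availability target allows."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare a few common targets over a 30-day month.
for target in (99.0, 99.9, 99.95, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/month")
```

For 99.9% this gives roughly 43 minutes, matching the figure above. Each extra nine cuts the budget by a factor of ten.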
Multi-AZ Design
LambdaFunction:
  # AWS automatically runs in ≥2 AZs
  # If AZ fails, another AZ takes over
  # Automatic, built-in
DynamoDBTable:
  BillingMode: PAY_PER_REQUEST
  # Automatically replicated across 3+ AZs
  # Automatic, built-in
Health Checks
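HealthCheckAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/Route53
    MetricName: HealthCheckStatus # 1 = healthy, 0 = unhealthy
    # Dimensions (HealthCheckId) omitted here
    Statistic: Minimum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: LessThanThreshold
The alarm fires when the health check fails. If failover routing records are attached to the same health check, Route 53 then routes traffic to the backup region.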
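A health check needs an endpoint to probe. Here is one possible shape for that endpoint, as a sketch: the names `check_dependencies` and `health_handler` and the example checks are hypothetical, and the response format assumes an API Gateway style proxy response. Route 53 HTTP health checks treat 2xx and 3xx status codes as healthy, so a 503 marks the endpoint unhealthy.

```python
import json


def check_dependencies(checks):
    """Run each named check; a check passes if it returns truthy without raising."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def health_handler(checks):
    """Return 200 if every dependency check passes, otherwise 503."""
    results = check_dependencies(checks)
    healthy = all(results.values())
    return {
        "statusCode": 200 if healthy else 503,
        "body": json.dumps({"healthy": healthy, "checks": results}),
    }


def failing_check():
    raise ConnectionError("database unreachable")  # simulated outage


print(health_handler({"cache": lambda: True, "database": failing_check}))
```

Keep the checks cheap and limited to dependencies the service cannot run without. Otherwise a slow optional dependency can fail the whole endpoint.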
Resilience (Fault Tolerance)
System continues operating even when parts fail.
Retry Logic
import time

MAX_RETRIES = 3

def backoff(attempt):
    # Exponential backoff: 0.1s, 0.2s, 0.4s, ...
    return (2 ** attempt) * 0.1

def with_retry(fn):
    for attempt in range(MAX_RETRIES):
        try:
            return fn()
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff(attempt))
Timeouts
import concurrent.futures

def with_timeout(fn, timeout_seconds):
    # No "with" block: its exit would wait for fn to finish and defeat the timeout
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return executor.submit(fn).result(timeout=timeout_seconds)
    finally:
        executor.shutdown(wait=False)

result = with_timeout(fetch_from_api, 5)
### Graceful Degradation
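```python
import logging

logger = logging.getLogger(__name__)

def get_recommendations(user_id):
    # ml and cache stand for the application's ML client and cache client
    try:
        return ml.get_recommendations(user_id)
    except Exception:
        logger.warning("ML service down, using cached recommendations")
        return cache.get(f"recommendations:{user_id}")
```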
Recoverability (Mean Time to Recovery)
How quickly can you recover from failures?
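Answering that question starts with measuring it. A minimal sketch of computing mean time to recovery from incident records, assuming each incident is stored as a `(detected, resolved)` pair of timestamps (the record format and the sample data are made up for illustration):

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - detected) over (detected, resolved) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: one 10-minute and one 30-minute outage.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 10)),
    (datetime(2024, 2, 12, 14, 0), datetime(2024, 2, 12, 14, 30)),
]
print(mean_time_to_recovery(incidents))
```

Tracking this number over time shows whether backups, canaries, and rollback plans are actually shortening recovery.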
Backup Strategy
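DatabaseBackup:
  Type: AWS::Backup::BackupVault
  Properties:
    BackupVaultName: production-backups
BackupPlan:
  Type: AWS::Backup::BackupPlan
  Properties:
    BackupPlan:
      BackupPlanName: daily-backups # illustrative name
      BackupPlanRule:
        - RuleName: Daily
          TargetBackupVault: !Ref DatabaseBackup
          ScheduleExpression: 'cron(0 2 * * ? *)' # 2 AM UTC daily
          StartWindowMinutes: 60
          CompletionWindowMinutes: 120
          Lifecycle:
            DeleteAfterDays: 30 # no move to cold storage
BackupSelection:
  Type: AWS::Backup::BackupSelection
  Properties:
    BackupPlanId: !Ref BackupPlan
    BackupSelection:
      SelectionName: production-database # illustrative name
      IamRoleArn: !GetAtt BackupRole.Arn # assumes an IAM role named BackupRole defined elsewhere
      Resources:
        - !GetAtt MyDatabase.DBInstanceArn # RDS instance
Daily backups limit data loss to at most 24 hours. Restore time depends on database size, so run a test restore to confirm it fits the < 1 hour target.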
Deployment Verification
Don't deploy blindly:
Canary:
  Type: AWS::Synthetics::Canary
  Properties:
    Name: api-health
    RuntimeVersion: syn-python-selenium-1.0
    Schedule:
      Expression: rate(1 minute)
    # Also needs ExecutionRoleArn and ArtifactS3Location (omitted here)
    Code:
      Handler: api_canary.handler
      Script: |
        import requests  # must be available in the canary runtime
        def handler(event, context):
            response = requests.get("https://api.example.com/items", timeout=5)
            if response.status_code != 200:
                raise RuntimeError("Canary failed with status %s" % response.status_code)
Canary tests run every minute. If a deployment breaks the API, you know immediately.
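Knowing immediately is only useful if something acts on it. One simple way to turn canary results into a rollback decision is to require several failures in a row, so a single flaky run does not trigger a rollback. This is a sketch under that assumption. The function name and the threshold of 3 are illustrative.

```python
def should_roll_back(canary_results, max_consecutive_failures=3):
    """Return True when the most recent runs end in N or more failures in a row.

    canary_results: list of booleans, oldest first (True means the canary passed).
    """
    streak = 0
    for passed in reversed(canary_results):
        if passed:
            break
        streak += 1
    return streak >= max_consecutive_failures


# One isolated failure is tolerated; three in a row are not.
print(should_roll_back([True, True, False, True]))
print(should_roll_back([True, False, False, False]))
```

With a one-minute canary schedule, a threshold of 3 means a bad deployment is flagged about three minutes after it starts failing.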
Rollback Plan
Keep previous version deployable:
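# Current version
sam deploy --stack-name myapp
# If issues are detected, redeploy the last known-good version
# (--use-previous-template only reuses the template already on the stack, so it cannot restore an older one)
git checkout "$LAST_GOOD_TAG" # placeholder for your previous release tag
sam build && sam deploy --stack-name myapp
Rehearse the rollback and time it. A full redeploy usually takes longer than the < 1 minute you would want.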
Performance (P99 < 500ms)
User experience depends on tail latency, not average.
SLOs
Define Service Level Objectives:
API read: P99 < 100ms (95% of time)
API write: P99 < 500ms (99% of time)
Database: P99 < 50ms (99.9% of time)
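To check a latency SLO you need the percentile itself, not the mean. A minimal sketch using the nearest-rank method on raw latency samples (the sample data is invented, and production systems would normally read percentiles from their metrics backend instead):

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(samples, threshold_ms, pct=99):
    """True if the pct-th percentile latency is under the threshold."""
    return percentile(samples, pct) < threshold_ms


# 98 fast requests and 2 slow ones: the mean looks fine, the tail does not.
latencies_ms = [20] * 98 + [900, 950]
print("mean:", sum(latencies_ms) / len(latencies_ms))
print("p99:", percentile(latencies_ms, 99))
print("meets 100 ms SLO:", meets_slo(latencies_ms, 100))
```

The example shows why tail latency matters: a handful of slow requests barely move the average but dominate P99.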
Continuous Monitoring
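P99AlarmRead:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/Lambda # assumes the Lambda Duration metric
    MetricName: Duration
    ExtendedStatistic: p99 # percentiles use ExtendedStatistic, not Statistic
    Period: 300
    EvaluationPeriods: 1
    Threshold: 100 # milliseconds
    ComparisonOperator: GreaterThanThreshold
Alert if P99 exceeds the SLO.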
Cost Efficiency
Profitable or it's not production-ready.
Cost Per Request
Revenue per request: $0.10
Cost per request: $0.02
Profit per request: $0.08 (80% margin)
Monitor this metric obsessively.
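Monitoring the metric means computing it from real numbers on a schedule. A small sketch, assuming you can pull the monthly bill and request count from your billing and metrics tools. The function names and the 50% alert floor are illustrative choices, not recommendations.

```python
def cost_per_request(monthly_cost, monthly_requests):
    """Average infrastructure cost of serving one request."""
    return monthly_cost / monthly_requests


def margin(revenue_per_request, cost_per_request_value):
    """Fraction of each request's revenue that is left after cost."""
    return (revenue_per_request - cost_per_request_value) / revenue_per_request


def margin_alert(revenue_per_request, cost_per_request_value, floor=0.5):
    """True when margin has dropped below the chosen floor."""
    return margin(revenue_per_request, cost_per_request_value) < floor


# The figures from the text: $0.10 revenue and $0.02 cost per request.
print(f"margin: {margin(0.10, 0.02):.0%}")
print("alert:", margin_alert(0.10, 0.02))
```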
Cost Breakdown
Requests: $50
Compute: $80
Database: $120
Storage: $20
Monitoring: $10
Total: $280/month
Cost per DAU (10,000 DAU): $0.028 ($280 / 10,000)
Revenue per DAU: $0.50
ROI: about 18x ($0.50 / $0.028)
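The breakdown can be recomputed in a few lines, which helps catch arithmetic slips as the line items change. This sketch assumes the revenue per DAU covers the same monthly period as the costs, and the function name is illustrative.

```python
monthly_costs = {
    "requests": 50,
    "compute": 80,
    "database": 120,
    "storage": 20,
    "monitoring": 10,
}


def unit_economics(costs, dau, revenue_per_dau):
    """Return (total cost, cost per DAU, revenue-to-cost ratio)."""
    total = sum(costs.values())
    cost_per_dau = total / dau
    return total, cost_per_dau, revenue_per_dau / cost_per_dau


total, cost_per_dau, ratio = unit_economics(monthly_costs, 10_000, 0.50)
print(f"total: ${total}/month")
print(f"cost per DAU: ${cost_per_dau:.4f}")
print(f"revenue / cost: {ratio:.1f}x")
```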
Checklist
Before Launch
- Multi-AZ deployed
- Backups tested and restorable
- Failover tested
- Load tested to 2x expected
- Error budget defined
- Alarms set (not just dashboards)
- Runbooks written
- On-call rotation established
- Cost optimized
- Security reviewed
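"Error budget defined" can be made concrete by expressing the SLO as a number of requests allowed to fail. A minimal sketch, assuming a request-based SLO over a fixed window (the function names and the traffic numbers are illustrative):

```python
def error_budget(slo_pct, total_requests):
    """Number of failed requests the SLO tolerates over the window."""
    return total_requests * (100 - slo_pct) / 100


def budget_remaining(slo_pct, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative once it is blown)."""
    budget = error_budget(slo_pct, total_requests)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return (budget - failed_requests) / budget


# 1,000,000 requests at a 99.9% SLO with 250 failures so far.
print(error_budget(99.9, 1_000_000))
print(budget_remaining(99.9, 1_000_000, 250))
```

A common use is to pause risky deployments when the remaining budget gets low and resume once it recovers.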
Week 1 Production
- Monitor 24/7 (or auto-page)
- Track all alerts
- Fix issues immediately
- Update runbooks based on learnings
Month 1 Production
- Review cost trends
- Identify bottlenecks
- Optimize top issues
- Plan improvements
Real Example: Production System
SLA Target: 99.95% availability
Architecture:
- Lambda (auto-scaling, multi-AZ)
- API Gateway (DDoS protected)
- DynamoDB (on-demand, 3+ AZ replicated)
- Route 53 (geo-routing + failover)
- CloudFront (edge caching)
- CloudWatch + Alarms (24/7 monitoring)
- SNS → PagerDuty (auto page on-call)
Cost: $5,000/month
Revenue: $200,000/month
Result: Highly reliable, highly profitable
Lessons Learned
From operating production serverless systems:
- Logs are everything — Can't debug without them
- Alarms catch issues — Dashboards don't
- Test failover — It will happen
- Calculate TCO — Operations cost == code cost
- Automate everything — Manual ops doesn't scale
- Monitor the business — Not just infrastructure
- Optimize continuously — Performance decays over time
Key Takeaway
Production systems aren't built, they're grown. Start simple, add reliability incrementally, monitor obsessively, and iterate without fear.
Capstone Project: Design a Serverless System for 1M DAU
Requirements:
- Handle 1 million daily active users
- REST API (CRUD operations)
- Real-time notifications
- Multi-region deployment
- 99.95% SLA
- < $0.01 cost per DAU
Deliverables:
- System architecture diagram
- SAM template (production-ready)
- Runbook (incident response)
- Cost analysis
- Load test results
- Disaster recovery plan
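A back-of-the-envelope capacity estimate is a reasonable starting point for the cost analysis and load test. The sketch below turns a DAU count into average and peak requests per second. The 20 requests per user per day and the 3x peak factor are placeholder assumptions to replace with your own measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def capacity_estimate(dau, requests_per_user_per_day, peak_factor):
    """Return (average requests/second, estimated peak requests/second)."""
    daily_requests = dau * requests_per_user_per_day
    average_rps = daily_requests / SECONDS_PER_DAY
    return average_rps, average_rps * peak_factor


# Assumed traffic profile for 1M DAU; adjust both inputs to your product.
average_rps, peak_rps = capacity_estimate(1_000_000, 20, 3)
print(f"average: {average_rps:.0f} rps, peak: {peak_rps:.0f} rps")
```

Size the load test against the peak figure, not the average, and keep the checklist's 2x headroom on top of it.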
This is your capstone. You've completed Level 4. You're ready.