Building Production-Ready Systems

Definition

A production-ready system is one you'd trust to:

Handle millions of requests per day
Survive infrastructure failures
Recover quickly from bugs
Scale without manual intervention
Cost less than the value it creates

Simple Explanation

What it is

Production-ready means a system you can trust with real users, real money, and real downtime risks.

Why we need it

If you launch without reliability and recovery plans, small incidents turn into major outages.

Benefits

Higher uptime with fewer surprises.
Faster recovery when failures happen.
Predictable costs as traffic grows.

Tradeoffs

More engineering work up front.
Stricter processes around deployments and monitoring.

Real-world examples (architecture only)

Canary deploy -> Automatic rollback -> Stable release.
Multi-AZ setup -> Service continues during outage.

The Reliability Hierarchy

Tier 1: Can it handle normal traffic? (Availability)
Tier 2: Can it survive failures? (Resilience)
Tier 3: Can it recover quickly? (Recoverability)
Tier 4: Is it optimized? (Performance)
Tier 5: Is it profitable? (Cost)

Don't build Tier 5 without Tier 1.

Availability (99.9%)

3 nines = 99.9% uptime = at most about 43 minutes of downtime per 30-day month.
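The downtime figure follows directly from the availability target. A minimal sketch of the arithmetic, assuming a 30-day month; the function name `allowed_downtime_minutes` is illustrative and not part of any library:

```python
def allowed_downtime_minutes(availability_percent, days=30):
    """Return the downtime budget in minutes for a period of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Compare a few common targets side by side.
for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/month")
```

Each extra nine cuts the budget by a factor of ten, which is why higher targets tend to require automated failover rather than manual response.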

Multi-AZ Design

LambdaFunction:
  # AWS automatically runs in ≥2 AZs
  # If AZ fails, another AZ takes over
  # Automatic, built-in

DynamoDBTable:
  BillingMode: PAY_PER_REQUEST
  # Automatically replicated across 3+ AZs
  # Automatic, built-in

Health Checks

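HealthCheckAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/Route53
    MetricName: HealthCheckStatus  # 1 = healthy, 0 = unhealthy
    Statistic: Minimum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: LessThanThreshold
    # Add a Dimensions entry (HealthCheckId) for your own Route 53 health check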

If the health check fails, Route 53 can route traffic to a backup region, provided failover routing records are configured.

Resilience (Fault Tolerance)

System continues operating even when parts fail.

Retry Logic

import time

MAX_RETRIES = 3


def backoff(attempt):
  # Exponential backoff: 0.1s, 0.2s, 0.4s, ...
  return (2 ** attempt) * 0.1


def with_retry(fn):
  for attempt in range(MAX_RETRIES):
    try:
      return fn()
    except Exception:
      if attempt == MAX_RETRIES - 1:
        raise  # out of attempts: surface the last error
      time.sleep(backoff(attempt))

Timeouts

import concurrent.futures


def with_timeout(fn, timeout_seconds):
  # Run fn in a worker thread and stop waiting after timeout_seconds.
  executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
  try:
    return executor.submit(fn).result(timeout=timeout_seconds)
  finally:
    # Do not block on a hung call; the worker thread itself keeps running.
    executor.shutdown(wait=False)


result = with_timeout(fetch_from_api, 5)
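For the retry logic in particular, one common refinement is random jitter, so that many clients retrying at once do not hit the service in lockstep. The following is a sketch of that idea, not the helper used above; `retry_with_jitter`, `flaky`, and the injectable `sleep` parameter are illustrative names.

```python
import random
import time


def retry_with_jitter(fn, max_retries=3, base_delay=0.1, sleep=time.sleep):
    """Call fn(), retrying on any exception with jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # "Full jitter": wait a random time up to the exponential cap.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))


calls = {"count": 0}


def flaky():
    """Hypothetical operation that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


# Replace the real sleep with a no-op so the demo finishes instantly.
result = retry_with_jitter(flaky, sleep=lambda seconds: None)
print(result, calls["count"])
```

Passing `sleep` as a parameter is a design choice that keeps the helper easy to test; production callers can leave it at the default.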

Graceful Degradation

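```python
import logging

logger = logging.getLogger(__name__)


def get_recommendations(user_id):
  # ml and cache stand for the application's ML client and cache client
  try:
    return ml.get_recommendations(user_id)
  except Exception:
    logger.warning("ML service down, using cached recommendations")
    return cache.get(f"recommendations:{user_id}")
```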

Recoverability (Mean Time to Recovery)

How quickly can you recover from failures?

Backup Strategy

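DatabaseBackup:
  Type: AWS::Backup::BackupVault
  Properties:
    BackupVaultName: production-backups

BackupPlan:
  Type: AWS::Backup::BackupPlan
  Properties:
    BackupPlan:
      BackupPlanName: daily-database-backups
      BackupPlanRule:
        - RuleName: Daily
          TargetBackupVault: !Ref DatabaseBackup
          ScheduleExpression: 'cron(0 2 * * ? *)'  # 2 AM UTC daily
          StartWindowMinutes: 60
          CompletionWindowMinutes: 120
          Lifecycle:
            DeleteAfterDays: 30  # no move to cold storage

BackupSelection:
  Type: AWS::Backup::BackupSelection
  Properties:
    BackupPlanId: !Ref BackupPlan
    BackupSelection:
      SelectionName: production-database
      IamRoleArn: !GetAtt BackupRole.Arn  # assumes an IAM role named BackupRole is defined elsewhere
      Resources:
        - !GetAtt MyDatabase.DBInstanceArn  # assumes MyDatabase is an AWS::RDS::DBInstance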

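Daily backups cap data loss at roughly 24 hours. The goal is a restore in under 1 hour; confirm it with periodic restore drills, since restore time grows with database size.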

Deployment Verification

Don't deploy blindly:

Canary:
  Type: AWS::Synthetics::Canary
  Properties:
    Name: api-health
    RuntimeVersion: syn-python-selenium-1.0
    Schedule:
      Expression: rate(1 minute)
    StartCanaryAfterCreation: true
    # ExecutionRoleArn and ArtifactS3Location are also required; omitted here
    Code:
      Handler: api_canary.handler
      Script: |
        import requests  # assumes requests is available in the canary runtime

        def handler(event, context):
            response = requests.get("https://api.example.com/items", timeout=5)
            if response.status_code != 200:
                raise RuntimeError("Canary failed with status %s" % response.status_code)

The canary runs every minute, so a deployment that breaks the API shows up within about a minute.

Rollback Plan

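Keep the previous version deployable:

# Current version
sam deploy --stack-name myapp

# If issues are detected, redeploy the last known-good revision.
# (--use-previous-template reuses the stack's current template, so it cannot roll back.)
# <previous-release-tag> is a placeholder for your own git tag.
git checkout <previous-release-tag>
sam build
sam deploy --stack-name myapp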

Rehearse the rollback and time it. The target is under 1 minute, but a full stack redeploy can take longer.
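The decision to roll back can itself be automated. A rough sketch of such a check, assuming you can read request and error counts for the new version from your metrics; the `should_rollback` function and its default thresholds are hypothetical, not an AWS API:

```python
def should_rollback(errors, requests, max_error_rate=0.01, min_requests=100):
    """Return True when the new version's error rate exceeds the allowed rate.

    Waits for `min_requests` before judging, so a single early failure
    does not trigger a rollback.
    """
    if requests < min_requests:
        return False
    return errors / requests > max_error_rate


print(should_rollback(errors=1, requests=50))     # too little traffic to judge
print(should_rollback(errors=5, requests=1000))   # 0.5% error rate
print(should_rollback(errors=30, requests=1000))  # 3% error rate
```

Only the last call returns `True`. The thresholds are placeholders; tune them against your own baseline error rate.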

Performance (P99 < 500ms)

User experience depends on tail latency, not average.

SLOs

Define Service Level Objectives:

API read: P99 < 100ms (95% of time)
API write: P99 < 500ms (99% of time)
Database: P99 < 50ms (99.9% of time)
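To check targets like these against real traffic, compute the percentile from recorded latencies. A minimal sketch using the nearest-rank method; the sample data and the `percentile` helper are made up for illustration:

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [20] * 95 + [80] * 3 + [450, 900]

p99 = percentile(latencies_ms, 99)
average = sum(latencies_ms) / len(latencies_ms)
print(f"average={average:.1f}ms p99={p99}ms meets_100ms_slo={p99 < 100}")
```

With this sample the average is 34.9 ms, which looks healthy, while the P99 is 450 ms and misses a 100 ms target. That gap is why the SLOs are stated as percentiles.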

Continuous Monitoring

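P99AlarmRead:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/Lambda  # assumes the metric is Lambda's Duration
    MetricName: Duration  # milliseconds
    ExtendedStatistic: p99  # percentiles use ExtendedStatistic, not Statistic
    Period: 300
    EvaluationPeriods: 1
    Threshold: 100
    ComparisonOperator: GreaterThanThreshold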

Alert if P99 exceeds SLO.

Cost Efficiency

Profitable or it's not production-ready.

Cost Per Request

Revenue per request: $0.10
Cost per request: $0.02
Profit per request: $0.08 (80% margin)

Monitor this metric obsessively.
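The margin calculation is simple enough to track automatically. A small sketch using the figures above; `request_margin` is an illustrative name:

```python
def request_margin(revenue_per_request, cost_per_request):
    """Return (profit per request, profit as a fraction of revenue)."""
    profit = revenue_per_request - cost_per_request
    return profit, profit / revenue_per_request


profit, fraction = request_margin(0.10, 0.02)
print(f"profit=${profit:.2f} margin={fraction:.0%}")
```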

Cost Breakdown

Requests: $50
Compute: $80
Database: $120
Storage: $20
Monitoring: $10
Total: $280/month

Cost per DAU (10,000 DAU): $0.028/month ($280 / 10,000)
Revenue per DAU: $0.50
ROI: about 18x ($0.50 / $0.028, assuming revenue per DAU is also monthly)
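Keeping the breakdown in code makes it easy to recompute as line items or user counts change. A sketch using the monthly figures from the breakdown; the dictionary keys and the `cost_per_dau` helper are illustrative:

```python
monthly_costs = {
    "requests": 50,
    "compute": 80,
    "database": 120,
    "storage": 20,
    "monitoring": 10,
}


def cost_per_dau(costs, daily_active_users):
    """Total monthly cost divided by the number of daily active users."""
    return sum(costs.values()) / daily_active_users


total = sum(monthly_costs.values())
print(f"total=${total}/month")
print(f"cost per DAU at 10,000 DAU: ${cost_per_dau(monthly_costs, 10_000):.4f}/month")
```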

Checklist

Before Launch

Week 1 Production

Monitor 24/7 (or auto-page)
Track all alerts
Fix issues immediately
Update runbooks based on learnings

Month 1 Production

Review cost trends
Identify bottlenecks
Optimize top issues
Plan improvements

Real Example: Production System

SLA Target: 99.95% availability

Architecture:

Lambda (auto-scaling, multi-AZ)
API Gateway (DDoS protected)
DynamoDB (on-demand, 3+ AZ replicated)
Route 53 (geo-routing + failover)
CloudFront (edge caching)
CloudWatch + Alarms (24/7 monitoring)
SNS → PagerDuty (auto page on-call)

Cost: $5,000/month

Revenue: $200,000/month

Result: Highly reliable, highly profitable

Lessons Learned

From operating production serverless systems:

Logs are everything — Can't debug without them
Alarms catch issues — Dashboards don't
Test failover — It will happen
Calculate TCO — Operations cost == code cost
Automate everything — Manual ops doesn't scale
Monitor the business — Not just infrastructure
Optimize continuously — Performance decays over time

Key Takeaway

Production systems aren't built, they're grown. Start simple, add reliability incrementally, monitor obsessively, and iterate without fear.

Capstone Project: Design a Serverless System for 1M DAU

Requirements:

Handle 1 million daily active users
REST API (CRUD operations)
Real-time notifications
Multi-region deployment
99.95% SLA
< $0.01 cost per DAU
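Before designing anything, it can help to turn these requirements into concrete budgets. A rough sizing sketch, assuming cost per DAU is measured per month; the figure of 50 requests per user per day is an assumption for illustration, not a requirement:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_cost_ceiling(daily_active_users, max_cost_per_dau):
    """Largest monthly bill that still meets the cost-per-DAU target."""
    return daily_active_users * max_cost_per_dau


def average_requests_per_second(daily_active_users, requests_per_user_per_day):
    """Average load over a day; peak load will be higher."""
    return daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY


dau = 1_000_000
print(f"cost ceiling: ${monthly_cost_ceiling(dau, 0.01):,.0f}/month")
# 50 requests per user per day is an assumed figure, not a requirement.
print(f"average load: {average_requests_per_second(dau, 50):.0f} requests/second")
```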

Deliverables:

System architecture diagram
SAM template (production-ready)
Runbook (incident response)
Cost analysis
Load test results
Disaster recovery plan

This is your capstone. You've completed Level 4. You're ready.
