
Building Production-Ready Systems

Definition

A production-ready system is one you'd trust to:

  • Handle millions of requests per day
  • Survive infrastructure failures
  • Recover quickly from bugs
  • Scale without manual intervention
  • Cost less than the value it creates

Simple Explanation

What it is

Production-ready means a system you can trust with real users, real money, and real downtime risks.

Why we need it

If you launch without reliability and recovery plans, small incidents turn into major outages.

Benefits

  • Higher uptime with fewer surprises.
  • Faster recovery when failures happen.
  • Predictable costs as traffic grows.

Tradeoffs

  • More engineering work up front.
  • Stricter processes around deployments and monitoring.

Real-world examples (architecture only)

  • Canary deploy -> Automatic rollback -> Stable release.
  • Multi-AZ setup -> Service continues during outage.

The Reliability Hierarchy

Tier 1: Can it handle normal traffic? (Availability)
Tier 2: Can it survive failures? (Resilience)
Tier 3: Can it recover quickly? (Recoverability)
Tier 4: Is it optimized? (Performance)
Tier 5: Is it profitable? (Cost)

Don't build Tier 5 without Tier 1.

Availability (99.9%)

3 nines = 99.9% uptime = at most 43.2 minutes of downtime in a 30-day month.
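
The downtime figure follows directly from the availability target. A minimal sketch of the arithmetic, assuming a 30-day month; the function name `allowed_downtime_minutes` is illustrative and not part of any library:

```python
def allowed_downtime_minutes(availability_percent, days=30):
    """Return the downtime budget, in minutes, for an availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes per 30 days")
```

Each extra nine cuts the budget by a factor of ten, which is why moving from 99.9% to 99.99% is usually a design change rather than a tuning exercise.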

Multi-AZ Design

LambdaFunction:
  # AWS automatically runs in ≥2 AZs
  # If an AZ fails, another AZ takes over
  # Automatic, built-in

DynamoDBTable:
  BillingMode: PAY_PER_REQUEST
  # Automatically replicated across 3+ AZs
  # Automatic, built-in

Health Checks

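HealthCheckAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/Route53
    MetricName: HealthCheckStatus
    # Add Dimensions (HealthCheckId) for the health check you are watching
    Statistic: Minimum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: LessThanThreshold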

If the health check fails, Route 53 routes traffic to the backup region. This requires failover routing records that reference the health check.

Resilience (Fault Tolerance)

System continues operating even when parts fail.

Retry Logic

import time

MAX_RETRIES = 3


def backoff(attempt):
    # Exponential backoff: 0.1s, 0.2s, 0.4s, ...
    return (2 ** attempt) * 0.1


def with_retry(fn):
    for attempt in range(MAX_RETRIES):
        try:
            return fn()
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff(attempt))

Timeouts

import concurrent.futures


def with_timeout(fn, timeout_seconds):
    # Run fn in a worker thread and stop waiting after timeout_seconds.
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return executor.submit(fn).result(timeout=timeout_seconds)
    finally:
        # wait=False so a still-running call does not block the caller
        executor.shutdown(wait=False)


# fetch_from_api is your own function
result = with_timeout(fetch_from_api, 5)
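
To see how the retry backoff schedule behaves, here is a self-contained sketch that retries a deliberately flaky function. `flaky_call` and the injectable `sleep` parameter are assumptions added for the demonstration, so the delays can be recorded instead of waited out:

```python
import time

MAX_RETRIES = 3


def backoff(attempt):
    # Exponential backoff: 0.1s, 0.2s, 0.4s, ...
    return (2 ** attempt) * 0.1


def with_retry(fn, sleep=time.sleep):
    """Call fn, retrying on any exception for up to MAX_RETRIES attempts."""
    for attempt in range(MAX_RETRIES):
        try:
            return fn()
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff(attempt))


calls = {"count": 0}


def flaky_call():
    # Hypothetical dependency that fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []
result = with_retry(flaky_call, sleep=delays.append)
print(result, delays)
```

In a real service you would likely catch only errors that are safe to retry and add random jitter to the delay, so that many clients do not retry in lockstep.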


Graceful Degradation

# ml, cache, and logger are clients configured elsewhere in the application
def get_recommendations(user_id):
    try:
        return ml.get_recommendations(user_id)
    except Exception:
        logger.warning("ML service down, using cached recommendations")
        return cache.get(f"recommendations:{user_id}")
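
A runnable sketch of the same fallback, with stand-ins so it can be tried locally. `UnavailableMLService`, the dict used as a cache, and `DEFAULT_RECOMMENDATIONS` are all hypothetical; the extra step covers a cache miss, which would otherwise return `None` to the caller:

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_RECOMMENDATIONS = ["bestsellers"]  # hypothetical static fallback


class UnavailableMLService:
    """Stand-in for an ML client whose backend is down."""

    def get_recommendations(self, user_id):
        raise TimeoutError("ML service unavailable")


def get_recommendations(user_id, ml, cache):
    try:
        return ml.get_recommendations(user_id)
    except Exception:
        logger.warning("ML service down, using cached recommendations")
        cached = cache.get(f"recommendations:{user_id}")
        # A cache miss returns None, so fall back once more to a static list.
        return cached if cached is not None else DEFAULT_RECOMMENDATIONS


cache = {"recommendations:42": ["item-1", "item-2"]}  # a dict stands in for the cache
ml = UnavailableMLService()

print(get_recommendations(42, ml, cache))  # served from the cache
print(get_recommendations(7, ml, cache))   # cache miss, static default
```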

Recoverability (Mean Time to Recovery)

How quickly can you recover from failures?

Backup Strategy

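DatabaseBackup:
  Type: AWS::Backup::BackupVault
  Properties:
    BackupVaultName: production-backups

BackupPlan:
  Type: AWS::Backup::BackupPlan
  Properties:
    BackupPlan:
      BackupPlanName: daily-database-backups
      BackupPlanRule:
        - RuleName: Daily
          TargetBackupVault: !Ref DatabaseBackup
          ScheduleExpression: 'cron(0 2 * * ? *)' # 2 AM daily (UTC)
          StartWindowMinutes: 60
          CompletionWindowMinutes: 120
          Lifecycle:
            DeleteAfterDays: 30 # no cold-storage transition

# Resources are assigned to a plan through a separate BackupSelection
BackupSelection:
  Type: AWS::Backup::BackupSelection
  Properties:
    BackupPlanId: !Ref BackupPlan
    BackupSelection:
      SelectionName: production-database
      IamRoleArn: !GetAtt BackupRole.Arn # assumed IAM role defined elsewhere
      Resources:
        # assumes MyDatabase is an AWS::RDS::DBInstance
        - !GetAtt MyDatabase.DBInstanceArn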

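Daily backups limit data loss to at most 24 hours. The target is a restore in under 1 hour; verify that with a timed restore drill, since restore time grows with database size.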

Deployment Verification

Don't deploy blindly:

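Canary:
  Type: AWS::Synthetics::Canary
  Properties:
    Name: api-health
    RuntimeVersion: syn-python-selenium-1.0
    # Assumed: the artifacts bucket and execution role are defined elsewhere
    ArtifactS3Location: !Sub "s3://${CanaryArtifactsBucket}"
    ExecutionRoleArn: !GetAtt CanaryRole.Arn
    Schedule:
      Expression: rate(1 minute)
    StartCanaryAfterCreation: true
    Code:
      Handler: api_canary.handler
      Script: |
        import requests

        def handler(event, context):
            response = requests.get("https://api.example.com/items", timeout=5)
            if response.status_code != 200:
                raise RuntimeError("Canary failed with status %s" % response.status_code)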

The canary runs every minute, so if a deployment breaks the API, you find out within about a minute.

Rollback Plan

Keep previous version deployable:

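# Current version
sam deploy --stack-name myapp

# If issues are detected, redeploy the last known-good revision.
# Assumes releases are tagged in git; <last-good-tag> is a placeholder.
# Note: update-stack --use-previous-template reuses the template that is
# already deployed, so it cannot restore an earlier version.
git checkout <last-good-tag>
sam build
sam deploy --stack-name myapp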

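Aim for a rollback that completes in under a minute. A CloudFormation redeploy can take longer, so time it in a drill before you need it.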

Performance (P99 < 500ms)

User experience depends on tail latency, not average.
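
A small sketch of why the average hides the problem. The latencies are made up, and `percentile` is an illustrative nearest-rank helper rather than a library function:

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [40] * 98 + [2000] * 2

print("mean:", statistics.mean(latencies_ms))
print("p50: ", percentile(latencies_ms, 50))
print("p99: ", percentile(latencies_ms, 99))
```

Here the mean stays under 100 ms while the slowest 2% of requests wait two seconds, and only the P99 shows it.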

SLOs

Define Service Level Objectives:

API read: P99 < 100ms (95% of time)
API write: P99 < 500ms (99% of time)
Database: P99 < 50ms (99.9% of time)
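
One way to read "P99 < 100ms (95% of time)" is: measure P99 per monitoring window and require that enough windows stay under the threshold. A minimal sketch under that interpretation, with made-up readings; `slo_met` is a hypothetical helper:

```python
def slo_met(window_p99s_ms, threshold_ms, required_fraction):
    """True if a large enough share of windows had a P99 below the threshold."""
    good = sum(1 for p99 in window_p99s_ms if p99 < threshold_ms)
    return good / len(window_p99s_ms) >= required_fraction


# Made-up P99 readings in ms, one per 5-minute window.
read_p99s = [80] * 19 + [140]

# 19 of 20 windows are good, which is exactly 95%.
print(slo_met(read_p99s, threshold_ms=100, required_fraction=0.95))

# One more bad window drops the share to 19 of 21, below 95%.
print(slo_met(read_p99s + [150], threshold_ms=100, required_fraction=0.95))
```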

Continuous Monitoring

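P99AlarmRead:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/Lambda # assumes the read path is a Lambda function
    MetricName: Duration
    ExtendedStatistic: p99 # percentiles use ExtendedStatistic, not Statistic
    Period: 300
    EvaluationPeriods: 1
    Threshold: 100 # milliseconds
    ComparisonOperator: GreaterThanThreshold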

Alert if P99 exceeds SLO.

Cost Efficiency

Profitable or it's not production-ready.

Cost Per Request

Revenue per request: $0.10
Cost per request: $0.02
Profit per request: $0.08 (80% margin)

Monitor this metric obsessively.
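
The per-request figures above can be tracked with a few lines of code. A minimal sketch, using the numbers from this section; `unit_economics` is an illustrative name:

```python
def unit_economics(revenue_per_request, cost_per_request):
    """Return (profit per request, margin as a fraction of revenue)."""
    profit = revenue_per_request - cost_per_request
    return profit, profit / revenue_per_request


profit, margin = unit_economics(0.10, 0.02)
print(f"profit per request: ${profit:.2f}, margin: {margin:.0%}")
```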

Cost Breakdown

Requests: $50
Compute: $80
Database: $120
Storage: $20
Monitoring: $10
Total: $280/month

Cost per DAU (10,000 DAU): $0.028/month ($280 / 10,000)
Revenue per DAU: $0.50/month
ROI: ~18x

Checklist

Before Launch

  • Multi-AZ deployed
  • Backups tested and restorable
  • Failover tested
  • Load tested to 2x expected
  • Error budget defined
  • Alarms set (not just dashboards)
  • Runbooks written
  • On-call rotation established
  • Cost optimized
  • Security reviewed
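
For the "Error budget defined" item, the budget can be expressed as the number of failed requests the SLO allows. A minimal sketch, assuming a request-based 99.9% SLO and made-up traffic numbers; `error_budget` is a hypothetical helper:

```python
def error_budget(slo_percent, total_requests, failed_requests):
    """Return (allowed failures, fraction of the budget already consumed)."""
    allowed = total_requests * (1 - slo_percent / 100)
    return allowed, failed_requests / allowed


allowed, consumed = error_budget(
    slo_percent=99.9, total_requests=1_000_000, failed_requests=250
)
print(f"allowed failures: {allowed:.0f}, budget consumed: {consumed:.0%}")
```

When the consumed share approaches 100%, a common policy is to pause risky deployments until the budget recovers.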

Week 1 Production

  • Monitor 24/7 (or auto-page)
  • Track all alerts
  • Fix issues immediately
  • Update runbooks based on learnings

Month 1 Production

  • Review cost trends
  • Identify bottlenecks
  • Optimize top issues
  • Plan improvements

Real Example: Production System

SLA Target: 99.95% availability

Architecture:

  • Lambda (auto-scaling, multi-AZ)
  • API Gateway (DDoS protected)
  • DynamoDB (on-demand, 3+ AZ replicated)
  • Route 53 (geo-routing + failover)
  • CloudFront (edge caching)
  • CloudWatch + Alarms (24/7 monitoring)
  • SNS → PagerDuty (auto page on-call)

Cost: $5,000/month

Revenue: $200,000/month

Result: Highly reliable, highly profitable

Lessons Learned

From operating production serverless systems:

  1. Logs are everything — Can't debug without them
  2. Alarms catch issues — Dashboards don't
  3. Test failover — It will happen
  4. Calculate TCO — Operations cost == code cost
  5. Automate everything — Manual ops doesn't scale
  6. Monitor the business — Not just infrastructure
  7. Optimize continuously — Performance decays over time

Key Takeaway

Production systems aren't built, they're grown. Start simple, add reliability incrementally, monitor obsessively, and iterate without fear.


Capstone Project: Design a Serverless System for 1M DAU

Requirements:

  • Handle 1 million daily active users
  • REST API (CRUD operations)
  • Real-time notifications
  • Multi-region deployment
  • 99.95% SLA
  • < $0.01 cost per DAU

Deliverables:

  • System architecture diagram
  • SAM template (production-ready)
  • Runbook (incident response)
  • Cost analysis
  • Load test results
  • Disaster recovery plan

This is your capstone. You've completed Level 4. You're ready.