
Observability in Serverless

Overview of tracing, metrics, logging, and alerting strategies for serverless applications, with multi-cloud considerations and recommended tooling.


Simple Explanation

What it is

Observability is how you see what your serverless system is doing through logs, metrics, and traces.

Why we need it

When something breaks, you need evidence, not guesses. Observability makes that possible.

Benefits

  • Faster diagnosis of errors and slow requests.
  • Clear visibility across services and regions.
  • Better reliability because problems are found early.

Tradeoffs

  • More setup for dashboards and alerts.
  • Ongoing cost for log storage and tracing.

Real-world examples (architecture only)

  • Trace shows slow database call -> Fix query.
  • Alert triggers on error spike -> Rollback.

[Diagrams: Trace flow, Alert flow]


What This Lesson Covers

  • Logging strategy and structured logs
  • Metrics that matter (latency, errors, saturation)
  • Distributed tracing and correlation IDs
  • Alerting thresholds and SLOs
  • Multi-cloud visibility patterns

Core Concepts

  1. Logs: Details of a single event
  2. Metrics: Aggregated trends over time
  3. Traces: End-to-end request flow
  4. SLOs: Reliability targets
  5. Alerting: Automated notifications when a metric or SLO crosses a threshold

Structured Logging (Python)

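import json
from datetime import datetime, timezone


def log_event(level, message, **data):
    # Emit one JSON object per line so the log platform can index each field.
    print(json.dumps({
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated).
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **data,
    }))


def handler(event, context):
    # aws_request_id is set by the AWS Lambda runtime on the context object.
    log_event("INFO", "Request received", requestId=context.aws_request_id)
    # ...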

Correlation IDs

Add a request ID to every log line so you can trace one request across services.

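def get_request_id(event, context):
    # Prefer an ID set upstream so one value follows the request everywhere;
    # "headers" can be missing or null, so fall back to an empty dict.
    headers = event.get("headers") or {}
    return headers.get("x-request-id") or context.aws_request_id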
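
Once a request ID is known, it has to reach every log line and every downstream call. The sketch below shows one way to do that. The helper names `make_logger` and `outbound_headers` are hypothetical, and the `x-request-id` header is simply the one used in the lookup above, not a required standard.

```python
import json


def make_logger(request_id):
    """Return a log function that stamps every line with the same request ID."""
    def log(level, message, **data):
        line = {"level": level, "message": message,
                "requestId": request_id, **data}
        print(json.dumps(line))
        return line
    return log


def outbound_headers(request_id):
    """Headers to attach to downstream calls so the next service reuses the ID."""
    return {"x-request-id": request_id}
```

Inside a handler, you would build the logger once per invocation and pass `outbound_headers(request_id)` to whatever HTTP client calls the next service.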

Metrics That Matter

  • Error rate (percentage of failed requests)
  • Latency (p50, p95, p99)
  • Throttles (requests rejected by concurrency or rate limits)
  • Cost drivers (duration, memory, external calls)
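
To make the first two metrics concrete, here is a minimal sketch that computes an error rate and latency percentiles from raw samples. It assumes the nearest-rank percentile method; monitoring platforms often interpolate or estimate from histograms, so their numbers may differ slightly. The function names are illustrative.

```python
import math


def error_rate(outcomes):
    """Percentage of failed requests; outcomes is a list of bools (True = success)."""
    if not outcomes:
        return 0.0
    failures = sum(1 for ok in outcomes if not ok)
    return 100.0 * failures / len(outcomes)


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

With only ten latency samples, p95 and p99 both land on the slowest request, which is why tail percentiles need a reasonably large window to be meaningful.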

Tracing Strategy

  • Sample a percentage of requests
  • Always trace errors
  • Add annotations (userId, region, tier)
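
The three rules above can be sketched as a small sampling decision. This assumes the decision is made after the request finishes, since an error is only known at that point. The 5% default rate and the function names are assumptions for illustration, not values from any particular tracing product.

```python
import random


def should_trace(is_error, sample_rate=0.05, rng=random.random):
    """Always keep traces for errors; otherwise keep roughly sample_rate of requests."""
    if is_error:
        return True
    return rng() < sample_rate


def build_annotations(user_id, region, tier):
    """Key/value annotations to attach to a trace so it can be filtered later."""
    return {"userId": user_id, "region": region, "tier": tier}
```

Passing `rng` as a parameter keeps the function deterministic in tests while still using `random.random` in production.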

Multi-Cloud Observability

Patterns:

  • Standardize log format (JSON)
  • Use consistent metric names
  • Mirror dashboards across clouds
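
One way to apply the first two patterns is a thin normalization layer that every function calls, whichever cloud it runs on. In the sketch below, the cloud labels (`cloud_a`, `cloud_b`) and every metric name are hypothetical placeholders; substitute the names your providers actually emit.

```python
import json

# Hypothetical mapping from provider-specific metric names to one shared name.
METRIC_ALIASES = {
    "cloud_a": {"Duration": "function.duration_ms"},
    "cloud_b": {"exec_time": "function.duration_ms"},
}


def normalize_metric(cloud, name):
    """Translate a provider metric name to the shared name; pass unknown names through."""
    return METRIC_ALIASES.get(cloud, {}).get(name, name)


def normalize_log(cloud, level, message, **fields):
    """Render a log record in one JSON shape regardless of which cloud produced it."""
    record = {"cloud": cloud, "level": level.upper(), "message": message, **fields}
    return json.dumps(record, sort_keys=True)
```

Because both clouds report under the same metric name and log shape, a single dashboard definition can be mirrored across them with only the data source changed.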

Alerting & SLOs

Define reliability targets and alert on breaches:

  • API read: p99 < 100ms
  • API error rate: < 1%
  • Queue delay: < 2 minutes
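
The targets above can be written down as data and checked against observed values. This is a minimal sketch: the key names are hypothetical, and a real alerting system would normally evaluate breaches over a time window rather than on a single reading.

```python
# The three targets from the list above; each observed value must stay below its limit.
SLOS = {
    "api_read_p99_ms": 100,     # p99 < 100 ms
    "api_error_rate_pct": 1.0,  # error rate < 1%
    "queue_delay_s": 120,       # queue delay < 2 minutes
}


def find_breaches(observed, slos=SLOS):
    """Return the sorted names of SLOs whose observed value is at or above its limit."""
    return sorted(
        name for name, limit in slos.items()
        if name in observed and observed[name] >= limit
    )
```

An empty result means every reported metric is within target; any returned name is a candidate for an alert.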

Project

Design an observability plan for a serverless API.

Deliverables:

  • List the logs you will capture
  • Define 3 key metrics and alert thresholds
  • Describe how you will trace a request end-to-end

Email your work to maarifaarchitect@gmail.com.

